MalwareLogAnalysis

Malware log analysis based on Branch data.

A related presentation is here.

Branch data refers to the records produced at branch instructions such as jmp and call. This data is useful because it shows the structure of a binary regardless of polymorphism.

BranchLogPreprocess.ipynb preprocesses the raw branch logs into regularized data. MalwareLogAnalysis.ipynb then classifies the branch data as malware or normal software.

Analysis : Processing utils for log analysis.
Cuckoo : Customization of the Cuckoo sandbox to automate branch logging of malware.
Data : API List and ML dataset.
Log : Raw logs of the branch data.
BranchTracer : Branch tracer based on VEH for logging branch data.

Analysis

Processing utils for log analysis.

maldb.py : Create database for the branch logs.
preproc.py : Log regularizer for feeding ML model.

Malware database

Column	Comment
id	ID
name	Malware name based on VirusTotal

Branch database

Column	Comment
id	ID
malware_id	Malware ID
order	Order of the branch data generated from the same malware
src_addr	Source Address
dst_addr	Destination Address
dll	Destination DLL Space (Nullable)
symbol	Destination Symbol Data (Nullable)
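
The two tables above could be created as sketched below. This is a minimal sketch assuming SQLite through Python's standard sqlite3 module; the table names malware and branch, the column types, and the create_tables helper are assumptions for illustration and may differ from what maldb.py actually does. Because order is an SQL keyword, the column name is quoted.

```python
import sqlite3

# Hypothetical schema: table names and column types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS malware (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL              -- malware name based on VirusTotal
);
CREATE TABLE IF NOT EXISTS branch (
    id         INTEGER PRIMARY KEY,
    malware_id INTEGER NOT NULL REFERENCES malware(id),
    "order"    INTEGER NOT NULL,    -- position within one malware's log
    src_addr   INTEGER NOT NULL,
    dst_addr   INTEGER NOT NULL,
    dll        TEXT,                -- nullable
    symbol     TEXT                 -- nullable
);
"""


def create_tables(conn):
    """Create both tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_tables(conn)
    conn.execute("INSERT INTO malware (id, name) VALUES (1, 'sample')")
    conn.execute(
        'INSERT INTO branch (malware_id, "order", src_addr, dst_addr, dll, symbol) '
        "VALUES (?, ?, ?, ?, ?, ?)",
        (1, 0, 0x401000, 0x402000, None, None),
    )
    print(conn.execute("SELECT COUNT(*) FROM branch").fetchone()[0])
```

Leaving dll and symbol nullable lets branches that land outside any known DLL or exported symbol still be stored in sequence.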

Preprocessing

Extract the API symbol data and map each symbol to its index in function_list.txt.

For example, calc's preprocessed branch data is [0,186,0,143,187,0,292]. The first column is the is_malware flag: 1 for malware, 0 for normal software. The following columns are the one-based indices of the APIs in function_list.txt. An index of 0 means that function_list.txt does not contain that API symbol.
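
The encoding described above can be sketched as follows. This is a minimal illustration, not the code in preproc.py: the helper names load_function_index and encode_branches are hypothetical, the API list is a toy stand-in for function_list.txt, and the input is assumed to be the sequence of non-null symbols from one program's branch log.

```python
def load_function_index(lines):
    """Map each API name to its one-based position in the list."""
    names = [line.strip() for line in lines if line.strip()]
    return {name: i for i, name in enumerate(names, start=1)}


def encode_branches(symbols, index, is_malware):
    """Return [flag, idx1, idx2, ...]; unknown symbols are encoded as 0."""
    flag = 1 if is_malware else 0
    return [flag] + [index.get(sym, 0) for sym in symbols]


if __name__ == "__main__":
    # Toy API list standing in for function_list.txt (hypothetical contents).
    index = load_function_index(["CreateFileW", "ReadFile", "CloseHandle"])
    row = encode_branches(["ReadFile", "NotListed", "CloseHandle"], index, False)
    print(row)  # [0, 2, 0, 3]
```

Reserving 0 for unknown symbols keeps every sequence position, so the order of calls is preserved even when an API is missing from the list.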

Cuckoo

Customization of the Cuckoo sandbox to automate branch logging of malware.

The structure of the Cuckoo Sandbox is, schematically, as follows. When a malware sample is submitted to the Cuckoo sandbox, the scheduler receives it and either sends it to a VM or puts it into the queue. The agent.py on the VM receives the sample and starts analyzer.py to analyze it. The analyzer then produces a report and sends it to the scheduler.

I customized analyzer.py to run the branch tracer. Helper injects the Brancher DLL into the target process, and Brancher logs the branch data.

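# Choose the 32-bit or 64-bit helper executable.
if is32bit:
    self.target = 'C:\\dbg\\Helper32.exe'
else:
    self.target = 'C:\\dbg\\Helper64.exe'

try:
    # Popen comes from the subprocess module; this launches the helper.
    proc = Popen(self.target)
    pids = proc.pid
except Exception as e:
    log.error('custom : fail to open process %s : %s', self.target, e)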

After running the software, analyzer.py preprocesses the log and writes it to the debug log of the Cuckoo sandbox.

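# Wait briefly before reading the tracer's log file.
time.sleep(3)
with open('C:\\dbg\\log.txt') as f_log:
    raw = f_log.read()

# Strip NUL characters from the raw log, then write it to Cuckoo's debug log.
data = raw.replace('\x00', '')
log.debug('logged : \n%s', data)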

./cuckoo/filter.py parses the branch data out of the Cuckoo log and writes a new log file containing only the software's branch data.

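# Read the Cuckoo log for this analysis.
with open(analysis_path % filt) as f:
    log = f.read()

if '+' not in log:
    # No branch lines were logged, so discard this directory.
    shutil.rmtree('./' + filt)
else:
    # Keep only the lines that contain '+', which mark branch records.
    liner = log.replace('\r', '').split('\n')
    branch = [line for line in liner if '+' in line]
    data = '\n'.join(branch)

    with open(log_path % filt, 'w') as branch_log:
        branch_log.write(data + '\n')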

Data

The top 1000 Windows APIs called by malware samples.

I collected branch data from 470 malware samples and 40 normal software samples. This is a highly imbalanced classification problem, so I split the dataset into mal_trainset.csv (450) and norm_trainset.csv (20), kept separate from testset.csv (20+20). Each training batch then combines 20 malware samples with the whole normal training set.
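
The batching scheme above can be sketched as a small generator. This is an illustration under assumptions rather than the notebook's actual code: the function name balanced_batches, the seeded shuffling, and the representation of each sample as an arbitrary row object are all hypothetical.

```python
import random


def balanced_batches(mal_rows, norm_rows, mal_per_batch=20, seed=0):
    """Yield batches of `mal_per_batch` malware rows plus every normal row."""
    rng = random.Random(seed)
    shuffled = list(mal_rows)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), mal_per_batch):
        batch = shuffled[start:start + mal_per_batch] + list(norm_rows)
        rng.shuffle(batch)  # mix the two classes inside the batch
        yield batch


if __name__ == "__main__":
    # Placeholder rows: (is_malware, sample_number).
    mal = [(1, i) for i in range(450)]
    norm = [(0, i) for i in range(20)]
    batches = list(balanced_batches(mal, norm))
    print(len(batches), len(batches[0]))
```

With 450 malware rows and 20 per batch, the final batch holds only the 10 remaining malware rows alongside the 20 normal rows, so it is slightly skewed toward the normal class.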

Branch Tracer

A branch tracer based on a Vectored Exception Handler (VEH). Here is my repo.