Malware log analysis based on Branch data.
A related presentation is available here.
Branch data is the data recorded at each branch instruction, such as jmp and call. It is useful for exposing the structure of a binary regardless of polymorphism.
BranchLogPreprocess.ipynb preprocesses the raw branch logs into regularized data. MalwareLogAnalysis.ipynb then classifies the branch data as malware or normal software.
- Analysis : Processing utils for log analysis.
- Cuckoo : Customization of Cuckoo sandbox for automation of the malware branching.
- Data : API List and ML dataset.
- Log : Raw logs of the branch data.
- BranchTracer : Branch tracer based on VEH for logging branch data.
Processing utils for log analysis.
- maldb.py : Create database for the branch logs.
- preproc.py : Log regularizer for feeding ML model.
Malware database
id | name |
---|---|
ID | Malware Name based on VirusTotal |
Branch database
Column | Comment |
---|---|
id | ID |
malware_id | Malware ID |
order | Order of the branch data generated from the same malware |
src_addr | Source Address |
dst_addr | Destination Address |
dll | Destination DLL Space (Nullable) |
symbol | Destination Symbol Data (Nullable) |
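The two tables above could be created roughly as follows. This is a minimal sketch, not the code in maldb.py: the use of SQLite, the column types, and the helper names `create_db` and `symbols_in_order` are assumptions made for illustration. Note that `order` is a reserved word in SQL and has to be quoted.

```python
import sqlite3

# Assumption: SQLite and these column types are illustrative only; the
# source does not state which database engine or types maldb.py uses.
SCHEMA = """
CREATE TABLE malware (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL            -- malware name based on VirusTotal
);
CREATE TABLE branch (
    id         INTEGER PRIMARY KEY,
    malware_id INTEGER NOT NULL REFERENCES malware(id),
    "order"    INTEGER NOT NULL,  -- position within one sample's trace
    src_addr   INTEGER NOT NULL,
    dst_addr   INTEGER NOT NULL,
    dll        TEXT,              -- nullable
    symbol     TEXT               -- nullable
);
"""


def create_db(path=":memory:"):
    """Open a database and create both tables (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def symbols_in_order(conn, malware_id):
    """Return the destination symbols of one sample in trace order."""
    rows = conn.execute(
        'SELECT symbol FROM branch WHERE malware_id = ? ORDER BY "order"',
        (malware_id,),
    )
    return [symbol for (symbol,) in rows]
```

Rows whose `symbol` is NULL come back as `None`, so a caller can decide whether to drop them before preprocessing.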
Preprocessing
Extract the API symbol data from the branch data and map each symbol to its index in function_list.txt.
For example, the preprocessed branch data for calc is [0,186,0,143,187,0,292]. The first column is the is_malware flag: 1 for malware, 0 for normal software. The remaining columns are one-based indices of the APIs in function_list.txt. An index of 0 means that function_list.txt does not contain that API symbol.
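The mapping described above can be sketched in a few lines. This is an illustration rather than the code in preproc.py: the function names are hypothetical, function_list.txt is assumed to hold one API name per line, and branches without symbol data are assumed to be dropped.

```python
def load_function_index(lines):
    """Map each API name to its one-based position in the list.

    Assumption: one API name per line, blank lines ignored.
    """
    names = [line.strip() for line in lines if line.strip()]
    return {name: i for i, name in enumerate(names, start=1)}


def encode_trace(symbols, index, is_malware):
    """Return [is_malware, idx1, idx2, ...] for one sample."""
    row = [1 if is_malware else 0]
    for symbol in symbols:
        if symbol is None:
            # Assumption: branches with no symbol data are skipped.
            continue
        # 0 marks an API symbol that is not in the function list.
        row.append(index.get(symbol, 0))
    return row
```

In practice `load_function_index` would be called with the open function_list.txt file object, and `encode_trace` with the symbols of one sample in trace order.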
Customization of Cuckoo sandbox for automation of the malware branching.
It is a schematic representation of the structure of the Cuckoo sandbox. When a malware sample is submitted to the Cuckoo sandbox, the scheduler receives it and either sends it to a VM or puts it into the queue. The agent.py on the VM receives the sample and starts analyzer.py to analyze it. The analyzer produces a report and sends it back to the scheduler.
I customized analyzer.py to run the branch tracer. Helper injects the Brancher DLL into the target process, and Brancher logs the branch data.
if is32bit:
    self.target = 'C:\\dbg\\Helper32.exe'
else:
    self.target = 'C:\\dbg\\Helper64.exe'
try:
    proc = Popen(self.target)
    pids = proc.pid
except Exception as e:
    log.error('custom : fail to open process %s : %s', self.target, e)
After running the software, analyzer.py preprocesses the log and writes it to the debug log of the Cuckoo sandbox.
time.sleep(3)
with open('C:\\dbg\\log.txt') as f_log:
    raw = f_log.read()
data = ''.join(raw.split('\x00'))
log.debug('logged : \n%s', data)
./cuckoo/filter.py parses the branch data and writes the software's branch log to a new log file.
with open(analysis_path % filt) as f:
    log = f.read()
if '+' not in log:
    shutil.rmtree('./' + filt)
else:
    liner = log.replace('\r', '').split('\n')
    branch = filter(lambda x: '+' in x, liner)
    data = '\n'.join(branch)
    with open(log_path % filt, 'w') as branch_log:
        branch_log.write(data + '\n')
Top 1000 Windows APIs called by malware samples.
I collected branch data from 470 malware samples and 40 normal programs. Because this is a highly imbalanced classification problem, I split the training data into two files, mal_trainset.csv (450 malware samples) and norm_trainset.csv (20 normal samples), and held out testset.csv (20 + 20 samples) for testing. Each training batch then combines 20 samples from the malware training set with the whole normal training set.
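The batching scheme above can be sketched as a small generator. This is an illustration under assumptions, not the notebook's code: the name `balanced_batches` is hypothetical, and the source does not say whether malware samples are drawn with or without replacement, so this version walks through the shuffled malware set once per epoch.

```python
import random


def balanced_batches(malware_rows, normal_rows, malware_per_batch=20, seed=0):
    """Yield batches of `malware_per_batch` malware rows plus all normal rows.

    Assumption: one pass over the shuffled malware rows per epoch; the last
    batch may hold fewer malware rows if the count is not a multiple of 20.
    """
    rng = random.Random(seed)
    shuffled = list(malware_rows)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), malware_per_batch):
        batch = shuffled[start:start + malware_per_batch] + list(normal_rows)
        rng.shuffle(batch)  # mix the two classes inside the batch
        yield batch
```

With 450 malware rows and 20 normal rows, every full batch holds 40 rows split evenly between the two classes, so the classifier never sees the 450:20 imbalance directly.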
Branch Tracer based on Vectored Exception Handler. Here is my repo.
This is a classification problem between malware and normal software. The dataset is the preprocessed branch data.
First, each symbol index is embedded into a 1024-dimensional vector. The sequence is then passed to an LSTM recurrent unit with 1024 hidden units, followed by fully-connected layers (512, 256, 2) and a softmax over the result.
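Written out, the architecture above corresponds roughly to the following forward pass. Several details are assumptions, since the source does not state them: $V$ is the number of distinct symbol indices (including 0 for unknown APIs), $\phi$ stands for an unspecified hidden activation, and the classifier is assumed to read the final hidden state $h_T$ of the sequence $x_1, \dots, x_T$.

```latex
\begin{aligned}
e_t &= E_{x_t}, & E &\in \mathbb{R}^{V \times 1024} \\
(h_t, c_t) &= \mathrm{LSTM}(e_t, h_{t-1}, c_{t-1}), & h_t, c_t &\in \mathbb{R}^{1024} \\
z_1 &= \phi(W_1 h_T + b_1), & W_1 &\in \mathbb{R}^{512 \times 1024} \\
z_2 &= \phi(W_2 z_1 + b_2), & W_2 &\in \mathbb{R}^{256 \times 512} \\
\hat{y} &= \mathrm{softmax}(W_3 z_2 + b_3), & W_3 &\in \mathbb{R}^{2 \times 256}
\end{aligned}
```

The two components of $\hat{y}$ are the predicted probabilities for normal software and malware.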
I was able to get a classifier with 92% accuracy.
Unlike existing detection methods, this approach uses branch data to detect malware. Branch data represents the structure of a binary at a small scale and its behavior at a larger scale, so it can capture the defining property of malware: that it performs malicious acts.
Theoretically, if I can extract the branch data, this approach should be able to detect most malware.