nababora/MalwareLogAnalysis

MalwareLogAnalysis

Malware log analysis based on Branch data.

A related presentation is here.

Branch data refers to the data recorded at branch instructions such as jmp and call. It is useful for showing the structure of a binary regardless of polymorphism.

BranchLogPreprocess.ipynb preprocesses the raw branch logs into regularized data. MalwareLogAnalysis.ipynb then classifies the branch data as malware or normal software.

  • Analysis : Processing utils for log analysis.
  • Cuckoo : Customization of the Cuckoo sandbox to automate branch logging of malware.
  • Data : API List and ML dataset.
  • Log : Raw logs of the branch data.
  • BranchTracer : Branch tracer based on VEH for logging branch data.

Analysis

Processing utils for log analysis.

  • maldb.py : Create database for the branch logs.
  • preproc.py : Log regularizer for feeding ML model.

Malware database

Column Comment
id ID
name Malware name based on VirusTotal

Branch database

Column Comment
id ID
malware_id Malware ID
order Order of the branch data generated from the same malware
src_addr Source Address
dst_addr Destination Address
dll Destination DLL Space (Nullable)
symbol Destination Symbol Data (Nullable)
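
The two tables above can be sketched as a schema. This is only an illustration: the README does not say which database engine maldb.py uses, so SQLite, the column types, and the sample row are all assumptions. Note that `order` is a reserved word in SQL and has to be quoted.

```python
import sqlite3


def create_schema(conn):
    """Create the malware and branch tables (column types are assumed)."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS malware (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL              -- malware name based on VirusTotal
        );
        CREATE TABLE IF NOT EXISTS branch (
            id         INTEGER PRIMARY KEY,
            malware_id INTEGER NOT NULL REFERENCES malware(id),
            "order"    INTEGER NOT NULL,    -- position within one sample's log
            src_addr   INTEGER NOT NULL,
            dst_addr   INTEGER NOT NULL,
            dll        TEXT,                -- nullable
            symbol     TEXT                 -- nullable
        );
    """)


conn = sqlite3.connect(":memory:")
create_schema(conn)

# Hypothetical sample row, for illustration only.
conn.execute("INSERT INTO malware (id, name) VALUES (1, 'example-sample')")
conn.execute(
    'INSERT INTO branch (malware_id, "order", src_addr, dst_addr, dll, symbol) '
    "VALUES (?, ?, ?, ?, ?, ?)",
    (1, 0, 0x401000, 0x7FFE0000, "kernel32.dll", "CreateFileW"),
)

# Read one sample's symbols back in branch order.
rows = conn.execute(
    'SELECT symbol FROM branch WHERE malware_id = ? ORDER BY "order"', (1,)
).fetchall()
```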

Preprocessing

Extract the API symbol data and map each symbol to its index in function_list.txt.

For example, calc's preprocessed branch data is [0,186,0,143,187,0,292]. The first column is the is_malware flag: 1 for malware, 0 for normal software. The following columns are the one-based indices of the APIs in function_list.txt. An index of 0 means that function_list.txt does not contain that API symbol.
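
A minimal sketch of this encoding, not the actual preproc.py. The function names and the three-entry API list are hypothetical, and skipping branches that have no symbol is an assumption based on the description above.

```python
def load_function_index(names):
    """Map each API name to its one-based position in the list."""
    return {name: i for i, name in enumerate(names, start=1)}


def encode_branch_log(symbols, function_index, is_malware):
    """Build one row: the is_malware flag followed by API indices."""
    row = [1 if is_malware else 0]
    for symbol in symbols:
        if symbol is None:
            # Assumption: branches without a symbol are dropped.
            continue
        # Symbols missing from the list are encoded as 0.
        row.append(function_index.get(symbol, 0))
    return row


# Hypothetical stand-in for the contents of function_list.txt.
function_index = load_function_index(["CreateFileW", "ReadFile", "CloseHandle"])

symbols = ["CreateFileW", None, "UnknownApi", "CloseHandle"]
row = encode_branch_log(symbols, function_index, is_malware=False)
# row == [0, 1, 0, 3]
```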

Cuckoo

Customization of the Cuckoo sandbox to automate branch logging of malware.

Schematically, the Cuckoo Sandbox works as follows. When a malware sample is submitted, the scheduler receives it and either sends it to a VM or puts it in the queue. The agent.py on the VM receives the sample and starts analyzer.py to analyze it. The analyzer produces a report and sends it back to the scheduler.

I customized analyzer.py to run the branch tracer. Helper injects the Brancher DLL into the target process, and Brancher logs the branch data.

if is32bit:
    self.target = 'C:\\dbg\\Helper32.exe'
else:
    self.target = 'C:\\dbg\\Helper64.exe'

try:
    proc = Popen(self.target)
    pids = proc.pid
except Exception as e:
    log.error('custom : fail to open process %s : %s', self.target, e)

After running the software, analyzer.py preprocesses the log and writes it to the debug log of the Cuckoo sandbox.

time.sleep(3)
with open('C:\\dbg\\log.txt') as f_log:
    raw = f_log.read()
    data = ''.join(raw.split('\x00'))
    log.debug('logged : \n%s', data)

./cuckoo/filter.py parses the debug log, keeps only the branch data, and writes it to a new log file for that software.

with open(analysis_path % filt) as f:
    log = f.read()

if '+' not in log:
    shutil.rmtree('./' + filt)
else:
    liner = log.replace('\r', '').split('\n')
    branch = filter(lambda x: '+' in x, liner)
    data = '\n'.join(branch)

    with open(log_path % filt, 'w') as branch_log:
        branch_log.write(data + '\n')

Data

The top 1000 Windows APIs called by malware samples.

I collected branch data from 470 malware samples and 40 normal programs. This is a highly imbalanced classification problem, so I built the training sets separately, mal_trainset.csv (450) and norm_trainset.csv (20), and held out testset.csv (20+20). Each training batch combines 20 rows from the malware training set with the whole normal training set.
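
The batching scheme can be sketched as a small generator. This is an illustration, not the notebook's code: the function name, the shuffling step, and the toy rows are assumptions.

```python
import random


def make_batches(malware_rows, normal_rows, malware_per_batch=20, seed=0):
    """Yield batches of `malware_per_batch` malware rows plus all normal rows."""
    rng = random.Random(seed)
    shuffled = list(malware_rows)
    rng.shuffle(shuffled)  # assumption: malware rows are shuffled once up front
    for start in range(0, len(shuffled), malware_per_batch):
        chunk = shuffled[start:start + malware_per_batch]
        # Every batch reuses the whole normal training set.
        yield chunk + list(normal_rows)


# Toy rows shaped like the preprocessed data: [is_malware, api_index, ...].
malware_rows = [[1, i] for i in range(450)]
normal_rows = [[0, i] for i in range(20)]

batches = list(make_batches(malware_rows, normal_rows))
```

With 450 malware rows this yields 23 batches. The first 22 are balanced at 20 malware and 20 normal rows, and the last one holds the remaining 10 malware rows plus the 20 normal rows.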

Branch Tracer

Branch Tracer based on Vectored Exception Handler. Here is my repo.

MalwareLogAnalysis

A binary classification problem between malware and normal software. The dataset is the preprocessed branch data.

First, project each symbol index to a 1024-dimensional vector. Then pass the sequence to an LSTM recurrent unit with 1024 hidden units. Connect its output to fully connected layers (512, 256, 2) and apply softmax to the result.

I was able to get a classifier with 92% accuracy.

Benefit

Unlike existing detection methods, this approach uses branch data to detect malware. Branch data represents the structure of a binary at a small scale and its behavior at a larger scale, so it fits the definition of malware as software that performs malicious acts.

Theoretically, if the branch data can be extracted, this approach should be able to detect most malware.
