Repository for the paper "Stealing Malware Classifiers and Antivirus at Low False Positive Conditions".
To generate a surrogate model, specify the target model, the surrogate type, the sampling method, and the dataset. The help output below lists the allowed values for each parameter.
python model_extraction.py -h
usage: Model extraction using active learning techniques [-h] -d DATA_DIR [-s SEED] [-m METHOD] [-n NUM_QUERIES] [-b BUDGET]
[-e NUM_EPOCHS] [-t {DNN,dualDNN,LGB,SVM}] [-l LOG_DIR]
[-tg {ember,sorel-FCNN,sorel-LGB,AV1,AV2,AV3,AV4}]
[-f {top10families,Adload,WannaCry,Pykse,Azorult,Bancteian,Emotet,Swisyn,Vobfus}]
[--dataset {ember,sorel,AV}] [--fpr FPR]
optional arguments:
-h, --help show this help message and exit
-d DATA_DIR, --data_dir DATA_DIR
Directory that holds the data
-s SEED, --seed SEED Seed for random states
-m METHOD, --method METHOD
entropy, random, medoids, mc_dropout, k-center, ensemble
-n NUM_QUERIES, --num_queries NUM_QUERIES
Number of query rounds
-b BUDGET, --budget BUDGET
Total query budget
-e NUM_EPOCHS, --num_epochs NUM_EPOCHS
Number of training epochs per round
-t {DNN,dualDNN,LGB,SVM}, --type {DNN,dualDNN,LGB,SVM}
Type of surrogate model
-l LOG_DIR, --log_dir LOG_DIR
Where to store the log files with the results
-tg {ember,sorel-FCNN,sorel-LGB,AV1,AV2,AV3,AV4}, --target_model {ember,sorel-FCNN,sorel-LGB,AV1,AV2,AV3,AV4}
Target model
-f {top10families,Adload,WannaCry,Pykse,Azorult,Bancteian,Emotet,Swisyn,Vobfus}, --family {top10families,Adload,WannaCry,Pykse,Azorult,Bancteian,Emotet,Swisyn,Vobfus}
Select top10 families or one specific malware family
--dataset {ember,sorel,AV}
Thief and test dataset
--fpr FPR FPR level for surrogate metrics.
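For readers who want to script around the tool, the help output above can be approximated with a small argparse definition. This is only a sketch reconstructed from the help text, not the code in model_extraction.py: the function name build_parser, the argument types (int for seed, rounds, budget and epochs; float for FPR) and the absence of defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the CLI from its help text."""
    parser = argparse.ArgumentParser(
        prog="Model extraction using active learning techniques"
    )
    # The usage line shows -d without brackets, so it is treated as required.
    parser.add_argument("-d", "--data_dir", required=True,
                        help="Directory that holds the data")
    # Types below are assumptions; the help text does not state them.
    parser.add_argument("-s", "--seed", type=int,
                        help="Seed for random states")
    parser.add_argument("-m", "--method",
                        help="entropy, random, medoids, mc_dropout, "
                             "k-center, ensemble")
    parser.add_argument("-n", "--num_queries", type=int,
                        help="Number of query rounds")
    parser.add_argument("-b", "--budget", type=int,
                        help="Total query budget")
    parser.add_argument("-e", "--num_epochs", type=int,
                        help="Number of training epochs per round")
    parser.add_argument("-t", "--type",
                        choices=["DNN", "dualDNN", "LGB", "SVM"],
                        help="Type of surrogate model")
    parser.add_argument("-l", "--log_dir",
                        help="Where to store the log files with the results")
    parser.add_argument("-tg", "--target_model",
                        choices=["ember", "sorel-FCNN", "sorel-LGB",
                                 "AV1", "AV2", "AV3", "AV4"],
                        help="Target model")
    parser.add_argument("-f", "--family",
                        choices=["top10families", "Adload", "WannaCry",
                                 "Pykse", "Azorult", "Bancteian", "Emotet",
                                 "Swisyn", "Vobfus"],
                        help="Select top10 families or one specific "
                             "malware family")
    parser.add_argument("--dataset", choices=["ember", "sorel", "AV"],
                        help="Thief and test dataset")
    parser.add_argument("--fpr", type=float,
                        help="FPR level for surrogate metrics")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Because the surrogate type, target model, family and dataset are declared with choices, a value outside the listed sets is rejected at parse time. The sampling method is shown in the help text as a free-form METHOD, so this sketch does not validate it.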
The following command creates a LightGBM surrogate model and stores it in the output folder (/tmp/logs), along with a log file containing the results for each iteration.
python model_extraction.py --data_dir /data/mari/sorel-data --dataset sorel --seed 42 --method medoids --type LGB --target_model sorel-FCNN --num_epochs 1 --num_queries 10 --log_dir "/tmp/logs/" --budget 2500 --fpr 0.006
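The --fpr 0.006 value in the command above sets the false positive rate at which the surrogate metrics are reported. As an illustration of what evaluating at a fixed FPR can mean, the sketch below picks a score threshold from benign samples so that at most the requested fraction is flagged, then measures the detection rate on malicious samples. The function names and the "score > threshold means malicious" convention are assumptions made for this example; the repository's own metric code may differ.

```python
import math
from typing import Sequence


def threshold_at_fpr(benign_scores: Sequence[float], fpr: float) -> float:
    """Return a threshold such that at most `fpr` of benign scores exceed it."""
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must be in [0, 1]")
    ordered = sorted(benign_scores)
    if not ordered:
        raise ValueError("benign_scores must not be empty")
    # Number of benign samples we may misclassify as malicious.
    allowed = math.floor(fpr * len(ordered))
    if allowed >= len(ordered):
        return -math.inf  # every benign sample may be flagged
    # Scores strictly above this value are flagged, so at most
    # `allowed` benign samples end up as false positives.
    return ordered[len(ordered) - allowed - 1]


def detection_rate_at_fpr(benign_scores: Sequence[float],
                          malicious_scores: Sequence[float],
                          fpr: float) -> float:
    """Fraction of malicious samples flagged at the fixed-FPR threshold."""
    if not malicious_scores:
        raise ValueError("malicious_scores must not be empty")
    threshold = threshold_at_fpr(benign_scores, fpr)
    detected = sum(score > threshold for score in malicious_scores)
    return detected / len(malicious_scores)


if __name__ == "__main__":
    # Synthetic scores, for illustration only.
    benign = [i / 1000 for i in range(1000)]
    malicious = [0.4 + 0.6 * i / 999 for i in range(1000)]
    print(threshold_at_fpr(benign, 0.006))
    print(detection_rate_at_fpr(benign, malicious, 0.006))
```

At low FPR levels the threshold is determined by only a handful of benign samples (about six per thousand at 0.006), so the benign set needs to be large for the resulting detection rate to be stable.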
If you use this code, please cite:
@article{RIGAKI2023103192,
title = {Stealing and evading malware classifiers and antivirus at low false positive conditions},
journal = {Computers \& Security},
volume = {129},
pages = {103192},
year = {2023},
issn = {0167-4048},
doi = {https://doi.org/10.1016/j.cose.2023.103192},
url = {https://www.sciencedirect.com/science/article/pii/S0167404823001025},
author = {M. Rigaki and S. Garcia},
}