Welcome to the official implementation of the paper Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning. This work introduces two innovative dataset pruning techniques: Label Mapping (LM) and Feature Mapping (FM), leveraging source-target domain mapping.
You can install the necessary Python packages with:
pip install -r requirements.txt
Note that, to accelerate model training, this code repository is built on top of FFCV; please refer to its official website for installation instructions. Our argument system is built with fastargs, for which we provide a revised version here. The latest fastargs is installed automatically by the command above.
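For reference, below is a minimal sketch of how a fastargs-based script declares parameters and accepts both a JSON config file and command-line overrides, which is the --config-file / --section.param pattern used by the commands in the rest of this README. The section and parameter names here are illustrative, not the repository's actual ones.

```python
from argparse import ArgumentParser
from fastargs import Section, Param, get_current_config
from fastargs.decorators import param

# Illustrative section and parameter; the repository defines its own.
Section('network', 'model settings').params(
    architecture=Param(str, 'backbone architecture', default='resnet18'),
)

@param('network.architecture')
def build_model(architecture):
    # fastargs injects the resolved value (config file, then CLI override) here.
    print(f'Using architecture: {architecture}')

if __name__ == '__main__':
    config = get_current_config()
    parser = ArgumentParser(description='fastargs demo')
    config.augment_argparse(parser)       # exposes --config-file plus one flag per declared parameter
    config.collect_argparse_args(parser)  # merges the JSON config with CLI overrides
    config.validate(mode='stderr')
    build_model()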
We studied 9 commonly used transfer learning datasets and use FFCV to accelerate data loading and preprocessing. For most datasets, we provide the preprocessed data (.beton files) in this link. Please download the data and place it in the data folder. Datasets that are not provided are downloaded automatically by PyTorch.
For Flowers102, DTD, UCF101, Food101, EuroSAT, OxfordPets, StanfordCars, and SUN397, we use the dataset split configuration from CoOp. For the other datasets, we use the official splits provided by PyTorch.
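As a quick sanity check that a downloaded .beton file is readable, here is a minimal FFCV loading sketch. The field names ('image', 'label') and the decoding pipeline depend on how the .beton file was written, so treat them as assumptions; the repository's own loaders (in src/data) apply additional transforms.

```python
from ffcv.loader import Loader, OrderOption
from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.transforms import ToTensor, ToTorchImage, Squeeze

# Minimal pipeline; assumes fixed-resolution RGB images and integer labels.
loader = Loader(
    'data/oxfordpets/ffcv/train_400_10_90.beton',  # a .beton path from this section
    batch_size=128,
    num_workers=8,
    order=OrderOption.RANDOM,
    pipelines={
        'image': [SimpleRGBImageDecoder(), ToTensor(), ToTorchImage()],
        'label': [IntDecoder(), ToTensor(), Squeeze()],
    },
)

for images, labels in loader:
    pass  # feed the batch to your training step
```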
The source code is organized as follows:
- configs: contains the default parameters for each dataset
- src: contains the source code for the proposed methods
  - algorithm: contains the mathematical algorithms used for our method or the baselines
  - auxiliary: contains the executable files to generate the intermediate results, e.g., the pruned data, the image features, etc.
  - data: contains the data loader for each dataset
  - experiments: contains the main executable files to run the experiments
  - tools: contains the tools and utilities for the experiments
- arguments: contains the data arguments
In this section, we provide the instructions to reproduce the results in our paper.
We first pretrain the surrogate model (ResNet-18) on ImageNet using the following command:
python src/experiment/imagenet_train_from_scratch.py --config-file configs/imagenet_train_from_scratch/rn18_16.json
You can change the type of the surrogate model via the --network.architecture argument.
We then prune the source dataset by 10% to 90% with a step size of 10% using LM with the following command:
python src/auxiliary/lm_selection_for_imagenet.py --cfg.data_path PATH_TO_DOWNSTREAM_TRAINING_DATA --cfg.source_train_label_path PATH_TO_IMAGENET_TRAINING_LABEL --cfg.source_val_label_path PATH_TO_IMAGENET_VALIDATION_LABEL --cfg.architecture resnet18 --cfg.pretrained_ckpt PATH_TO_PRETRAINED_CKPT --cfg.retain_class_nums 900,800,700,600,500,400,300,200,100 --cfg.write_path files/class_selection/oxfordpets
Please note that the first parameter is the path to the training data (.beton file) of the target dataset. The second and third parameters are the paths to the generated label index files for each ImageNet sample; these are downloaded automatically together with the ImageNet .beton files (see the dataset section above).
You can also generate your own label index files using the src/auxiliary/get_label_and_indices.py file.
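Conceptually, LM selects the ImageNet classes that the surrogate model most often predicts for the target training data and prunes the rest. The sketch below illustrates that idea under our own assumptions about the interface; lm_selection_for_imagenet.py is the authoritative implementation, and retain_k corresponds to one of the --cfg.retain_class_nums values.

```python
import torch

@torch.no_grad()
def lm_class_selection(surrogate, target_loader, num_source_classes=1000, retain_k=400, device='cuda'):
    """Illustrative LM-style selection: rank source classes by how often the
    surrogate predicts them on target data, then keep the top-k classes."""
    surrogate.eval().to(device)
    votes = torch.zeros(num_source_classes, dtype=torch.long)
    for images, _ in target_loader:                   # target labels are not needed
        preds = surrogate(images.to(device)).argmax(dim=1).cpu()
        votes += torch.bincount(preds, minlength=num_source_classes)
    retained = torch.topk(votes, k=retain_k).indices  # e.g., retain_k in {100, ..., 900}
    return retained                                   # ImageNet samples outside these classes are pruned
```

The retained class IDs are then translated into per-sample indices, which is what the ImageNet label index files above are used for.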
We can also prune the source dataset using FM. Unlike LM, we first need to extract the features of each data sample of both the source and target datasets with the surrogate model. Below we provide an example of how to generate the features of the source dataset.
python src/auxiliary/feature_gen.py --cfg.data_path PATH_TO_IMAGENET_TRAINING_DATA --cfg.dataset imagenet --cfg.architecture resnet18 --cfg.pretrained_ckpt PATH_TO_PRETRAINED_CKPT --cfg.write_path PATH_TO_FEATURES
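For intuition, the sketch below shows one way to extract per-sample features with the surrogate model (penultimate, pre-fc representations); the exact layer and checkpoint format used by feature_gen.py are assumptions here.

```python
import torch
import torchvision.models as models

@torch.no_grad()
def extract_features(ckpt_path, data_loader, device='cuda'):
    """Illustrative feature extraction with a ResNet-18 surrogate."""
    surrogate = models.resnet18(weights=None)
    state = torch.load(ckpt_path, map_location='cpu')
    surrogate.load_state_dict(state, strict=False)  # assumes a plain state_dict checkpoint
    surrogate.fc = torch.nn.Identity()              # expose the 512-d pooled features
    surrogate.eval().to(device)
    feats = [surrogate(images.to(device)).cpu() for images, _ in data_loader]
    return torch.cat(feats)                         # the real script writes this to --cfg.write_path
```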
Next, with the features of the source and target dataset, we can prune the source dataset using FM with the following command:
python src/auxiliary/fm_selection_for_imagenet.py --dataset.src_train_fx_path PATH_TO_SOURCE_TRAINING_FEATURES --dataset.tgt_train_fx_path PATH_TO_TARGET_TRAINING_FEATURES --dataset.src_train_id_path PATH_TO_SOURCE_DATA_CLUSTER_MAPPING --dataset.src_val_id_path PATH_TO_TARGET_DATA_CLUSTER_MAPPING
Note that the first two parameters are generated by the src/auxiliary/feature_gen.py file. The last two parameters are the clustering results, which indicate which cluster each data sample belongs to.
We then pretrain the large model on the pruned source dataset obtained by either LM or FM. We use the same file to pretrain this model as the one used to pretrain the surrogate model. The only differences are that we need to specify --dataset.prune 1 to indicate that the source dataset is pruned, and to pass the selected training and testing data indices via --dataset.indices.training and --dataset.indices.testing. Below we provide an example of how to pretrain ResNet-101 with the pruned source dataset obtained by LM.
python src/experiment/imagenet_train_from_scratch.py --config-file configs/imagenet_train_from_scratch/rn18_16.json --dataset.prune 1 --dataset.indices.training files/class_selection/oxfordpets_flm_train_top${cls_num}.indices --dataset.indices.testing files/class_selection/oxfordpets_flm_val_top${cls_num}.indices
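How the .indices files are consumed is internal to the training script, but conceptually the pruned pretraining set can be obtained by restricting the FFCV loader to the selected sample indices. The sketch below assumes the files hold an array of integer sample indices and uses cls_num=400 (one of the --cfg.retain_class_nums values) purely for illustration.

```python
import numpy as np
from ffcv.loader import Loader, OrderOption

# Assumption: the .indices file stores an array of retained ImageNet sample indices.
kept = np.load('files/class_selection/oxfordpets_flm_train_top400.indices', allow_pickle=True)

train_loader = Loader(
    'PATH_TO_IMAGENET_TRAINING_DATA',          # same .beton used for full pretraining
    batch_size=256,
    num_workers=8,
    order=OrderOption.RANDOM,
    indices=np.asarray(kept, dtype=np.int64),  # only the retained samples are loaded
)
```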
We then finetune the pretrained model on the target dataset. Below we provide an example of how to finetune the pretrained ResNet-101 on OxfordPets.
python src/experiment/imagenet_transfer_to_downstream.py --config-file configs/imagenet_transfer_to_downstream/oxfordpets_rn101_ff.json --dataset.train_path ./data/oxfordpets/ffcv/train_400_10_90.beton --dataset.test_path ./data/oxfordpets/ffcv/test_400_10_90.beton --network.pretrained_ckpt PATH_TO_PRETRAINED_CKPT --exp.identifier oxfordpets_rn101_ff
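For readers who want to see what this step amounts to, here is a conceptual PyTorch sketch (not the repository's code): load the checkpoint pretrained on the pruned source data, replace the classification head with one sized for the target dataset, and train on the target data. The checkpoint format and the hyperparameters are assumptions; Oxford-IIIT Pets has 37 classes.

```python
import torch
import torchvision.models as models

# Conceptual fine-tuning sketch for ResNet-101 on OxfordPets.
model = models.resnet101(weights=None)
state = torch.load('PATH_TO_PRETRAINED_CKPT', map_location='cpu')    # assumes a plain state_dict
state = {k: v for k, v in state.items() if not k.startswith('fc.')}  # drop the source classifier head
model.load_state_dict(state, strict=False)

model.fc = torch.nn.Linear(model.fc.in_features, 37)                 # 37 target classes (Oxford-IIIT Pets)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)  # illustrative
# ... then run a standard training loop over the target-dataset FFCV loaders.
```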
We also provide the option to train the model from scratch on the target dataset. Below we provide an example of how to train the ResNet-101 from scratch on OxfordPets.
python src/experiment/downstream_train_from_scratch.py --config-file configs/downstream_train_from_scratch/oxfordpets_rn101.json --dataset.train_path ../data/oxfordpets/ffcv/train_400_10_90.beton --dataset.test_path ../data/oxfordpets/ffcv/test_400_10_90.beton