This repository contains the code for the molecular experiments of the paper "How to distill task-agnostic representations from many teachers?".
To train a model with our distillation framework, we first need to dump the embeddings of the teachers (see the Data preparation section below for details).
For the teachers used in our project, the embeddings are computed and saved automatically upon training. For other teachers, the user should compute and dump the embeddings so that the file "<model_name>.npy" ("<model_name>_<file_index>.npy" for multi-file datasets) exists in the data folder.
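For reference, here is a minimal sketch of such a dump. The model name, array sizes, and output path below are illustrative only; only the file-naming convention above is prescribed, and the embeddings should presumably follow the same ordering as the molecules of the processed dataset.

import numpy as np

# Hypothetical example: "my_teacher" stands in for your own model.
# Replace the random array with the actual embeddings, computed in the
# same order as the molecules of the processed dataset.
n_molecules, embedding_dim = 1000, 256  # illustrative sizes
embeddings = np.random.randn(n_molecules, embedding_dim).astype(np.float32)

# Single-file dataset: the array must be saved as "<model_name>.npy"
# inside the data folder.
np.save("path/to/data/my_teacher.npy", embeddings)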
The training procedure can be launched using the train_gm.py script:
python molDistill/train_gm.py \
--dataset <dataset_name> \
--data-path <path_to_dataset> \
--num-epochs <num_epochs> \
--embedders-to-simulate <list_of_teachers> \
--gnn-type <gnn_type> \
--knifes-config <knifes_config> \
...
For a complete list of arguments, please refer to the train_gm.py script. The "knifes-config" argument should be the path to a YAML file containing the arguments of the KNIFE estimators (see knifes.yaml for an example).
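For instance, a run could look as follows. The dataset name, teacher names, and hyperparameter values below are illustrative only, and we assume here that the list of teachers is passed as space-separated names:

python molDistill/train_gm.py \
--dataset ZINC \
--data-path data \
--num-epochs 100 \
--embedders-to-simulate GraphMVP GROVER ChemBERTa \
--gnn-type gin \
--knifes-config knifes.yaml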
Similarly, L2 and cosine distillation can be performed using the train_l2.py and train_cos.py scripts, respectively.
To evaluate the representations learned by the models on downstream tasks, we train an MLP on top of the learned representations using the downstream_eval.py script:
python molDistill/downstream_eval.py \
--datasets <dataset_names_list> \
--data-path <path_to_datasets> \
--embedders <list_of_models> \
--hidden-dim <MLP_hidden_dim> \
--n-layers <MLP_n_layers> \
--n-epochs <num_epochs> \
--n-runs <number_of_runs> \
--test \
--save-results \
...

Pass the --test flag to evaluate the model on the test set, and --save-results to save the results. For a complete list of arguments, please refer to the downstream_eval.py script.
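As an illustration, an evaluation run could look as follows. The dataset names, embedder names, and hyperparameter values are placeholders, and we again assume that lists are passed as space-separated names:

python molDistill/downstream_eval.py \
--datasets hERG BBBP \
--data-path data \
--embedders my_distilled_model GraphMVP \
--hidden-dim 256 \
--n-layers 2 \
--n-epochs 100 \
--n-runs 5 \
--test \
--save-results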
The results of each model on the different downstream tasks are available in the downstream_eval folder. All figures from the paper can be reproduced using the notebooks in the molDistill/notebooks folder.
To evaluate a new model, add the path to its downstream-evaluation results to the 'MODELS_TO_EVAL' variable in the 'get_all_results' function.
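Concretely, the change amounts to something like the following sketch. The paths and the exact structure of the variable are illustrative; check the 'get_all_results' function in the notebooks for the actual format.

# Hypothetical sketch: paths are illustrative only.
MODELS_TO_EVAL = [
    "downstream_eval/existing_model",  # results already in the repository
    "downstream_eval/my_new_model",    # add the path to your model's results here
]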
All datasets were pre-processed following the same procedure. Two options are available to process the data, depending on the size of the dataset:
- For small datasets, the data can be processed using the process_tdc_data.py script:
python molDistill/process_tdc_data.py --dataset <dataset_name> --data-path <path_to_dataset>
- For large datasets, the data can be processed using the process_tdc_data_multifiles.py script (see the example after this list):
python molDistill/process_tdc_data_multifiles.py --dataset <dataset_name> --data-path <path_to_dataset> --i0 <initial_index_to_process> --step <datapoints_per_file>
Using the second script, the dataset will be split into multiple files, each containing 'step' datapoints.
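For example, assuming that --i0 denotes the index of the first datapoint of each chunk, a large dataset could be processed in chunks of 100,000 datapoints as follows (dataset name and sizes are illustrative):

# Process four chunks of 100,000 datapoints each.
for i0 in 0 100000 200000 300000; do
    python molDistill/process_tdc_data_multifiles.py \
        --dataset ZINC \
        --data-path data \
        --i0 $i0 \
        --step 100000
done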