ills-montreal/mol-distill


This repository contains the code for the molecular experiments of the paper "How to distill task-agnostic representations from many teachers?".

⚗️ Distill representations

To train a model using our distillation framework, we first need to dump the embeddings of the teachers (details on the data preparation can be found in the Data processing section below).

For the teachers used in our project, the embeddings are computed and saved automatically during training. For other teachers, the embeddings must be computed and dumped by the user so that the data folder contains the file "<model_name>.npy" ("<model_name>_<file_index>.npy" for multi-file datasets).
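For reference, a minimal NumPy sketch of such a dump (the teacher function, embedding dimension, molecules, and paths below are hypothetical placeholders, not part of the repository):

import numpy as np

# Hypothetical teacher: any model mapping molecules (e.g. SMILES strings)
# to fixed-size vectors of shape [n_molecules, embedding_dim].
def embed_with_my_teacher(smiles_list):
    return np.zeros((len(smiles_list), 256), dtype=np.float32)  # placeholder

smiles = ["CCO", "c1ccccc1", "CC(=O)O"]  # dataset molecules, in dataset order
embeddings = embed_with_my_teacher(smiles)

# Single-file dataset: save as <model_name>.npy in the data folder.
np.save("path/to/data/my_teacher.npy", embeddings)

# Multi-file dataset: save one array per chunk as <model_name>_<file_index>.npy,
# e.g. np.save(f"path/to/data/my_teacher_{file_index}.npy", chunk_embeddings)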

The training procedure can be launched using the train_gm.py script:

python molDistill/train_gm.py \
  --dataset <dataset_name> \
  --data-path <path_to_dataset> \
  --num-epochs <num_epochs> \
  --embedders-to-simulate <list_of_teachers> \
  --gnn-type <gnn_type> \
  --knifes-config <knifes_config> \
  ...

For a complete list of arguments, please refer to the train_gm.py script. The "knifes-config" argument should be a path to a YAML file containing the arguments of the KNIFE estimators (see knifes.yaml for an example).

Similarly, L2 and Cosine distillations can be performed using the train_l2.py and train_cos.py scripts, respectively.

🧪 Downstream evaluation

To evaluate the representations learned by the models on downstream tasks, we train an MLP on top of the learned representations using the downstream_eval.py script:

python molDistill/downstream_eval.py \
  --datasets <dataset_names_list> \
  --data-path <path_to_datasets> \
  --embedders <list_of_models> \
  --hidden-dim <MLP_hidden_dim> \
  --n-layers <MLP_n_layers> \
  --n-epochs <num_epochs> \
  --n-runs <number_of_runs> \
  --test \
  --save-results \
  ...

The --test flag evaluates the model on the test set, and --save-results saves the results. For a complete list of arguments, please refer to the downstream_eval.py script.
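Schematically, this probing step amounts to training a small MLP on frozen embeddings. A minimal PyTorch sketch (the dimensions, optimizer, loss, and data below are illustrative, not the script's exact defaults):

import torch
from torch import nn

# Small MLP trained on top of frozen molecular embeddings.
class MLPProbe(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_layers, out_dim):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(n_layers):
            layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
            dim = hidden_dim
        layers.append(nn.Linear(dim, out_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# embeddings: vectors produced by a distilled model; labels: downstream targets
# (a binary classification task is used here purely for illustration).
embeddings = torch.randn(128, 256)
labels = torch.randint(0, 2, (128,)).float()

probe = MLPProbe(in_dim=256, hidden_dim=128, n_layers=2, out_dim=1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(10):  # a few epochs, for illustration only
    optimizer.zero_grad()
    loss = loss_fn(probe(embeddings).squeeze(-1), labels)
    loss.backward()
    optimizer.step()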

🖼️ Paper's figures

The results of each model on the different downstream tasks are available in the downstream_eval folder. All figures in the paper can be reproduced using the notebooks in the molDistill/notebooks folder.

To evaluate a new model, add the path to its downstream-evaluation results to the 'MODELS_TO_EVAL' variable in the 'get_all_results' function.
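A hypothetical sketch of that change (the exact entry format is defined in the notebooks; match the existing entries):

# In the notebook, inside get_all_results: register the path to the new
# model's downstream-evaluation results so it is included in the figures.
MODELS_TO_EVAL = [
    ...,  # existing entries
    "downstream_eval/my_new_model",  # hypothetical path to the new results
]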

📟 Data processing

All datasets were pre-processed following the same procedure. Two options are available to process the data, depending on the size of the dataset:

python molDistill/process_tdc_data.py --dataset <dataset_name> --data-path <path_to_dataset>
python molDistill/process_tdc_data_multifiles.py --dataset <dataset_name> --data-path <path_to_dataset> --i0 <initial_index_to_process> --step <datapoints_per_files>

Using the second script, the dataset is split into multiple files, each containing 'step' datapoints.
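A simplified sketch of the resulting split (how the actual script distributes the work across invocations may differ; 'i0' is the index of the first datapoint to process and 'step' the number of datapoints per file):

# Datapoints are grouped into files of 'step' points, starting at index 'i0';
# datapoints [start, start + step) end up in file index start // step.
def chunk_indices(n_datapoints, i0, step):
    for start in range(i0, n_datapoints, step):
        end = min(start + step, n_datapoints)
        yield start // step, range(start, end)

for file_index, indices in chunk_indices(n_datapoints=10, i0=0, step=4):
    print(file_index, list(indices))
# -> 0 [0, 1, 2, 3]
# -> 1 [4, 5, 6, 7]
# -> 2 [8, 9]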
