Skip to content

Analysis of the effect of sequence padding on the performance of a hierarchical EC number prediction task.

Notifications You must be signed in to change notification settings

b2slab/padding_benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

77 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction

In this study we aim to analyse the effect that padding the amino acid sequences has on the performance of deep learning models. Specifically, we have chosen a hierarchical EC number prediction task to carry out the analysis. We use three different architectures (only_denses, 1_conv, stack_conv) to check if they affect the padding effect.

The data used in this study consists on all the reviewed protein sequences of the taxonomy Archaea from Uniprot version 2019_05.

The project is almost entirely coded in Python 3.6.7. R version 3.4.4 has been used for building the explanatory linear models (notebooks 07 and 09).

The findings of this analysis are explained in the following article: Lopez-del Rio, A., Martin, M., Perera-Lluna, A. et al. Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Sci Rep 10, 14634 (2020). https://doi.org/10.1038/s41598-020-71450-8.

Structure

  • Functions are defined in the src/ folder (.py and .R files)
  • The workflow of the analysis is applied through Jupyter Notebooks in the notebooks/ folder. The files are preceded by a number that indicates the chronological order of their execution.
  • raw_data/ folder contains the CSV file with reviewed Uniprot entries for Archaea.
  • data/ is the folder where intermediate and final results will be stored when running the notebooks.

Workflow (notebooks)

1. Data preprocessing

  • 00_creating_data.ipynb
  • 01_pre_statistics.ipynb

2. Models training and results processing

  • 02_task1_training.ipynb
  • 03_task1_comparison.ipynb
  • 04_task2_training.ipynb
  • 05_task2_comparison.ipynb

3. Performance metrics and graphical representation

  • 06_comparing_architectures.ipynb
  • 07_linear_models_metrics.ipynb
  • 08_activations_sequences_pca.ipynb
  • 09_linear_models_pcs.ipynb

System requirements

The runs have been executed on the following hardware from the B2SLab (Universitat Politecnica de Catalunya):

  • tob: 8 threads, 32GB RAM, NVIDIA TITAN Xp GeForce GTX 1070

  • lapsus: 12 threads, 32GB RAM, 2 x NVIDIA GeForce GTX 1070

Considerations

  • The absPath variable at the beginning of each notebook and src file should be changed once the repository is cloned to the correct path in each case.
  • Notebooks 02-04 are implemented to analyse only one architecture each time. The block of variables/parameters in the second cell corresponding to the studied architecture should be uncommented (and the rest, commented) prior to running the notebook.
  • In src/Target, all the different types of padding tested in this study are implemented.
  • Execution without GPU of notebooks 02-05 and 08 may require considerable time and it is not recommended.

About

Analysis of the effect of sequence padding on the performance of a hierarchical EC number prediction task.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages