SPMap uses the information carried by the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are then clustered to obtain a representative feature space mapping. The mapping is defined as the distribution of the subsequences of a protein sequence over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information from important subregions that are conserved across a family of proteins while avoiding the difficult task of explicit motif identification.
Fig. 1. SPMap flow diagram. (A) Subsequence profile map construction: subsequences of the proteins in the positive training set are clustered to construct the subsequence profile map. (B) Classification: the constructed profile map is used to find the feature space representation of the protein sequence to be classified.

SPMap is a sequence-based feature extraction tool that relies on subsequence profiles obtained from training data (in FASTA format).
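
The two stages in Fig. 1 can be illustrated with a minimal Python sketch. It is not the code in runSPMap.py: the function names, the greedy clustering rule, the positional-identity similarity, and the `threshold` value are all assumptions chosen for illustration. Only the fixed subsequence length, the minimum sequence length, and the idea of a distribution over clusters come from the description above.

```python
def subsequences(seq, k=5):
    """Return all overlapping length-k windows of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_clusters(train_seqs, k=5, min_len=20, threshold=0.8):
    """Stage A (assumed rule): greedy clustering of training subsequences.

    A subsequence that is at least `threshold` identical to an existing
    cluster representative is absorbed by it; otherwise it starts a new
    cluster. Sequences shorter than `min_len` are skipped.
    """
    reps = []
    for seq in train_seqs:
        if len(seq) < min_len:
            continue
        for sub in subsequences(seq, k):
            if not any(identity(sub, rep) >= threshold for rep in reps):
                reps.append(sub)
    return reps


def feature_vector(seq, reps, k=5):
    """Stage B: distribution of a sequence's subsequences over the clusters."""
    subs = subsequences(seq, k)
    if not reps or not subs:
        return [0.0] * len(reps)
    counts = [0] * len(reps)
    for sub in subs:
        # Assign each subsequence to its most similar cluster representative.
        best = max(range(len(reps)), key=lambda j: identity(sub, reps[j]))
        counts[best] += 1
    return [c / len(subs) for c in counts]


if __name__ == "__main__":
    training = ["ACDEFGHIKLMNPQRSTVWY"]  # toy positive training set
    clusters = build_clusters(training)
    print(len(clusters), feature_vector("ACDEFGHIKLMNPQRSTVWY", clusters)[:3])
```

The resulting vector has one entry per cluster, so every protein is mapped to a fixed-length representation regardless of its length, which is what allows a discriminative classifier to be trained on it.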
- Users should first form a profile from the respective training dataset.
- Users can then extract protein features using the profile(s).
python runSPMap.py --generateProfile True --path 'input_folder' --fastaFile_P CYT_pos.fasta --minSeqLen 20 --subSeqLen 5 --fastaFile_O CYT_golden_positive.fasta
Table 1: SPMap tool's arguments
Arguments | Description | Values |
---|---|---|
generateProfile | Whether to generate the profile file; set to True when the profile needs to be generated | True or False |
path | path to fasta file directory | default: "input_folder" |
fastaFile_P | fasta file name to construct profiles (fasta file of training data) | default: "CYT_pos.fasta" |
minSeqLen | protein sequences shorter than this value will not be considered | default: 20 |
profileFile | profile file name to be generated | default: "CYT_pos_profile.txt" |
subSeqLen | the length of subsequences | default: 5 |
fastaFile_O | fasta file name whose features will be extracted | default: "CYT_golden_positive.fasta" |
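
When SPMap is called from another script, the arguments in Table 1 can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of SPMap. It assumes that every argument in Table 1 is accepted as a `--name value` flag in the same way as in the example command, including `--profileFile`, which the example command does not show.

```python
import subprocess


def build_spmap_command(generate_profile,
                        path="input_folder",
                        fasta_p="CYT_pos.fasta",
                        min_seq_len=20,
                        profile_file="CYT_pos_profile.txt",
                        sub_seq_len=5,
                        fasta_o="CYT_golden_positive.fasta"):
    """Build the argument list for runSPMap.py (hypothetical helper).

    Default values are the ones listed in Table 1.
    """
    return [
        "python", "runSPMap.py",
        "--generateProfile", str(bool(generate_profile)),  # "True" or "False"
        "--path", path,
        "--fastaFile_P", fasta_p,
        "--minSeqLen", str(min_seq_len),
        "--profileFile", profile_file,  # assumed to be accepted as a flag
        "--subSeqLen", str(sub_seq_len),
        "--fastaFile_O", fasta_o,
    ]


def run_spmap(**kwargs):
    """Run SPMap in the current directory; raises if the tool exits with an error."""
    subprocess.run(build_spmap_command(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without SPMap installed.
    print(" ".join(build_spmap_command(True)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when directory or file names contain spaces.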
- Özsarı, G., Rifaioglu, A. S., Atakan, A., Doğan, T., Martin, M. J., Çetin Atalay, R., & Atalay, V. (2022). SLPred: a multi-view subcellular localization prediction tool for multi-location human proteins. Bioinformatics.
- Rifaioglu, A. S., Doğan, T., Martin, M. J., Cetin-Atalay, R., & Atalay, V. (2019). DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks. Scientific reports, 9(1), 1-16.
- Dalkiran, A., Rifaioglu, A. S., Martin, M. J., Cetin-Atalay, R., Atalay, V., & Doğan, T. (2018). ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC bioinformatics, 19(1), 1-13.
- Rifaioglu, A. S., Doğan, T., Saraç, Ö. S., Ersahin, T., Saidi, R., Atalay, M. V., ... & Cetin‐Atalay, R. (2018). Large‐scale automated function prediction of protein sequences and an experimental case study validation on PTEN transcript variants. Proteins: Structure, Function, and Bioinformatics, 86(2), 135-151.