SPMap: Subsequence-based feature map for protein function classification

SPMap uses the information carried by the subsequences of a protein. A group of protein sequences that belong to the same level of classification is decomposed into fixed-length subsequences, which are then clustered to obtain a representative feature space mapping. The mapping of a protein sequence is defined as the distribution of its subsequences over these clusters. The resulting feature space representation is used to train discriminative classifiers for functional families. The aim of this approach is to incorporate information from important subregions that are conserved across a family of proteins while avoiding the difficult task of explicit motif identification.

Fig. 1. SPMap flow diagram. (A) Subsequence profile map construction: subsequences of the proteins in the positive training set are clustered to construct the subsequence profile map. (B) Classification: the constructed profile map is used to find the feature space representation of the protein sequence to be classified.
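The two ideas described above (fixed-length subsequences, and a feature vector given by their distribution over clusters) can be illustrated with a toy sketch. This is not the code in this repository: the function names, the position-match similarity, and the use of a single representative string per cluster are simplifying assumptions made only for illustration. The published method builds its clusters and profiles differently (the repository ships a BLOSUM62 matrix, for example).

```python
from collections import Counter


def subsequences(sequence, length=5):
    """Return all overlapping fixed-length windows of a sequence."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def match_count(a, b):
    """Toy similarity (an assumption): positions where two strings agree."""
    return sum(x == y for x, y in zip(a, b))


def feature_vector(sequence, representatives, length=5):
    """Distribution of a sequence's subsequences over cluster representatives."""
    windows = subsequences(sequence, length)
    counts = Counter()
    for window in windows:
        # Assign the window to the most similar representative (first wins ties).
        best = max(range(len(representatives)),
                   key=lambda k: match_count(window, representatives[k]))
        counts[best] += 1
    total = len(windows)
    # Normalise counts so the vector sums to 1 for any non-empty input.
    return [counts[k] / total if total else 0.0
            for k in range(len(representatives))]


if __name__ == "__main__":
    # Hypothetical cluster representatives, one per cluster.
    reps = ["MKTAY", "GGGGG"]
    print(feature_vector("MKTAYGGGGG", reps))
```

Each protein is mapped to a vector with one entry per cluster, so sequences of different lengths end up in the same fixed-dimensional feature space.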

SPMap tool:

SPMap is a sequence-based feature extraction tool that uses subsequence profiles obtained from training data (in FASTA format).

Users should first form a profile from the respective training dataset.
Users can then extract protein features using the profile(s).

How to use:

 python runSPMap.py --generateProfile True --path 'input_folder' --fastaFile_P CYT_pos.fasta --minSeqLen 20 --subSeqLen 5 --fastaFile_O CYT_golden_positive.fasta
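The command above reads FASTA files and skips sequences shorter than `--minSeqLen`. As a rough illustration of that preprocessing step, the sketch below parses FASTA-formatted lines and drops short sequences. The function name `read_fasta` and its signature are hypothetical; the repository's own parsing code may work differently.

```python
def read_fasta(lines, min_seq_len=20):
    """Parse FASTA lines into {header: sequence}, dropping short sequences."""
    records = {}
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    # Keep only sequences with at least min_seq_len residues.
    return {h: s for h, s in records.items() if len(s) >= min_seq_len}


if __name__ == "__main__":
    example = [">short", "MKT", ">long", "MKTAYIAKQR", "QISFVKSHFS"]
    print(read_fasta(example, min_seq_len=20))
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.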

Table 1: SPMap tool's arguments

Argument	Description	Values
generateProfile	whether to generate the profile file; set to True if it needs to be generated	True or False
path	path to the directory containing the FASTA files	default: "input_folder"
fastaFile_P	name of the FASTA file used to construct profiles (the training data)	default: "CYT_pos.fasta"
minSeqLen	minimum sequence length; shorter protein sequences are not considered	default: 20
profileFile	name of the profile file to be generated	default: "CYT_pos_profile.txt"
subSeqLen	length of the subsequences	default: 5
fastaFile_O	name of the FASTA file whose features will be extracted	default: "CYT_golden_positive.fasta"
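For readers who want to wrap or script the tool, the arguments in Table 1 could be declared with `argparse` roughly as follows. This is a sketch, not the contents of `runSPMap.py`: the helper `str_to_bool` is hypothetical, and because the table lists no default for `generateProfile`, the sketch makes it required, which is an assumption.

```python
import argparse


def str_to_bool(value):
    """Interpret the literal strings 'True' / 'False' from the command line."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Build a parser mirroring Table 1 (defaults copied from the table)."""
    parser = argparse.ArgumentParser(description="SPMap argument sketch")
    # No default is documented for this flag; required=True is an assumption.
    parser.add_argument("--generateProfile", type=str_to_bool, required=True)
    parser.add_argument("--path", default="input_folder")
    parser.add_argument("--fastaFile_P", default="CYT_pos.fasta")
    parser.add_argument("--minSeqLen", type=int, default=20)
    parser.add_argument("--profileFile", default="CYT_pos_profile.txt")
    parser.add_argument("--subSeqLen", type=int, default=5)
    parser.add_argument("--fastaFile_O", default="CYT_golden_positive.fasta")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

The explicit converter matters because `type=bool` would turn the string "False" into `True`: any non-empty string is truthy in Python.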

References

Sarac, O. S., Gürsoy-Yüzügüllü, Ö., Cetin-Atalay, R., & Atalay, V. (2008). Subsequence-based feature map for protein function classification. Computational biology and chemistry, 32(2), 122-130.
