- About
- Installation
- Usage
- About the name
- Additional data
- Software used
- License
- How to cite
- Funding
- Contact
AnnapuRNA is a knowledge-based scoring function designed to evaluate RNA-ligand complex structures, generated by any computational docking method.
The recommended way to install and run AnnapuRNA is via a conda environment under 64-bit Linux (extensively tested on Ubuntu).
- Install miniconda. Please refer to the conda manual and install the conda version appropriate for your operating system. Please use the Python 2 version (miniconda2).
- Clone the AnnapuRNA repository:
git clone --depth=1 git@github.com:filipsPL/annapurna.git
or fetch a zip package.
- Go to the AnnapuRNA directory (typically
cd annapurna
under Linux) and restore the conda environment from the yml file:
conda env create -f conda-environment.yml
(the complete AnnapuRNA conda environment needs ~1.5 GB of free disk space).
To validate the installation and run the tests, please execute annapurna-tests.sh.
(if you no longer need AnnapuRNA)
- Remove the directory with the AnnapuRNA code.
- Remove the conda environment:
conda remove --name annapurna --all
- To verify that the environment was removed, run in your terminal window:
conda info --envs
AnnapuRNA was extensively tested under Linux with Ubuntu versions 16.04, 18.04, and 20.04, with the latest miniconda2 (Miniconda2-py27_4.8.3-Linux-x86_64.sh).
A Singularity image with the AnnapuRNA fast version (containing the fast kNN and RF scoring functions) is available in the Sylabs cloud: cloud.sylabs.io.
To fetch the latest image directly, run:
singularity pull library://filips/default/annapurna:latest
Sample input files from molecular docking are located in tests/testFiles/: 1AJU.pdb (the RNA structure) and ARG.sdf (poses from docking).
conda activate annapurna
mkdir testresults
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_modern -o testresults/output --groupby
Output files:
- Tables with scores: testresults/output.kNN_modern.csv (scores for all poses) and testresults/output.kNN_modern.grouped.csv (best score for each compound from the input file). The AnnapuRNA score is in the last column ("score"). The lower the value, the better.
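To post-process a score table outside AnnapuRNA, the "lower is better" rule can be applied with a few lines of Python. This is only a sketch: the comma delimiter and the column names `name` and `pose` are assumptions made for the example (only the `score` column name comes from the description above), so please check the header of your own file first.

```python
import csv
import io


def best_scores(csv_text, group_col, score_col="score", delimiter=","):
    """Return {group: lowest score} from a score table given as text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    best = {}
    for row in reader:
        group = row[group_col]
        score = float(row[score_col])
        # Lower AnnapuRNA scores are better, so keep the minimum per group.
        if group not in best or score < best[group]:
            best[group] = score
    return best


# Made-up table for illustration only; real files contain more columns.
example = "name,pose,score\nARG,1,-35.2\nARG,2,-38.0\nLIG2,1,-12.5\n"
print(best_scores(example, group_col="name"))
```

For a real file, read its content with `open(path).read()` and pass the actual name of the compound column as `group_col`.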
Usage of AnnapuRNA in a Singularity container is the same as for the standalone console version. Please note that the container implements only the fast versions of the scoring function, i.e., kNN and RF. For DL scoring functions, please use the regular version.
singularity exec annapurna.sif annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_modern -o testresults/output --groupby
# commands used in the screen cast
conda activate annapurna
./annapurna.py --help
mkdir testresults
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_modern -o testresults/output --groupby
cd testresults
ls -la
column -t output.kNN_modern.grouped.csv
column -t output.kNN_modern.csv | less
To see or run AnnapuRNA in jupyter-notebook, refer to the sample notebook (please note that this is a notebook with a bash kernel).
The PDB format is mandatory, with nucleotide letters assigned to atoms, e.g.:
ATOM 64 H1 G A 17 -5.322 17.506 1.537 1.00 0.00 H
ATOM 65 H21 G A 17 -5.499 17.205 3.712 1.00 0.00 H
ATOM 66 H22 G A 17 -4.319 17.828 4.843 1.00 0.00 H
ATOM 67 P C A 18 3.269 18.622 4.974 1.00 0.00 P
ATOM 68 OP1 C A 18 3.196 20.073 5.282 1.00 0.00 O
ATOM 69 OP2 C A 18 4.574 17.923 5.091 1.00 0.00 O
ATOM 70 O5' C A 18 2.219 17.861 5.902 1.00 0.00 O
PDB files fetched from the Protein Data Bank should be fine.
AnnapuRNA accepts many file formats, such as sdf, mol2, mol, pdb, or any other format understood by OpenBabel. It was extensively tested on sdf files.
Remarks:
- If your input file contains more than one compound (i.e., chemical compounds with unique structures), please make sure that each compound has a unique name/title.
- Please make sure that the ligands have the desired protonation state.
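Before scoring, it can help to list the titles present in a multi-record sdf file together with the number of poses carrying each title, so that unnamed records or unexpected names stand out. The sketch below relies only on the general SDF layout (the title is the first line of each record, and records end with a `$$$$` line); it does not compare structures, so it cannot tell whether two different compounds share a name. The function name and the stripped-down example records are made up for illustration.

```python
from collections import Counter


def poses_per_title(sdf_text):
    """Count how many records in an SDF text carry each title."""
    titles = []
    expect_title = True  # the very first line of the file is a title
    for line in sdf_text.rstrip().splitlines():
        if expect_title:
            titles.append(line.strip())
            expect_title = False
        elif line.strip() == "$$$$":
            # A record delimiter: the next line starts a new record.
            expect_title = True
    return Counter(titles)


# Stripped-down records (not valid mol blocks), for illustration only.
example = "ARG\nM  END\n$$$$\nARG\nM  END\n$$$$\nLIG2\nM  END\n$$$$\n"
counts = poses_per_title(example)
print(dict(counts))
```

An empty string among the keys would indicate records without a title.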
./start_h2o.sh
AnnapuRNA was benchmarked on four different models: 'DL_basic', 'DL_modern', 'kNN_basic', and 'kNN_modern'.
kNN_modern should be a good first shot:
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_modern -o testresults/output --groupby
Please note that you can score with more than one model in a single run:
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_basic -m kNN_modern -o testresults/output --groupby
or even all available models:
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m ALL -o testresults/output --groupby
Please pay attention to the optional argument --merge, which merges predictions from multiple models into a single file.
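When many complexes have to be scored, the command lines shown above can be assembled from a script. The helper below is a hypothetical convenience wrapper (its name and signature are not part of AnnapuRNA); it uses only the switches documented in this README and repeats `-m` once per model.

```python
import shlex


def annapurna_command(rna, ligands, models, output, groupby=False, merge=False):
    """Build the argument list for one annapurna.py run."""
    cmd = ["./annapurna.py", "-r", rna, "-l", ligands]
    for model in models:
        cmd += ["-m", model]  # one -m switch per scoring model
    cmd += ["-o", output]
    if groupby:
        cmd.append("--groupby")
    if merge:
        cmd.append("--merge")
    return cmd


cmd = annapurna_command(
    "tests/testFiles/1AJU.pdb",
    "tests/testFiles/ARG.sdf",
    ["kNN_basic", "kNN_modern"],
    "testresults/output",
    groupby=True,
    merge=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from within the activated conda environment.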
In addition to those four models, we provide two models of interactions: NB_modern (Naive Bayes) and RF_modern (Random Forests), both trained on the 2016 data set (but please note that their performance was not thoroughly tested).
The clustering of poses is optional and is based on the RMSD distance matrix. We implemented three clustering algorithms that take the RMSD distance matrix as input: the "AutoDock-like" method (AD, as implemented in AutoDock/AutoDock Vina), the "SimRNA-like" method (SR, as implemented in the ROSETTA/SimRNA programs), and the Affinity Propagation method (AP).
There are three switches defining clustering parameters:
- choosing a clustering method:
--clustering_method {False,AD,SR,AP}
Clustering method. AD = AutoDock-like; SR = SimRNA-
like; AP = Affinity Propagation.
- defining how many of the top-scoring poses will be taken for clustering (1 = all poses, 0.5 = 50% of the best poses, etc.):
--cluster_fraction CLUSTERINGFRACTION
Docking poses clustering. Select this fraction of top
scoring poses. 0-1. 0 = do not cluster results
- for the AD (AutoDock-like) and SR (SimRNA-like) clustering methods, defining a clustering cutoff. 2 Å should be a reasonable starting point:
--cluster_cutoff CLUSTERINGCUTOFF
Docking poses clustering. Use this RMSD cutoff for
clustering. 0 = do not use the RMSD cutoff
For examples, go to the Usage examples section.
For fine-tuning the Affinity Propagation method, go to the Program fine-tuning section.
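To illustrate how the fraction and cutoff parameters interact, here is a minimal sketch of a greedy, cutoff-based clustering in the spirit of the AutoDock-like method: poses are visited from the best score to the worst, and each pose either joins the first cluster whose seed lies within the RMSD cutoff or starts a new cluster. This is not AnnapuRNA's own code, which may differ in details; the function names and the toy data are made up for the example.

```python
def select_top_fraction(scores, fraction):
    """Indices of the best-scoring poses, lowest (best) score first."""
    if fraction <= 0:
        return []  # 0 = do not cluster
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    keep = max(1, int(round(fraction * len(order))))
    return order[:keep]


def greedy_rmsd_clusters(scores, rmsd, fraction=1.0, cutoff=2.0):
    """Cluster poses using a precomputed symmetric RMSD matrix (in Å)."""
    clusters = []  # each cluster is a list of pose indices; item 0 is the seed
    for i in select_top_fraction(scores, fraction):
        for cluster in clusters:
            if rmsd[i][cluster[0]] <= cutoff:
                cluster.append(i)
                break
        else:
            # No existing seed is close enough: this pose seeds a new cluster.
            clusters.append([i])
    return clusters


# Toy data: four poses, two clearly separated groups.
scores = [-30.0, -38.0, -25.0, -10.0]
rmsd = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 5.5, 6.0],
    [5.0, 5.5, 0.0, 1.5],
    [6.0, 6.0, 1.5, 0.0],
]
print(greedy_rmsd_clusters(scores, rmsd, fraction=1.0, cutoff=2.0))
```

In this sketch the seed of each cluster is its best-scoring member, so the seeds play the role of cluster representatives.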
-o OUTPUTFILENAME - define the output file name core, e.g., -o testresults/output will generate results in the testresults dir, with names starting with output.
-s, --skip_statistics - if statistics for a given complex (i.e., RNA + ligand poses) have already been calculated (e.g., in a previous run), they can be used directly to score poses, without the need to recalculate the interaction statistics.
--merge - merge predictions from multiple models into a single file. Useful when using multiple models for scoring.
-g, --groupby - in addition, output scores with a single best score for each compound.
Usually, there is no need to modify these settings.
Ligand contribution weight term
-e ENERGYWEIGHT, --weight_ligand_energy ENERGYWEIGHT
weight for a ligand's energy term. Default: 0.1. 0
(zero) = do not use the energy term.
The total score for RNA-Ligand complex is a sum of two terms:
The score of internal energy of ligand, E_Ligand , is derived from GAFF internal energy of the ligand and is calculated from the formula:
The ligand’s contribution to the final complex score is scaled by the weighting factor w. This parameter was set to 0.1 after optimization in a cross-validation experiment but may be changed by the user via the command-line switch -e or --weight_ligand_energy. To turn off the ligand term, set it to zero: -e 0.
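Written out from the description above (a sum of two terms, with the ligand term scaled by w), the total score has the following general form. The symbol names for the total score and for the interaction term are chosen here for illustration only, and the expression for E_Ligand itself is not reproduced; please refer to the publication for the exact definitions.

```latex
S_{\text{total}} = S_{\text{RNA-Ligand}} + w \cdot E_{\text{Ligand}},
\qquad w = 0.1 \ \text{(default)}
```

With -e 0 the second term vanishes and the total score reduces to the interaction term alone.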
Distance dependent probabilities
-w {False,L-J,linear,1/x,exp,x^2,log}, --weight_distance {False,L-J,linear,1/x,exp,x^2,log}
weight probabilities by distance depending function. False =
don't weight by distance (default)
We evaluated how the performance of the scoring functions changes when distance-dependent weights are applied to the component probability values calculated for each of the interactions. This transformation expresses the higher contribution of short-range interactions and the lower contribution of more distant ones. For this purpose, we introduced into equation 2 an additional distance-dependent weight factor w(d) (eq. 4):
We implemented three different transforming functions: multiplicative inverse (equation 5), Lennard-Jones-like transformation (equation 6) and linear transformation (equation 7):
By default, this is turned off.
Distance cut off
-d USEDISTANCECUTOFF, --distance_cutoff USEDISTANCECUTOFF
use distance cutoff. 0-10 Å. Default: 10 Å.
Limit the interaction sphere for which the interactions are calculated to a given distance. Please note that the scoring models were trained on interactions collected within a 10 Å distance.
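Conceptually, a distance cutoff means that only RNA-ligand atom pairs closer than the chosen value are considered. The sketch below shows this idea on plain coordinate tuples; it is not taken from the AnnapuRNA code, and the function name and coordinates are invented for the example (Python 3.8+ for `math.dist`).

```python
import math


def pairs_within_cutoff(rna_atoms, ligand_atoms, cutoff=10.0):
    """Return (rna_index, ligand_index, distance) for atom pairs within cutoff (Å)."""
    pairs = []
    for i, rna_xyz in enumerate(rna_atoms):
        for j, lig_xyz in enumerate(ligand_atoms):
            distance = math.dist(rna_xyz, lig_xyz)
            if distance <= cutoff:
                pairs.append((i, j, distance))
    return pairs


# Invented coordinates: one RNA atom near the ligand atom, one far away.
rna = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
ligand = [(3.0, 4.0, 0.0)]
print(pairs_within_cutoff(rna, ligand, cutoff=10.0))
```

Lowering the cutoff below the 10 Å used for training shrinks the set of pairs that contribute, which is why the default is usually the safest choice.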
Probabilities transformation
-t {False,PMF}, --transform_proba {False,PMF}
transform calculated probabilities. Default: false
The component probabilities can be transformed by applying a PMF-like transformation (see: Potential of Mean Force; see: Bernauer, RNA, 2011, 17, 1066-1075), expressed as -1*log(p), where p is the probability of interaction calculated from the ML model.
By default, this is turned off.
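As a small numeric illustration of the -1*log(p) transformation: probable interactions (p close to 1) map to values near zero, and unlikely ones map to large positive values. The sketch assumes the natural logarithm and clips p at a tiny `eps` to avoid log(0); both choices are assumptions of this example, not statements about the AnnapuRNA implementation.

```python
import math


def pmf_transform(p, eps=1e-12):
    """PMF-like transform of a probability: -log(p), natural log assumed."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    # Clip at eps so that p = 0 gives a large finite value instead of an error.
    return -math.log(max(p, eps))


for p in (0.9, 0.5, 0.1):
    print(p, round(pmf_transform(p), 3))
```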
🚷 ⛔ For normal use, there is no need to change the settings listed below, so please modify them only if you know what you are doing 💥
When working with very big docking files and/or operating on hardware with limited memory, it may be necessary to adjust the chunksize parameter in the program:
chunksize = 2000000 # adjust according to the available RAM
By default, AnnapuRNA assumes that the H2O ML server is running on the same computer on which AnnapuRNA is executed (i.e., the localhost, 127.0.0.1). This can be changed by editing the variable:
h2o_ip = "127.0.0.1"
To enable centroid calculation for clusters, change the averageStructure variable to True:
averageStructure = False # default
AP clustering is defined around line 991, with:
af = AffinityPropagation(affinity="precomputed").fit(rmsdMatrix)
For the available options, refer to the scikit-learn manual.
One can modify the AnnapuRNA code to add polar hydrogens to the ligand molecule(s). This feature can be modified by editing the code around lines 355-358:
# remove all hydrogens
# obmol.DeleteHydrogens()
# and add polar only
# obmol.AddHydrogens(True)
Please see the OpenBabel API manual for details: http://openbabel.org/dev-api/classOpenBabel_1_1OBMol.shtml
Manipulating the hydrogenation process may affect the calculation of the ligand term of the total score (and thus the total score).
Here we describe files from scoring with two methods, followed by a clustering:
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_basic -m kNN_modern -o testresults/output -s --overwrite --groupby --merge --cluster_fraction 1.0 --cluster_cutoff 2.0 --clustering_method AD
Files which are generated:
- sdf structural files with cluster representatives (for each scoring method, one sdf file is generated):
├── output_clusters__kNN_basic_TOP1.0_RMSD2.0_AD_representatives.sdf
├── output_clusters__kNN_modern_TOP1.0_RMSD2.0_AD_representatives.sdf
The score and the original pose number are stored in the sdf fields, e.g.:
> <Pose_Number>
137
> <AnnapuRNA Score>
-38.0489073146
- Score summary for all scoring functions: scores for all poses (.merged.csv files) and best poses (.grouped.merged.csv files).
├── output.merged.csv
├── output.grouped.merged.csv
- Scores for all poses (.csv files) and best poses (.grouped.csv files) for each scoring method:
├── output.kNN_basic.csv
├── output.kNN_basic.grouped.csv
├── output.kNN_modern.csv
├── output.kNN_modern.grouped.csv
Additional files:
- interaction statistics which were used for calculation of scores:
├── output.csv.bz2
- energy of the ligands:
├── output.ligand_energy.csv.bz2
- cleaned pdb files:
├── output.RNA.clean.pdb
└── output.RNA.clean.simrna.pdb
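The sdf fields stored in the cluster representative files (Pose_Number and AnnapuRNA Score, as listed above) can be read back without any chemistry toolkit, because SDF data items follow a simple text layout: a `> <Field>` header line, the value, and a blank line. The parser below is a minimal sketch with a made-up function name, and the example record is stripped down (not a valid mol block).

```python
def sdf_data_fields(sdf_text):
    """Return one {field name: value} dict per record of an SDF text."""
    records, fields, current = [], {}, None
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # End of record: store the collected fields and start afresh.
            records.append(fields)
            fields, current = {}, None
        elif line.startswith(">") and "<" in line:
            # Data header such as "> <Pose_Number>".
            current = line[line.index("<") + 1 : line.rindex(">")]
            fields[current] = ""
        elif current is not None:
            if line.strip():
                fields[current] = (fields[current] + "\n" + line).strip()
            else:
                current = None  # a blank line closes the data item
    return records


example = (
    "ARG\nM  END\n"
    "> <Pose_Number>\n137\n\n"
    "> <AnnapuRNA Score>\n-38.0489073146\n\n"
    "$$$$\n"
)
for record in sdf_data_fields(example):
    print(record["Pose_Number"], float(record["AnnapuRNA Score"]))
```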
- Score docking results with two kNN methods, output data to the testresults directory, and overwrite files if they exist. After scoring, perform clustering with the AD method for all poses (--cluster_fraction 1.0) with a 2 Å RMSD cutoff (--cluster_cutoff 2.0). Also generate a single file with the best pose for each compound (--groupby) and each method (--merge).
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_basic -m kNN_modern -o testresults/output -s --overwrite --groupby --merge --cluster_fraction 1.0 --cluster_cutoff 2.0 --clustering_method AD
- Score docking results with the kNN_modern method, output data to the testresults directory, and overwrite files if they exist. After scoring, perform clustering with the AP method for the 50% top-scoring poses (--cluster_fraction 0.5). Also generate a single file with the best pose for each compound (--groupby) and each method (--merge).
./annapurna.py -r tests/testFiles/1AJU.pdb -l tests/testFiles/ARG.sdf -m kNN_modern -o testresults/output -s --overwrite --groupby --merge --cluster_fraction 0.5 --clustering_method AP
AnnapuRNA was tested on the outputs from the following docking programs:
- rDock and its new fork RxDock
- AutoDock Vina
- iDock
- GOLD
Installation under Windows and MacOS. It should be possible to use AnnapuRNA with a conda environment under Windows and MacOS. The limitation is the availability of the Align-it program in the conda channel: currently, it is available only for Linux, so the user has to obtain and compile the program independently (the source code and instructions are available here).
Annapurna (/ˌænəˈpʊərnəˌ -ˈpɜːr-/; Sanskrit, Nepali, Newar: अन्नपूर्णा) is a massif in the Himalayas in north-central Nepal that includes one peak over 8,000 metres (26,000 ft), thirteen peaks over 7,000 metres (23,000 ft), and sixteen more over 6,000 metres (20,000 ft). The massif is 55 kilometres (34 mi) long, and is bounded by the Kali Gandaki Gorge on the west, the Marshyangdi River on the north and east, and by Pokhara Valley on the south. At the western end, the massif encloses a high basin called the Annapurna Sanctuary. The highest peak of the massif, Annapurna I Main, is the tenth highest mountain in the world at 8,091 metres (26,545 ft) above sea level. Maurice Herzog led a French expedition to its summit through the north face in 1950, making it the first of the eight-thousanders to be climbed and the only 8,000 meter-peak to be conquered on the first try. From Wikipedia, the free encyclopedia.
For additional data presented in the manuscript, please go to the supporting repository.
During the development of AnnapuRNA, we used several freely available packages for scientific computations. Here we acknowledge and thank:
- Biopython - a set of freely available tools for biological computation written in Python
- openbabel - a chemical toolbox designed to speak the many languages of chemical data
- numpy - a fundamental package for scientific computing with Python
- pandas - a fast, powerful, flexible and easy to use open source data analysis and manipulation tool
- Machine learning:
  - scikit-learn - Machine Learning in Python
  - h2o from h2o.ai - version 3.9.1.3501 - a fully open source, distributed in-memory machine learning platform with linear scalability. H2O is licensed with the Apache 2.0 open source license.
- rna-tools (formerly: rna-pdb-tools) by @mmagnus - a toolbox to analyze sequences, structures and simulations of RNA
- seaborn - statistical data visualization
This program is distributed under GNU Lesser General Public License Version 3, 29 June 2007. See the license for the details.
Stefaniak F, Bujnicki JM (2021) AnnapuRNA: A scoring function for predicting RNA-small molecule binding poses. PLoS Comput Biol 17(2): e1008309. https://doi.org/10.1371/journal.pcbi.1008309
Filip Stefaniak, Janusz M. Bujnicki, AnnapuRNA: a scoring function for predicting RNA-small molecule interactions, bioRxiv 2020.09.08.287136; doi: https://doi.org/10.1101/2020.09.08.287136
Funding: This research was supported by the Foundation for Polish Science and the EU European Regional Development Fund (POIR.04.04.00-00-3CF0/16 to J.M.B.). https://www.fnp.org.pl/ The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Laboratory of Bioinformatics and Protein Engineering International Institute of Molecular and Cell Biology in Warsaw ul. Ks. Trojdena 4, 02-109 Warsaw, Poland
Head of the Laboratory: Janusz M. Bujnicki iamb@genesilico.pl