Relative Information Gain: Shannon entropy-based measure of the relative structural conservation in RNA alignments
Structural characterization of RNAs is a dynamic field, offering many modelling possibilities. RNA secondary structure models are usually characterized by an encoding that depicts structural information of the molecule through string representations or graphs. Introducing a re-interpretation of the Shannon Information applied on RNA alignments, we propose a new scoring metric, the Relative Information Gain (RIG). The RIG score is available for any position in an alignment, showing how different levels of detail encoded in the RNA representation can contribute differently to convey structural information.
Computed RIG scores can be found in the RIG folder.
Marco Pietrosanto*, Marta Adinolfi*, Andrea Guarracino*, Fabrizio Ferrè, Gabriele Ausiello, Ilio Vitale, Manuela Helmer-Citterich. Relative Information Gain: Shannon Entropy-Based Measure of the Relative Structural Conservation in RNA Alignments, NAR Genomics and Bioinformatics, 2021 *Shared first authorship
git clone https://github.com/citterich-lab/RIG.git
cd RIG
pip3 install -r requirements.txt
With the following instruction the RIG scores are calculated by applying the zBEAR alphabet and removing redundant primary sequences up to 90% of identity. The computation are executed using a precalculated structural Position Specific Scoring Matrix (sPSSM) available in the repository.
Note: for a complete guide on how to calculate the RIG scores from scratch by applying all the encodings in the alphabets folder and removing redundant primary sequences up to 50%, 62%, and 90% of identity, refer to the next paragraph Complete guide.
To calculate the RIG scores, you just need to specify the sPSSM
file (it can be one of the generated matrices in the
sPSSM folder).
python3 scripts/compute_RIG.py outputs/sPSSMs/zbear_90/rfam_PSSM_dic_zbear_90.pickle.gz
The RIG scores will be written in the zbear_90_RIGs.tsv file.
In the following there are instructions to calculate the RIG scores from scratch by applying all the encodings in the alphabets folder and removing redundant primary sequences up to 50%, 62%, and 90% of identity.
If needed, to create the data for a new Rfam release you need to specify:
- the new gzipped Rfam seed (the last one was Rfam.seed.gz);
- the output directory;
- the path of the RNAfold program (you can find an executable for Linux operative systems in the tools folder);
- the path of the BearEncoder program (you can find the jar in the tools folder).
python3 scripts/prepare_new_rfam_release.py directory/new/gzipped/rfam-seed/Rfam.seed.gz data/Rfam14.2/ tools/RNAfold tools/BearEncoder.jar
Note: the following instructions are for the Linux operating system. Please change the ftp link according to your operating system to download the correct version of blastclust (see ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/).
cd ~
wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-ia32-linux.tar.gz
tar -xvf blast-2.2.26-ia32-linux.tar.gz
To create the BEAR alignments, you need to specify:
- the sequences folder;
- the bear_alignment folder with the structural alignments encoded in BEAR;
- the gapped_fam_dict.pickle.gz file;
- the identity threshold;
- the
seq_threshold
as the minimum number of RNAs in a Rfam family; - the
alphabet
as in the alphabets folder; - the path of the
blastclust
program.
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 90 5 data/alphabets/bear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 90 5 data/alphabets/qbear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 90 5 data/alphabets/zbear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 62 5 data/alphabets/bear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 62 5 data/alphabets/qbear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 62 5 data/alphabets/zbear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 50 5 data/alphabets/bear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 50 5 data/alphabets/qbear.tsv ~/blast-2.2.26/bin/blastclust
python3 scripts/BlustClust_filter_alignment.py data/Rfam14.2/sequences/ data/Rfam14.2/bear/ data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz 50 5 data/alphabets/zbear.tsv ~/blast-2.2.26/bin/blastclust
The alignment will be generated in the alignments folder.
To build RNA Blocks from structural alignments you need to specify:
- the bear_alignment folder with the structural alignments encoded in BEAR;
- the identity threshold;
- the
alphabet
as in the alphabets folder.
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_bear_90 90 alphabets/bear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_qbear_90 90 alphabets/qbear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_zbear_90 90 alphabets/zbear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_bear_62 62 alphabets/bear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_qbear_62 62 alphabets/qbear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_zbear_62 62 alphabets/zbear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_bear_50 50 alphabets/bear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_qbear_50 50 alphabets/qbear.tsv
python3 scripts/make_RNA_Blocks.py outputs/alignments/bear_new_alignment_zbear_50 50 alphabets/zbear.tsv
The RNA Blocks will be built in the RNA_Blocks folder.
To build a MBR you need to specify:
- the RNA_Blocks folder (built in the previous RNA Blocks from Rfam families alignment step)
- the identity threshold;
- the
alphabet
as in the alphabets folder; - the name of the
info_file
that will collect all the information of the built blocks.
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_bear_90 90 data/alphabets/bear.tsv bear_90
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_qbear_90 90 data/alphabets/qbear.tsv qbear_90
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_zbear_90 90 data/alphabets/zbear.tsv zbear_90
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_bear_62 62 data/alphabets/bear.tsv bear_62
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_qbear_62 62 data/alphabets/qbear.tsv qbear_62
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_zbear_62 62 data/alphabets/zbear.tsv zbear_62
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_bear_50 50 data/alphabets/bear.tsv bear_50
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_qbear_50 50 data/alphabets/qbear.tsv qbear_50
python3 scripts/make_MBR.py outputs/RNA_Blocks/blocks_new_bear_zbear_50 50 data/alphabets/zbear.tsv zbear_50
The MBR will be built in the MBRs folder.
Build a structural Position Specific Scoring Matrix (sPSSM) from a new Matrix of Bear encoded RNA (MBR)
To build a sPSSM you need to specify:
- the MBR version (for example,
zbear_90
); - the MBR file (it can be one of the generated MBRs in the MBRs folder);
- the
alphabet
as in the alphabets folder; - the gapped_fam_dict.pickle.gz file.
python3 scripts/make_PSSM.py bear_90 outputs/MBRs/bear_90/MBR_bear_90.tsv data/alphabets/bear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py qbear_90 outputs/MBRs/qbear_90/MBR_qbear_90.tsv data/alphabets/qbear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py zbear_90 outputs/MBRs/zbear_90/MBR_zbear_90.tsv data/alphabets/zbear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py bear_62 outputs/MBRs/bear_62/MBR_bear_62.tsv data/alphabets/bear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py qbear_62 outputs/MBRs/qbear_62/MBR_qbear_62.tsv data/alphabets/qbear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py zbear_62 outputs/MBRs/zbear_62/MBR_zbear_62.tsv data/alphabets/zbear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py bear_50 outputs/MBRs/bear_50/MBR_bear_50.tsv data/alphabets/bear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py qbear_50 outputs/MBRs/qbear_50/MBR_qbear_50.tsv data/alphabets/qbear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
python3 scripts/make_PSSM.py zbear_50 outputs/MBRs/zbear_50/MBR_zbear_50.tsv data/alphabets/zbear.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
The sPSSM will be built in the sPSSM folder.
To calculate the RIG scores, you need to specify:
- the
sPSSM
file (it can be one of the generated matrices in the sPSSM folder).
python3 scripts/compute_RIG.py outputs/sPSSMs/bear_90/rfam_PSSM_dic_bear_90.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/qbear_90/rfam_PSSM_dic_qbear_90.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/zbear_90/rfam_PSSM_dic_zbear_90.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/bear_62/rfam_PSSM_dic_bear_62.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/qbear_62/rfam_PSSM_dic_qbear_62.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/zbear_62/rfam_PSSM_dic_zbear_62.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/bear_50/rfam_PSSM_dic_bear_50.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/qbear_50/rfam_PSSM_dic_qbear_50.pickle.gz
python3 scripts/compute_RIG.py outputs/sPSSMs/zbear_50/rfam_PSSM_dic_zbear_50.pickle.gz
The RIG scores will be written in the RIGs folder.
Execute
python3 scripts/compute_entropy.py data/Rfam14.2/SS_cons/SS_cons_WUSS.tsv data/Rfam14.2/gapped_fam/gapped_fam_dict.pickle.gz
The output will be written in the entropy folder.
Execute
python3 scripts/plot_RIG_with_WUSS_notation.py data/Rfam14.2/SS_cons/SS_cons_WUSS.tsv
The plots will be generated in the RIG_WUSS folder.
Execute
python3 scripts/plot_RIG_minus_EntropyOrRIG.py data/Rfam14.2/SS_cons/SS_cons_WUSS.tsv entropy
The plots will be generated in the RIG_Entropy folder.
Execute
python3 scripts/plot_RIG_minus_EntropyOrRIG.py data/Rfam14.2/SS_cons/SS_cons_WUSS.tsv RIG
The plots will be generated in the RIG_Entropy folder.
Note: download here the source code distribution of R-scape, and follow the installation instructions.
python3 scripts/compute_rscape_power.py data/Rfam14.2/stockholm/ data/Rfam14.2/Rscape ~/path/where/rscape/executable/is/R-scape
Execute
python3 scripts/plot_RIG_and_RscapePower.py data/Rfam14.2/SS_cons/SS_cons_WUSS.tsv data/Rfam14.2/Rscape
The plots will be generated in the RIG_RscapePower folder.
To convert a bear file from fastB
format (Mattei et al., 2015)
to other structural alphabets, execute
python3 scripts/mapping.py data/alphabets/zbear.tsv data/Rfam14.2/bear/RF00001.folded.fastb > RF00001.folded.zbear.fastb