Predicting unrecognized enhancer-mediated genome topology by an ensemble machine learning model
Genome Res. December 2020 30: 1835-1845, doi:10.1101/gr.264606.120
- What can LoopPredictor do?
- Installation
1. Create conda environment
2. Install LoopPredictor - Example usage
1. Classifying loops for known chromatin interaction
2. Predicting loops for unknown cell types
3. Customize model for extensive research
LoopPredictor is an ensemble machine learning model, used to predict enhancer mediated loops in a genome-wide fashion across different cell lines, which is also applicable to different model organisms.
- Annotate current chromatin interactions and classify the loops into "e-p", "e-e", "p-p" and inactivate types through integrating the enrichment of active/inactive histone mark and distance to Transcription Start Sites(TSS), Which helped users to have a keen insight into topology structure of known cell type.
- Predict chromatin interactions for the unknown cell types which lack of 3D profile, LoopPredictor was trained by HiChIP and multi-omics datasets from three cell types (K562, GM12878, HCT116), users only need to feed the multi-omics features of interested cell type into the pre-trained adaptive model, the sensitive predictions helped users to have a preliminary investigation of unknown chromatin interaction.
- Construct customized models for an extensive prediction research. As 3D chromatin detecting technologies are developing at a high speed, LoopPredictor provided an open workframe for users to construct their own model, the features and targets could be organized as user-defined, and the self-adaptive parameters wil be chosen to tarin the model. After training, users can use the customized model to predict the topology structure of interest.
LoopPrediction is built on Python 3 and R 3.6.2. Homer is also needed for the annotation of chromatin regions.
- Prerequisites:
Python(>=3.4.0), R(>=3.6.2), Homer - Dependencies:
python packages: pandas, numpy, scikit-learn, pathos
R packages: stringr, GenomicRanges, TxDb.Hsapiens (download the corresponding TxDb package for the species you want to predict)
The virtual environment of conda was recommended for the installation of LoopPredictor and its dependencies. A virtual environment can be created and (de)activated as follows by using conda:
# create
conda create -n LoopPredictor-env python=3.6 pandas numpy scikit-learn=0.20.3 pathos
# activate
conda activate LoopPredictor-env
# deactivate
conda deactivate
The virtual environment can also be created by using virtualenv.
- After creating and activating the environment, download and install LoopPredictor (lastest version) from github:
git clone https://github.com/bioinfomaticsCSU/LoopPredictor.git
cd LoopPredictor
python setup.py install
- Or install LoopPredictor using pip:
pip install looppredictor
The testing data were available in /example/K562_classification_example. The structure of folder was shown as below, two kinds of necessary files need to be prepared as input, and the classification results will be *_Output.txt.
example /
K562_classification_example /
featureData / # [necessary input]features of corresonding cell line input for the classification
tmp / # [intermediate]temporary files generated within running
log / # [intermediate]log files generated within running
*.bedpe # [necessary input]loops file with .bedpe format
*_Output.txt # [output]results of loops classification
The classification was taken by the integration of active/inactive histone mark, so the corresonding ChIP-seq peaks of H3K27ac, H3K4me1, and H3K4me3 were the basic requirement. The peak files should be the standard ENCODE narrowPeak/broadPeak file without head line, shown as below, which were listed in the folder /example/K562_classification_exampleas/featureData.
chr22 16843445 16868802 . 322 . 2.120582 13.1 -1
chr22 17024793 17024896 . 985 . 11.483429 2.8 -1
chr22 17038424 17038594 . 854 . 9.633610 5.0 -1
chr22 17050044 17050593 . 465 . 4.143174 2.0 -1
chr22 17050418 17050537 . 984 . 11.468583 4.2 -1
chr22 17066392 17067403 . 892 . 10.169340 14.8 -1
chr22 17067959 17068242 . 878 . 9.966456 13.3 -1
chr22 17068652 17068827 . 835 . 9.358364 4.9 -1
The loops file should be .bedpe format with at least 6 columns, columns were seperated by tab. The minimum columns should include the chrom name, start, end of each anchor, shown as below.
chr22 38290514 38294289 chr22 38680609 38682339
chr5 96033605 96042289 chr5 96259190 96260539
chr1 23665194 23673403 chr1 24097536 24108995
chr3 176676833 176679516 chr3 176741264 176748850
chr11 63604132 63609924 chr11 63751693 63756217
chr17 37005563 37012402 chr17 38801324 38806978
chr3 138311141 138315068 chr3 138482903 138484460
chr11 126078482 126084019 chr11 126210767 126227804
The parameters of the script are as following,
classifyloops -l <loops> -f <featurePath> -g <genome> -o <output_name> -i <integer>
-l [string] loop file with ./bedpe format to be classified.
-f [string] absolute path of the featureData folder.
-g [string] genome of loops.
-o [string] path to save the output result.
-i [integer(1-5)] 1. Output all types of loops without filtering;
2. Output only "e-p"/"p-e" type loops;
3. Output only "e-e" type loops;
4. Output only "p-p" type loops;
5. Output only "inactivate-*"/"*-inativate" type loops;
Here is a running example:
classifyloops -l /path/to/example/K562_classification_example/K562_classifyLoop_example.bedpe \
-f /path/to/example/K562_classification_example/featureData \
-o /path/to/example/K562_classification_example -g hg19 -i 1
The testing data were available in /example/NIH3T3_prediction_example. The structure of folder was shown as below, the featureData is the necessary input for the prediction, and if you want to investigate the topology structures of a set of specific genes, a bed file containing the regions of genes could be provided. If there is no bed file inputing, the prediction will be performed for the whole genome in de novo, which will take some time.
example /
NIH3T3_prediction_example /
featureData / # [necessary input]features of corresonding cell line input for the prediction
tmp3 / # [intermediate]temporary files generated within running
log / # [intermediate]log files generated within running
*.bed # [optional input]input bed file of interested regions for loops prediction
feature_out.txt # [intermediate]features are extracted from files within running
*predicted_result.bedpe # [output]predicted results of loops
We provided three typical pre-trained model for the prediction, which could be found in folder trained_model. The model should be chosen to match the features you can get access to.
pre-trained model | multi-omics features requirement |
---|---|
Minimum model | -ATAC-seq, -ChIP-seq/CUT&RUN(H3K27ac,H3K4me3) |
Median model | -ATAC-seq, -ChIP-seq/CUT&RUN(H3K27ac,H3K4me3,H3K4me1,H3K9ac,H3K9me3,CTCF), -RNA-seq |
Maximum model | -ATAC-seq, -ChIP-seq/CUT&RUN(H3K27ac,H3K4me3,H3k4me2,H3K4me1,H3K9ac,H3K9me3,H3K36me3,H3K79me2,CTCF,ELF1,JUND,MAX,YY1), -RNA-seq, -Methylation |
The multi-omics features data should be put into folder featureData/. For ATAC-seq and ChIP-seq/CUT&RUN data, the format should be standard narrowPeak/broadPeak, as shown above. For RNA-seq data, we recommend to use Homer to build tag files for the alignment, and then "findPeaks" function is utilized to detect the highly enriched regions, the output from "findPeaks" can be used as the feature file of RNA-seq data after removing the head line. For Methylation data, we recommend to download the .bedRrbs format of RRBS data from ENCODE, which is shown as below,
chr1 1000170 1000171 K562_Rep3_RRBS 46 + 1000170 1000171 155,255,0 46 35
chr1 1000190 1000191 K562_Rep3_RRBS 46 + 1000190 1000191 105,255,0 46 15
chr1 1000191 1000192 K562_Rep3_RRBS 53 - 1000191 1000192 55,255,0 53 9
chr1 1000198 1000199 K562_Rep3_RRBS 46 + 1000198 1000199 105,255,0 46 20
chr1 1000199 1000200 K562_Rep3_RRBS 53 - 1000199 1000200 105,255,0 53 15
chr1 1000206 1000207 K562_Rep3_RRBS 53 - 1000206 1000207 155,255,0 53 26
If you want to detect the enhancer-mediated interactions for a set of interested gene, we recommend to extract the name of genes to the coordinates on the chromatin by UCSC Table Browser. The coordinates should be ./bed format with at least 3 columns(chrom,start,end).
After preparing the input files, you can run the script "LoopPredictor.py" to perform the prediction. The parameters of the script are as following,
looppredictor -b <bedfile> -f <featurePath> -g <genome> -t <trainfile> -m <model> -c <cutoff> -o <output_name>
-b [string] coordinate .bed file of a set of interested genes.
-f [string] absolute path of the featureData folder.
-g [string] genome of the features data.
-t [string] .fix file for the pre-trained model.
-m [string] the pre-trained model chosen to use.
-c [integer(>0)] the cutoff of loop score to filter the predicted output loops.
-o [string] path to save the output result.
Here is a running example:
looppredictor -b /path/to/example/NIH3T3_prediction_example/NIH_geneEnh_example.bed \
-f /path/to/example/NIH3T3_prediction_example/featureData \
-g mm10 \
-t /path/to/trained_model/features_median_forTraining.fix \
-m /path/to/trained_model/GBRT_trained_model_median.m \
-c 1 \
-o /path/to/example/NIH3T3_prediction_example
- Step1: prepare trianing data
The training data contains target and multi-omics data, target file shoule be the chromatin interactions in ./bedpe format, which is the prior knowledge to train the model. The target file should be 8 columns without head line, the final column must be the score of interaction, which is important for the prediction, shown as below,
chr17 75115676 75125670 chr17 76731709 76734226 loop_name_1 1
chr12 56649935 56654417 chr12 56860478 56865446 loop_name_2 17
chr18 53026973 53030447 chr18 54302527 54308273 loop_name_3 2
chr10 15644687 15646877 chr10 15825760 15862181 loop_name_4 7
chr21 35439250 35465483 chr21 36693517 36722679 loop_name_5 3
The multi-omics data for the corresponding cell line should be prepared in the featureData/ folder as mentioned above.
- Step2: Running training workflow
After preparing the input training data, you can run the script "Customized_GBRT_trainer.py" to train your own model. The parameters of the script are as following,
customized_gbrt_trainer -t <trainfile> -f <feature> -g <genome> -o <output_path> -n <output_name>
-t [string] target file of a set of loops with score to train the model, which is .bedpe format.
-f [string] absolute path of the featureData folder.
-g [string] genome of the features data.
-o [string] path to save the output result.
-n [string] output name of the model.
Here is a running example:
customized_gbrt_trainer -t /path/to/example/HCT116_custom_model_example/HCT116_custom_example.bedpe \
-f /path/to/example/HCT116_custom_model_example/featureData \
-g hg19 \
-o /path/to/example/NIH3T3_prediction_example \
-n HCT116_custom_model