The Triforce is an automated pipeline for processing Hi-C data with different methods.
Developed by: Dulce I. Valdivia + Luis Delaye + Kasia Oktaba
Clone this repository in your working environment:
git clone https://github.com/dulcirena/TAD-triforce.git
- Juicer
- HiCExplorer
- R
- R packages:
  - tidyverse
  - dplyr
  - strucchangeRcpp
  - plotly
  - ggpubr
  - scico
- bedtools
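Before running the pipeline, it can help to confirm that the command-line dependencies are reachable. The following is a minimal sketch and is not part of the repository: the helper name `missing_tools` is hypothetical, and it assumes that R is exposed through `Rscript` and HiCExplorer through its `hicFindTADs` command. Juicer is usually run from its own scripts or jar file, so it is not checked here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TAD-triforce): print every command
# from the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for the dependencies listed above.
missing=$(missing_tools Rscript bedtools hicFindTADs)
if [ -n "$missing" ]; then
    printf 'Missing from PATH:\n%s\n' "$missing" >&2
fi
```

An empty result means every listed command was found. The R packages still need to be checked from within R.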
Work in progress
Once you have run Arrowhead and HiCExplorer to obtain their corresponding TAD annotation files, use TRIFORCE to obtain a consensus set of high-confidence TADs.
To run the complete workflow for TAD annotation:
./triforce.sh <WORK_DIR> \
<FILES_FASTQ_DIR> <TAD_SEP_SCORE> \
<RESOLUTION_KB> <PROJECT_NAME> <DISMISS_CHR> \
<FILE_ARROWHEAD_TADS> <FILE_HICEXPLORER_TADS>
Description:
- WORK_DIR: Directory where the out/ directory will be created.
- FILES_FASTQ_DIR: Directory containing the FASTQ files (support for this step is still in progress; any path can be given for now).
- TAD_SEP_SCORE: TAD separation score file computed with HiCExplorer. Should end with tad_score.bm
- RESOLUTION_KB: Resolution of the matrix in kb.
- PROJECT_NAME: ID for the project (do not use spaces)
- DISMISS_CHR: The name of the chromosome you want to dismiss during the analysis.
- FILE_ARROWHEAD_TADS: Arrowhead's TAD calling file (e.g. 10000_blocks)
- FILE_HICEXPLORER_TADS: HiCExplorer's TAD calling file. Should end with domains.bed
See the file src/run_test.sh for a working example.
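The constraints listed in the description can be checked before launching a run. This is a sketch under stated assumptions: `check_triforce_args` is a hypothetical helper that does not ship with the repository, and it only enforces the rules stated in the parameter list (eight arguments, no spaces in PROJECT_NAME, and the expected file-name endings).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; arguments are given in the same order
# as for triforce.sh.
check_triforce_args() {
    if [ "$#" -ne 8 ]; then
        echo "expected 8 arguments, got $#" >&2
        return 1
    fi
    tad_sep_score=$3
    project_name=$5
    hicexplorer_tads=$8

    # TAD_SEP_SCORE should end with tad_score.bm
    case "$tad_sep_score" in
        *tad_score.bm) ;;
        *) echo "TAD_SEP_SCORE should end with tad_score.bm" >&2; return 1 ;;
    esac
    # PROJECT_NAME must not contain spaces
    case "$project_name" in
        *" "*) echo "PROJECT_NAME must not contain spaces" >&2; return 1 ;;
    esac
    # FILE_HICEXPLORER_TADS should end with domains.bed
    case "$hicexplorer_tads" in
        *domains.bed) ;;
        *) echo "FILE_HICEXPLORER_TADS should end with domains.bed" >&2; return 1 ;;
    esac
    return 0
}
```

One possible use is to place it in a small wrapper script and run `check_triforce_args "$@" && ./triforce.sh "$@"`, so that the pipeline only starts when the arguments look plausible.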
Output:
All the outputs are stored in the directory WORK_DIR/out/
- structure_CHR.html: An interactive file per chromosome (CHR) showing the breakpoints of the TAD separation score (TAD-SS) according to the structural change analysis (SCA). SCA splits the TAD-SS into genomic regions that exhibit similar contact trends. The width of each breakpoint (lightgreen) represents its confidence interval.
- confidenceIntervalCHR.tsv: The coordinates of the 5% and 95% CI for each breakpoint for each CHR. The coordinate of the 50% CI is used in the downstream analysis.
- avgSCregion_boxplot.html: Boxplot showing the distribution of the average TAD separation score in each SCA-region. Only the regions above the overall median are kept for downstream analysis.
- domainSizes_files: Distribution of TAD length at the different steps of the analysis. The TADs produced by the majority vote script are usually longer because it merges the regions of consecutive TADs.
- wd_mountains.bed: High-confidence TADs
- wd_valleys.bed: Out of TAD regions
- wd_interestRegions.bed: All regions classified as high-confidence TADs or out of TADs (basically a union of wd_mountains and wd_valleys). Includes a color column for visualization in HiCExplorer.
- region_type_count.pdf: A plot showing the number of regions in each class (Majority Vote, Fuzzy or Out of TAD). This plot is generated before the majority vote areas are refined into high-confidence TADs.
- size_class.pdf: A plot showing the total length of the genome classified as High-confidence TADs, TADs between fuzzy regions, Fuzzy regions and Out of TAD regions.
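As a quick sanity check on the BED outputs, the total genome length covered by a file can be summed from its coordinates. This is a minimal sketch, assuming standard BED columns (chromosome, start, end) in the first three tab-separated fields. `bed_total_length` is a hypothetical helper and is not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum (end - start) over all intervals in a BED file.
# Comment, track and browser lines are skipped.
bed_total_length() {
    awk -F'\t' '
        !/^(#|track|browser)/ && NF >= 3 { total += $3 - $2 }
        END { printf "%.0f\n", total }
    ' "$1"
}
```

For example, running `bed_total_length` on wd_mountains.bed and on wd_valleys.bed from the output directory gives the number of base pairs in each file, which can be compared with the totals shown in size_class.pdf.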