3D-DNA scaffolds refinement
$ pip install -e .
This will install the Python package hic_hiker and its CLI command of the same name.
(In the future, $ pip install hic_hiker)
To run HiC-Hiker, you will need the following:
- 3D-DNA and Juicer, and their requirements
- python3, >=3.5, with
  - numpy, matplotlib, sklearn, scipy (basic matrix operations and visualization)
  - pandas, feather (handling of large datasets)
  - BioPython (fasta parsing)
  - pysam (sam parsing)
  - tqdm (progress bar)
  - matplotlib-scalebar (to show a scale bar on the benchmark matrix plot)
  (These Python packages will be satisfied automatically when installing with pip.)
- RAM >=80GB for a human dataset (we tested on a machine with 100GB of RAM)
You have to prepare:
- Input files of 3D-DNA:
  - contigs: .fasta
  - Hi-C contacts: .mnd.txt (generated by Juicer)
- Output files of 3D-DNA:
  - scaffold layout: .final.assembly
    (use .final.assembly, not .FINAL.assembly)
  - (chopped contigs: .final.fasta)
- An empty directory for workspace (to store intermediate files and results)
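The checklist above can be verified before a long run. The following is a minimal sketch, not part of HiC-Hiker: the function name `check_inputs` and its messages are hypothetical, and it only checks what this README states (files exist, the layout is not the `.FINAL.assembly` variant, the workspace is an empty directory).

```python
from pathlib import Path


def check_inputs(contigs, assembly, contacts, workspace):
    """Return a list of problems; an empty list means the inputs look plausible.

    Hypothetical helper for illustration only; HiC-Hiker does not provide it.
    """
    problems = []

    # The three input files must exist.
    for label, path in (
        ("contigs", contigs),
        ("scaffold layout", assembly),
        ("Hi-C contacts", contacts),
    ):
        if not Path(path).is_file():
            problems.append(f"{label} file not found: {path}")

    # The README asks for .final.assembly rather than .FINAL.assembly.
    if Path(assembly).name.endswith(".FINAL.assembly"):
        problems.append("use .final.assembly, not .FINAL.assembly")

    # The workspace must be an existing, empty directory.
    ws = Path(workspace)
    if not ws.is_dir():
        problems.append(f"workspace directory not found: {workspace}")
    elif any(ws.iterdir()):
        problems.append(f"workspace directory is not empty: {workspace}")

    return problems
```

Calling `check_inputs(...)` with the four paths and printing the returned list gives a quick report before starting the run.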
Additionally, to run benchmarks with the reference sequence (and to plot the figures in our paper), you will need:
- An alignment file of the contigs to the reference: .sam
  generated by running:
  $ bwa mem -t 32 ../hg38/hg38.fa GSE95797_Hs1.final.fasta > GSE95797_Hs1.final.fasta.sam
  where GSE95797_Hs1.final.fasta is one of the outputs of 3D-DNA (chopped contigs).
To refine scaffolds,
$ hic_hiker <contigs.fasta> <scaffold_layout.assembly> <contacts.mnd.txt> <workspace directory> -K <K>
where K is a threshold parameter. Hi-C contacts whose separation distance is above K bp will be ignored when polishing. For the human genome dataset, we used 75000 bp.
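As a rough illustration of what the K threshold means, the sketch below drops contacts whose two ends are more than K bp apart. The `Contact` type, the `filter_contacts` function, and the idea that both ends are already expressed as positions on one shared coordinate are simplifying assumptions made for this example; this is not HiC-Hiker's internal implementation.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: positions (in bp) of the two read ends
# of one Hi-C contact, on a shared coordinate system.
Contact = Tuple[int, int]


def filter_contacts(contacts: Iterable[Contact], k: int) -> List[Contact]:
    """Keep only contacts whose separation distance is at most k bp."""
    return [(a, b) for a, b in contacts if abs(a - b) <= k]


contacts = [(1_000, 50_000), (10_000, 200_000), (0, 75_000)]
kept = filter_contacts(contacts, k=75_000)
print(kept)  # [(1000, 50000), (0, 75000)]
```

The second contact spans 190000 bp, which is above K = 75000, so it is ignored; a contact at exactly K bp is kept under the "above K" wording.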
If you want to run the whole pipeline including benchmarks,
$ hic_hiker <contigs.fasta> <scaffold_layout.assembly> <contacts.mnd.txt> <workspace directory> -K <K> --refsam <final.fasta.sam>
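When scripting several runs, the two command lines above can be assembled from Python. This is a minimal sketch: `build_command` is a hypothetical helper, the file names in the usage lines are placeholders, and only the positional arguments and the `-K` and `--refsam` options shown in this README are used.

```python
import shlex
import subprocess
from typing import List, Optional


def build_command(
    contigs: str,
    assembly: str,
    contacts: str,
    workspace: str,
    k: int,
    refsam: Optional[str] = None,
) -> List[str]:
    """Build the hic_hiker argument list; --refsam is added only for benchmarks."""
    cmd = ["hic_hiker", contigs, assembly, contacts, workspace, "-K", str(k)]
    if refsam is not None:
        cmd += ["--refsam", refsam]
    return cmd


def run(cmd: List[str]) -> None:
    """Run the command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)


# Placeholder file names, for illustration only.
cmd = build_command(
    "contigs.fasta",
    "scaffolds.final.assembly",
    "contacts.mnd.txt",
    "workspace",
    75000,
    refsam="scaffolds.final.fasta.sam",
)
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces; call `run(cmd)` once hic_hiker is installed.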
After the process finishes, you will see in the workspace directory:
- polished.assembly: assembly file with refined orientations
- polished.fasta: polished chromosome-length scaffold sequences (with no gaps added)
- Figures in our paper:
  - fig_distribution.png
  - fig_errorchart.png (only when you enabled benchmarks)
  - fig_matrix.png (only when you enabled benchmarks)
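To confirm that a run completed, the workspace can be checked for the files listed above. This is a hedged sketch: `missing_outputs` is a hypothetical helper, and it assumes, from the list above, that fig_distribution.png is written on every run while the other two figures appear only with benchmarks enabled.

```python
from pathlib import Path
from typing import List

# Assumption from the list above: these are written on every run.
ALWAYS = ["polished.assembly", "polished.fasta", "fig_distribution.png"]
# Written only when benchmarks are enabled (--refsam given).
BENCHMARK_ONLY = ["fig_errorchart.png", "fig_matrix.png"]


def missing_outputs(workspace: str, benchmarks: bool = False) -> List[str]:
    """Return the expected output file names that are absent from the workspace."""
    expected = ALWAYS + (BENCHMARK_ONLY if benchmarks else [])
    root = Path(workspace)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list from `missing_outputs("workspace", benchmarks=True)` suggests that the full pipeline, including the benchmark figures, finished.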
To uninstall,
$ pip uninstall hic_hiker
HiC-Hiker: A probabilistic model to determine contig orientation in chromosome-length scaffolds by Hi-C (not published)