VIPA

1.Dependencies

Install the following Perl packages:

Pod::Usage
Getopt::Long
File::Spec
Cwd
Data::Dumper
List::Util
Bio::DB::Fasta
Bio::Seq

VIPA pipelines have several software dependencies, including:

perl (Version >=5.30.1)
fqtools
java (Version >=1.8.0_242)
trimmomatic (Version >=0.39)
bwa (Version >=0.7.17-r1188)
samtools (Version >=1.9)
blastn (Version >=2.8.1+)
muscle (Version >=3.8.31)
EMBOSS cons (Version >=6.6.0.0)
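
Before running the pipeline, it can help to confirm that the command-line tools are reachable. The sketch below defines a hypothetical helper, check_tools (not part of VIPA), that reports which commands are missing from PATH. The executable names are assumptions based on the list above (for example, EMBOSS cons is assumed to be installed as `cons`), Trimmomatic is left out because it is distributed as a jar file, and versions are not verified.

```shell
#!/bin/sh
# check_tools: hypothetical helper, not part of VIPA.
# Prints "MISSING: <name>" for each command not found on PATH
# and returns non-zero if any command is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "MISSING: $tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names are assumptions; adjust them to your installation.
check_tools perl fqtools java bwa samtools blastn muscle cons \
    || echo "Install the missing tools before running VIPA."
```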

2.Installation

Download all the files into one folder

3.Prepare sequencing data

Create a rawdata directory to hold fq.gz files

4.Demo

We attached demo files that include both the input data and the results generated by VIPA.

5.VIPA pipeline

5.1 Quality control

Go to the rawdata directory and run the following commands step by step:

ls *.gz|awk '{print $0}'|awk -F '_' '{print ""$1""}'|sort -u >sample.txt
mkdir -p clean unpaired
cat sample.txt|awk '{print "nohup sh trimmomatic.sh "$1" >>"$1".log &"}' >run_trimmomatic.sh
cat sample.txt|awk '{print "nohup sh fqtools.sh "$1" >>"$1".log &"}' >run_fqtools.sh
sh run_trimmomatic.sh &
sh run_fqtools.sh
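
The first command above derives the sample IDs by keeping the text before the first underscore of each file name. The sketch below wraps the same logic in a hypothetical function, list_samples, and runs it on made-up file names in a temporary directory, so the effect can be checked without real data. A sample ID that itself contains an underscore would be cut at the first one.

```shell
#!/bin/sh
# list_samples: hypothetical wrapper around the sample.txt command above.
# $1: directory holding the *.gz files
list_samples() {
    # Keep the text before the first "_" and drop duplicates,
    # so each sample is listed once.
    ( cd "$1" && ls *.gz | awk -F '_' '{print $1}' | sort -u )
}

# Made-up file names, used only to illustrate the result.
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_R1.fq.gz" "$demo_dir/sampleA_R2.fq.gz" \
      "$demo_dir/sampleB_R1.fq.gz" "$demo_dir/sampleB_R2.fq.gz"
list_samples "$demo_dir"
rm -rf "$demo_dir"
```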

5.2 Prepare configuration

mkdir -p ../results

Go to the results directory and prepare the configuration file config.txt

Run the command: perl prepare.pl -c config.txt

Description of config.txt file

raw_data_path: Path to the clean data
result_path: Path to the results
sample_suffix: The read suffix of the data files, _R or _. _R stands for _R1.fastq.gz and _R1.fq.gz; _ stands for _1.fastq.gz and _1.fq.gz
ref_merge: Path to the merged fasta of human and viruses
bwa_ref_path: bwa software path
blast_ref_path: blast software path
hsa_ref_type: Human reference version
suffix: The file suffix of the data, fastq.gz or fq.gz. fastq.gz stands for _R1.fastq.gz and _1.fastq.gz; fq.gz stands for _R1.fq.gz and _1.fq.gz
layout: Read mode and length, e.g. SE100, PE150
threads: Number of threads required
sge: Boolean, the default is False. When False, the shell scripts are submitted with nohup; when True, they are submitted with qsub
maxvmem: The maximum memory requirement
mem: The memory requirement for each thread of the bwa software
type: Sequence type
method: MIP or RCA. Fill in MIP if the library method is based on PCR amplification, in which case duplicates are not removed. Otherwise, fill in RCA, in which case duplicates are removed after the mapping step
EBV: Boolean, the default is False. When True, the integrations located in the repeat regions of EBV are removed
mode: multi or dominate. multi generates the integrations of the virus subtypes; dominate generates only the integrations of the top virus subtype
depth: The lowest depth for virus detection
coverage: The lowest coverage for virus detection
readcounts: The lowest read count for virus detection
readsupport: The lowest read support for virus integration sites
flanking: The flanking length of the human-viral junction sequences used for integration pattern calculations
picard: The path to the picard jar file
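
The sample_suffix and suffix settings together describe how the read files are named. The sketch below uses a hypothetical helper, read_name (not part of VIPA), to print the file names implied by each combination for a made-up sample ID. The descriptions above only mention read 1; the naming of read 2 is assumed here by analogy.

```shell
#!/bin/sh
# read_name: hypothetical helper, not part of VIPA.
#   $1 sample id
#   $2 sample_suffix (_R or _)
#   $3 read number (1, or 2 by assumption)
#   $4 suffix (fastq.gz or fq.gz)
read_name() {
    printf '%s%s%s.%s\n' "$1" "$2" "$3" "$4"
}

# "sampleA" is a made-up sample id.
for sample_suffix in _R _; do
    for suffix in fastq.gz fq.gz; do
        read_name sampleA "$sample_suffix" 1 "$suffix"
    done
done
```

With sample_suffix set to _R and suffix set to fq.gz, for example, the expected read 1 file for sampleA is sampleA_R1.fq.gz.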

5.3 A step-by-step protocol of the VIPA pipeline

Then 1_out_all_pre.sh, 2_run_all_pre.sh, 3_out_all_pipe.sh and 5_work.sh are generated in the current directory.

Execute shell scripts 1 through 5 in order; 4_run_all_pipe.sh is generated when 3_out_all_pipe.sh is executed.

sh 1_out_all_pre.sh  
sh 2_run_all_pre.sh  
sh 3_out_all_pipe.sh  
sh 4_run_all_pipe.sh  
sh 5_work.sh
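
Because each script depends on the output of the previous one, it can be useful to stop at the first failure. The sketch below defines a hypothetical wrapper, run_in_order (not part of VIPA), that runs the scripts in the order given and aborts when a script is missing or exits with a non-zero status. If the generated scripts launch background jobs (nohup or qsub, depending on the sge setting), a script returning does not mean its jobs have finished, so confirm completion before starting the next step.

```shell
#!/bin/sh
# run_in_order: hypothetical wrapper, not part of VIPA.
# Runs each script with sh, in the order given, and stops at the first
# script that is missing or exits with a non-zero status.
run_in_order() {
    for script in "$@"; do
        if [ ! -f "$script" ]; then
            echo "not found: $script" >&2
            return 1
        fi
        echo "running $script"
        sh "$script" || { echo "failed: $script" >&2; return 1; }
    done
}

# Usage from the results directory (commented out so that reading or
# sourcing this sketch does not launch the pipeline):
# run_in_order 1_out_all_pre.sh 2_run_all_pre.sh 3_out_all_pipe.sh \
#     4_run_all_pipe.sh 5_work.sh
```

The existence check happens just before each script is run, so 4_run_all_pipe.sh only needs to exist once step 3 has completed.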

Shell scripts 1-4 contain the following:

1_out_all_pre.sh

perl pre_pipe.pl /results/*/pre.config         
Note: generates a_pre.sh

2_run_all_pre.sh

sh /results/*/a_pre.sh   
Note: executes a_pre.sh for each sample; the generated files are stored in results/*/pre

3_out_all_pipe.sh

perl out_pipe.pl /results/*/pipe.config    
Note: generates all.sh

4_run_all_pipe.sh

sh /results/*/hpv*/all.sh   
Note: executes all.sh in every sample directory that contains HPV subdirectories, e.g. sample168/hpv16/all.sh.
This generates the following scripts in the same directory as all.sh:
  - b_align.sh  This is the alignment script
  - c_deal.sh   This is the script for soft-clip read extraction
  - d_assemble.sh  This is the script to generate junctional sequence
  - e_sdej.sh  This is the script to analyze the SD-EJ pathway
  - f_mh.sh  This is the script to analyze the Microhomologies
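
The wildcard path above stands for one all.sh per sample and HPV type. If the scripts ever need to be rerun by hand, note that passing several matched paths to a single sh call runs only the first one (the remaining paths become its arguments), so a loop is needed. The sketch below defines a hypothetical helper, run_all_sh (not part of VIPA). Running each all.sh from its own directory is an assumption, made because the generated files are described as landing next to all.sh.

```shell
#!/bin/sh
# run_all_sh: hypothetical helper, not part of VIPA.
# $1: results directory. Runs every <sample>/hpv*/all.sh one at a time.
run_all_sh() {
    for script in "$1"/*/hpv*/all.sh; do
        [ -f "$script" ] || continue      # the glob matched nothing
        echo "running $script"
        # Assumption: all.sh is meant to run from its own directory.
        ( cd "$(dirname "$script")" && sh ./all.sh ) || return 1
    done
}

# Example (commented out): run_all_sh /path/to/results
```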

5.4 Statistics

perl stat_breakpoints.pl  
cd ../rawdata  
perl data_stat.pl $PWD sample.txt data.stat.xls  
cd ../results  
head */pre/stat.xls|sed ':t;N;s#/pre/stat.xls <==\n#\t#;b t'|sed 's/==> //'|sed '/^$/d'|sed 's/^hpv/\thpv/' >stat.xls  
ls */pre/*metrics|while read l;do a=${l%%/*};echo -ne "\n$a\t";grep "Unknown" $l|awk -F'\t' '{printf $7}';done >dedup.stat  
head */pre/dedup.coverage|sed ':t;N;s#/pre/dedup.coverage <==\n#\t#;b t'|sed 's/==> //'|sed '/^$/d'|sed 's/^hpv/\thpv/' >dedup.coverage
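
The head and sed one-liners above merge the per-sample files into one table. They appear to rely on head printing a "==> file <==" banner when it is given several files, and on GNU sed escapes such as \t and \n. The sketch below is a more explicit approximation of the same merge, written as a hypothetical function merge_stat (not part of VIPA): the first line of each file is prefixed with the sample name, and later lines that start with hpv get an empty first column.

```shell
#!/bin/sh
# merge_stat: hypothetical, more explicit take on the head|sed merge above.
# $1: results directory
# $2: file name under <sample>/pre/ (e.g. stat.xls or dedup.coverage)
merge_stat() {
    for f in "$1"/*/pre/"$2"; do
        [ -f "$f" ] || continue
        # The sample name is the directory two levels above the file.
        sample=$(basename "$(dirname "$(dirname "$f")")")
        # head without options keeps the first 10 lines, as in the original.
        head -n 10 "$f" | awk -v s="$sample" '
            !NF     { next }                      # skip blank lines
            !seen++ { print s "\t" $0; next }     # first line: add sample name
            /^hpv/  { print "\t" $0; next }       # later hpv lines: empty column
                    { print }                     # anything else: unchanged
        '
    done
}

# Example (commented out): merge_stat . stat.xls > stat.xls
```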

5.5 Result file and the format description

Paths of the final result files:

data_stat: rawdata/data_stat.xls
stat: results/stat.xls
dedup.coverage: results/dedup.coverage
break_stat: results/break_stat.xls

Format description of the data_stat.xls

1st column is the sample id
2nd column is the Raw reads
3rd column is the Raw bases
4th column is the Raw Q20
5th column is the Raw Q30
6th column is the Clean reads
7th column is the Clean bases
8th column is the Clean ratio
9th column is the Clean Q20
10th column is the Clean Q30
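
As an illustration of the column layout, the sketch below pulls the sample id, clean reads (column 6) and clean ratio (column 8) out of data_stat.xls. The function name clean_summary is hypothetical, and the file is assumed to be tab-separated; if the file starts with a header row, that row is printed as well.

```shell
#!/bin/sh
# clean_summary: hypothetical helper, not part of VIPA.
# $1: path to data_stat.xls (assumed tab-separated, columns as listed above)
# Prints sample id, clean reads (column 6) and clean ratio (column 8).
clean_summary() {
    # Rows with fewer than 10 columns are skipped as malformed.
    awk -F '\t' 'NF >= 10 { print $1 "\t" $6 "\t" $8 }' "$1"
}

# Example (commented out): clean_summary rawdata/data_stat.xls
```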

Format description of the stat.xls

1st column is the HPV type
2nd column is the
3rd column is the reads that only mapped to human references
4th column is the percentage of reads that only mapped to human references
5th column is the unmapped reads
6th column is the percentage of unmapped reads
7th column is the HPV reads
8th column is the HPV reads
9th column is the HPV depth
10th column is the HPV mapping coverage of depth over 1X
11th column is the HPV mapping coverage of depth over 4X
12th column is the HPV mapping coverage of depth over 10X
13th column is the HPV ratio

Format description of the dedup.coverage

1st column is the sample id
2nd column is the HPV type
3rd column is the Unique reads
4th column is the reads that only mapped to human references
5th column is the percentage of reads that only mapped to human references
6th column is the unmapped reads
7th column is the percentage of unmapped reads
8th column is the HPV reads
9th column is the HPV depth
10th column is the HPV mapping coverage of depth over 1X
11th column is the HPV mapping coverage of depth over 4X
12th column is the HPV mapping coverage of depth over 10X
13th column is the HPV mapping coverage of depth over 30X
14th column is the HPV mapping coverage of depth over 500X
15th column is the HPV mapping coverage of depth over 1000X
16th column is the HPV mapping coverage of depth over 2000X
17th column is the HPV mapping coverage of depth over 5000X
18th column is the uniform marker for capture sequence data *(Optional)
19th column is the capture efficiency *(Optional)

Format description of the break_stat.xls

1st column is the sampleid_hpv type
2nd column is the HPV break points
3rd column is the HPV reads
