Skip to content

Freya-Cui-2020/VIPA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VIPA

1.Dependencies

The users should install the perl packages, including:

Pod::Usage
Getopt::Long
File::Spec
Cwd
Data::Dumper
List::Util
Bio::DB::Fasta
Bio::Seq

VIPA pipelines have serveral sofware dependencies, including:

perl (Version >=5.30.1)
fqtools
java (Version >=1.8.0_242)
trimmmatic (Version >=0.39)
bwa (Version >=0.7.17-r1188)
samtools (Version >=1.9)
blastn (Version >=2.8.1+)
muscle (Version >=3.8.31)
EMBOSS cons (Version >=6.6.0.0)

2.Installation

Download all the files into one folder

3.Prepare sequencing data

Create a rawdata directory to hold fq.gz files

4.Demo

We attached a demo files including both data and result generated by VIPA.

5.VIPA pipeline

5.1 Quality control

Go to the RAWDATA directory and run the following commands step-by-step

ls *.gz|awk '{print $0}'|awk -F '_' '{print ""$1""}'|sort -u >sample.txt
mkdir -p clean unpaired
cat sample.txt|awk '{print "nohup sh trimmomatic.sh "$1" >>"$1".log &"}' >run_trimmomatic.sh
cat sample.txt|awk '{print "nohup sh fqtools.sh "$1" >>"$1".log &"}' >run_fqtools.sh
sh run_trimmomatic.sh &
sh run_fqtools.sh

5.2 Prepare configuration

mkdir -p ../results

Go to the Results directory and prepare the configuration file config.txt

Run the command: perl prepare.pl-c config.txt

Description of config.txt file

  • raw_data_path: Cleandata path
  • result_path: Results the path
  • sample_suffix: The suffix of data, _R or _, _R represents _R1.fastq.gz and _R1.fq.gz, _ represents for _1.fastq.gzand _1.fq.gz
  • ref_merge: Path to the merged fasta of human and viruses
  • bwa_ref_path: bwa software pathway
  • blast_ref_path: blast software pathway
  • hsa_ref_type: human reference versions
  • suffix: the suffix format for data, fastq.gz or fq.gz, fastq.gz represents _R1.fastq.gz and _1.fastq.gz, fq.gz represents for _R1.fq.gz and _1.fq.gz
  • layout: Read mode and lengthes, e.g. SE100, PE150
  • threads: Number of threads required
  • sge: the default is False, when false is selected, the shell script will be delivered nohup, and when true, the shell script will be delivered qsub
  • maxvmem: the max memory requirement.
  • mem:the memory requirenment for each thread of bwa software
  • type: sequence type
  • method: if the library method is based on PCR amplificaion which does no remove duplilcaiton, fill in MIP. Otherwise, fill in RCA, which will remove duplilcaiton after mapping step.
  • EBV: Boolean, the default is False, when true is selected, the integrations located in the repeat regions of EBV will be removed
  • mode: multi multi will generate virus subtypes' integrations, dominate only generate the top one virus subtype's integrations
  • depth: The lowest depth for virus detection
  • coverage: The lowest coverage for virus detection
  • readcounts: The lowest readcounts for virus detection
  • readsupport: The lowest readsupport for virus integration sites
  • flanking: The flanking lengthes of human-viral junction sequences for integration pattern calculatuions
  • picard: The path to the picard jar software

5.3 A Step-to-step protocol of the VIPA pipeline

Then 1_out_all_pre.sh 2_run_all_pre.sh 3_out_all_pipe.sh and 5_work.sh are generated in the current directory

The shell scripts 1 through 5 are executed in order, with 4_run_all_pipe.sh being generated after 3_out_all_pipe.sh is executed

sh 1_out_all_pre.sh  
sh 2_run_all_pre.sh  
sh 3_out_all_pipe.sh  
sh 4_run_all_pipe.sh  
sh 5_work.sh  

Shell scripts 1-4 containes below contents

  • 1_out_all_pre.sh
perl pre_pipe.pl /results/*/pre.config         
Note: generate a_pre.sh
  • 2_run_all_pre.sh
sh /results/*/a_pre.sh   
Note:Execute a_pre.sh in each sample file and the generated files are stored in results/*/pre
  • 3_out_all_pipe.sh
perl out_pipe.pl /results/*/pipe.config    
Note:generate all.sh
  • 4_run_all_pipe.sh
sh /result/\*/hpv*/all.sh   
Note:Execute all.sh in all files containging HPV subfiles, e.g. sample168/hpv16/all.sh. 
This will generate files in the same directory as in all.sh, including below scripts:
  - b_align.sh  This is the alignment script
  - c_deal.sh   This is the script for soft-clip read extraction
  - d_assemble.sh  This is the script to generate junctional sequence
  - e_sdej.sh  This is the script to analyze the SD-EJ pathway
  - f_mh.sh  This is the script to analyze the Microhomologies

5.4 Statistic

perl stat_breakpoints.pl  
cd ../rawdata  
perl data_stat.pl $PWD sample.txt data.stat.xls  
cd ../results  
head */pre/stat.xls|sed ':t;N;s#/pre/stat.xls <==\n#\t#;b t'|sed 's/==> //'|sed '/^$/d'|sed 's/^hpv/\thpv/' >stat.xls  
ls */pre/*metrics|while read l;do a=${l%%/*};echo -ne "\n$a\t";grep "Unknown" $l|awk -F'\t' '{printf $7}';done >dedup.stat  
head */pre/dedup.coverage|sed ':t;N;s#/pre/dedup.coverage <==\n#\t#;b t'|sed 's/==> //'|sed '/^$/d'|sed 's/^hpv/\thpv/' >dedup.coverage  

5.5 Result file and the format description

The path of the files of final results:

The file of data_stat: rawdata/data_stat.xls
The file of stat:results/stat.xls
The file of dedup.coverage:results/dedup.coverage
The file of break_stat:results/break_stat.xls

Format description of the data_stat.xls

1st column is the sample id
2nd column is the Raw reads
3rd column is the Raw bases
4th column is the Raw Q20
5th column is the Raw Q30
6th column is the Clean reads
7th column is the Clean bases
8th column is the Clean ratio
9th column is the Clean Q20
10th column is the Clean Q30

Format description of the stat.xls

1th column is the HPV type
2nd column is the
3rd column is the reads that only mapped to human references
4th column is the pecentage of reads that only mapped to human references 5th column is the unmapped reads
6th column is the pecentage of unmapped reads 7th column is the HPV reads
8th column is the HPV reads
9th column is the HPV depth
10th column is the HPV mapping coverage of depeth over 1X 11th column is the HPV mapping coverage of depeth over 4X 12th column is the HPV mapping coverage of depeth over 10X
13th column is the HPV ratio

Format description of the dedup.coverage

1th column is the sample id
2nd column is the HPV type 3rd column is the Unique reads
4th column is the reads that only mapped to human references
5th column is the pecentage of reads that only mapped to human references
6th column is the unmapped reads
7th column is the pecentage of unmapped reads
8th column is the HPV reads
9th column is the HPV depth
10th column is the HPV mapping coverage of depeth over 1X
11th column is the HPV mapping coverage of depeth over 4X
12th column is the HPV mapping coverage of depeth over 10X
13th column is the HPV mapping coverage of depeth over 30X
14th column is the HPV mapping coverage of depeth over 500X
15th column is the HPV mapping coverage of depeth over 1000X
16th column is the HPV mapping coverage of depeth over 2000X
17th column is the HPV mapping coverage of depeth over 5000X
18th column is the uniform marker for capture sequence data *(Optional) 19th column is the capture efficiency *(Optional)

Format description of the break_stat.xls

1th column is the sampleid_hpv type
2nd column is the HPV break points
3rd column is the HPV reads

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages