Skip to content

Latest commit

 

History

History
103 lines (75 loc) · 5.79 KB

0.07_infer_variants.md

File metadata and controls

103 lines (75 loc) · 5.79 KB

Pangenome: Infer variants from the graph

Tags: #pangenome #pggb #vg 🏠 home


[! info] Purpose Infer variants from the graph topology.

Dispensable and private nodes embedded in the graph define variants between genomes. We will extract them using vg deconstruct.

https://github.com/vgteam/vg#calling-variants-from-paths-in-the-graph

vg deconstruct

for i in {01..19}; do for ref in $(cat vg/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.paths); do if [ -e deconstruct/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.on.${ref}_ae.vcf.done ] ; then echo "${ref} for ${i} already processed, skipping..."; else echo "vg deconstruct -p ${ref} -r vg/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.snarls -t 48 -v -a -e vg/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.xg > deconstruct/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.on.${ref}_ae.vcf && touch deconstruct/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.on.${ref}_ae.vcf.done"; fi; done; done > deconstruct_commands.sh

extract the LV=0

To get an overall picture rather than capturing all the nested variants inside variants, we will extract the top-level variation bubbles.

# get the header
for i in $(ls deconstruct/*_ae.vcf); do name=$(basename ${i} .vcf); echo "grep '#' ${i} > deconstruct/parse/${name}.lv0.vcf"; done > extract_headers.sh

parallel -j 48 :::: extract_headers.sh

# filter for LV=0
# replace empty field at the end of the line by '.'
# replace empty fields by '.'
# add the parsed content below the header extracted above
for i in $(ls deconstruct/*_ae.vcf); do name=$(basename ${i} .vcf); echo "grep 'LV=0' ${i} | sed 's:\t\$:\t.:g' | perl -pe 's/\t(?=\t)/\t./g' >> deconstruct/parse/${name}.lv0.vcf"; done > extract_lv0.sh

parallel -j 48 :::: extract_lv0.sh

norm

Before classing the variant sites into a category, we will convert the multi-allelic sites into bi-allelic sites. We also normalize to correct the position of the variant and simplify it if possible.

for i in $(ls deconstruct/parse/*lv0.vcf); do name=$(basename ${i} .vcf); ref_name=$(basename ${i} _ae.lv0.vcf | sed 's:^.*smooth.on.::g' | sed 's:.chr.*$::g' | sed 's:hap:Hap:g'); echo "bcftools norm --output deconstruct/parse/${name}.norm.vcf --output-type v --threads 4 -m-any --fasta-ref assembly/${ref_name}.fasta --check-ref w ${i} 2> logs/${name}.norm.err"; done > norm_vcf.sh

parallel -j 6 :::: norm_vcf.sh

filter for SNPs

# extract
## SNPs
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "bcftools view --output-file deconstruct/parse/${name}.snp.vcf --output-type v --threads 4 --types snps ${i} 2> logs/${name}.snp.err"; done > extract_snp.sh

parallel -j 6 :::: extract_snp.sh

filter for INDELs

## INDELs
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "bcftools view --output-file deconstruct/parse/${name}.indel.vcf --output-type v --threads 4 --types indels ${i} 2> logs/${name}.indel.err"; done > extract_indel.sh

parallel -j 6 :::: extract_indel.sh

filter for INS

## INDELs
for i in $(ls deconstruct/parse/*lv0.norm.indel.vcf); do name=$(basename ${i} .indel.vcf); echo "bcftools filter --include 'strlen(REF)<strlen(ALT)' --output deconstruct/parse/${name}.ins.vcf --output-type v --threads 4 ${i} 2> logs/${name}.ins.err"; done > extract_ins.sh

parallel -j 6 :::: extract_ins.sh

filter for DEL

## INDELs
for i in $(ls deconstruct/parse/*lv0.norm.indel.vcf); do name=$(basename ${i} .indel.vcf); echo "bcftools filter --include 'strlen(REF)>strlen(ALT)' --output deconstruct/parse/${name}.del.vcf --output-type v --threads 4 ${i} 2> logs/${name}.del.err"; done > extract_del.sh

parallel -j 6 :::: extract_del.sh

filter for MNPs

## MNPs
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "bcftools view --output-file deconstruct/parse/${name}.mnp.vcf --output-type v --threads 4 --types mnps ${i} 2> logs/${name}.mnp.err"; done > extract_mnp.sh

parallel -j 6 :::: extract_mnp.sh

filter for others

## other
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "grep -vwFf <(cat deconstruct/parse/${name}.snp.vcf deconstruct/parse/${name}.indel.vcf deconstruct/parse/${name}.mnp.vcf) ${i} > deconstruct/parse/${name}.other.vcf 2> logs/${name}.other.err"; done > extract_other.sh

parallel -j 2 :::: extract_other.sh

checkpoint on the number of variants after classification

The sum of bi.snp, bi.indel, bi.mnp, multi and other must be equal to the initial file

# summarize the nb
for i in $(ls deconstruct/parse/*ae.lv0.norm.other.vcf); do name=$(basename ${i} .other.vcf); if [ -e "deconstruct/parse/stats/${name}.stats_nb.txt" ]; then echo "${name} already done, skipping..."; else snp=$(grep -v '#' deconstruct/parse/${name}.snp.vcf | wc -l); ins=$(grep -v '#' deconstruct/parse/${name}.ins.vcf | wc -l); del=$(grep -v '#' deconstruct/parse/${name}.del.vcf | wc -l); mnp=$(grep -v '#' deconstruct/parse/${name}.mnp.vcf | wc -l); other=$(grep -v '#' deconstruct/parse/${name}.other.vcf | wc -l); tot=$(grep -v '#' deconstruct/parse/${name}.vcf | wc -l); full=$(cat deconstruct/parse/${name}.snp.vcf deconstruct/parse/${name}.ins.vcf deconstruct/parse/${name}.del.vcf deconstruct/parse/${name}.mnp.vcf deconstruct/parse/${name}.other.vcf | grep -v '#' | sort -u | wc -l); check=$(if [ "$full" == "$tot" ]; then echo "TRUE"; else echo "FALSE"; fi); echo -e "snp\t${snp}\nins\t${ins}\ndel\t${del}\nmnp\t${mnp}\nother\t${other}\nfull\t${full}\ntot\t${tot}\ncheck\t${check}" > deconstruct/parse/stats/${name}.stats_nb.txt; fi; done

Graph-inferred gene pangenome <- Previous Step | home | Next Step -> Annotate variants