Tags: #pangenome #pggb #vg 🏠 home
[! info] Purpose Infer variants from the graph topology.
Dispensable and private nodes embedded in the graph define variants between genomes. We will extract them using vg deconstruct.
https://github.com/vgteam/vg#calling-variants-from-paths-in-the-graph
for i in {01..19}; do for ref in $(cat vg/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.paths); do if [ -e deconstruct/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.on.${ref}_ae.vcf.done ] ; then echo "${ref} for ${i} already processed, skipping..."; else echo "vg deconstruct -p ${ref} -r vg/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.snarls -t 48 -v -a -e vg/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.xg > deconstruct/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.on.${ref}_ae.vcf && touch deconstruct/all.on.all.wfmash_s10000p85n1.chr${i}.seqwish_k49.smooth.on.${ref}_ae.vcf.done"; fi; done; done > deconstruct_commands.sh
To get an overall picture rather than capturing all the nested variants inside variants, we will extract the top-level variation bubbles.
# get the header
for i in $(ls deconstruct/*_ae.vcf); do name=$(basename ${i} .vcf); echo "grep '#' ${i} > deconstruct/parse/${name}.lv0.vcf"; done > extract_headers.sh
parallel -j 48 :::: extract_headers.sh
# filter for LV=0
# replace empty field at the end of the line by '.'
# replace empty fields by '.'
# add the parsed content below the header extracted above
for i in $(ls deconstruct/*_ae.vcf); do name=$(basename ${i} .vcf); echo "grep 'LV=0' ${i} | sed 's:\t\$:\t.:g' | perl -pe 's/\t(?=\t)/\t./g' >> deconstruct/parse/${name}.lv0.vcf"; done > extract_lv0.sh
parallel -j 48 :::: extract_lv0.sh
Before classing the variant sites into a category, we will convert the multi-allelic sites into bi-allelic sites. We also normalize to correct the position of the variant and simplify it if possible.
for i in $(ls deconstruct/parse/*lv0.vcf); do name=$(basename ${i} .vcf); ref_name=$(basename ${i} _ae.lv0.vcf | sed 's:^.*smooth.on.::g' | sed 's:.chr.*$::g' | sed 's:hap:Hap:g'); echo "bcftools norm --output deconstruct/parse/${name}.norm.vcf --output-type v --threads 4 -m-any --fasta-ref assembly/${ref_name}.fasta --check-ref w ${i} 2> logs/${name}.norm.err"; done > norm_vcf.sh
parallel -j 6 :::: norm_vcf.sh
# extract
## SNPs
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "bcftools view --output-file deconstruct/parse/${name}.snp.vcf --output-type v --threads 4 --types snps ${i} 2> logs/${name}.snp.err"; done > extract_snp.sh
parallel -j 6 :::: extract_snp.sh
## INDELs
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "bcftools view --output-file deconstruct/parse/${name}.indel.vcf --output-type v --threads 4 --types indels ${i} 2> logs/${name}.indel.err"; done > extract_indel.sh
parallel -j 6 :::: extract_indel.sh
## INDELs
for i in $(ls deconstruct/parse/*lv0.norm.indel.vcf); do name=$(basename ${i} .indel.vcf); echo "bcftools filter --include 'strlen(REF)<strlen(ALT)' --output deconstruct/parse/${name}.ins.vcf --output-type v --threads 4 ${i} 2> logs/${name}.ins.err"; done > extract_ins.sh
parallel -j 6 :::: extract_ins.sh
## INDELs
for i in $(ls deconstruct/parse/*lv0.norm.indel.vcf); do name=$(basename ${i} .indel.vcf); echo "bcftools filter --include 'strlen(REF)>strlen(ALT)' --output deconstruct/parse/${name}.del.vcf --output-type v --threads 4 ${i} 2> logs/${name}.del.err"; done > extract_del.sh
parallel -j 6 :::: extract_del.sh
## MNPs
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "bcftools view --output-file deconstruct/parse/${name}.mnp.vcf --output-type v --threads 4 --types mnps ${i} 2> logs/${name}.mnp.err"; done > extract_mnp.sh
parallel -j 6 :::: extract_mnp.sh
## other
for i in $(ls deconstruct/parse/*lv0.norm.vcf); do name=$(basename ${i} .vcf); echo "grep -vwFf <(cat deconstruct/parse/${name}.snp.vcf deconstruct/parse/${name}.indel.vcf deconstruct/parse/${name}.mnp.vcf) ${i} > deconstruct/parse/${name}.other.vcf 2> logs/${name}.other.err"; done > extract_other.sh
parallel -j 2 :::: extract_other.sh
The sum of bi.snp, bi.indel, bi.mnp, multi and other must be equal to the initial file
# summarize the nb
for i in $(ls deconstruct/parse/*ae.lv0.norm.other.vcf); do name=$(basename ${i} .other.vcf); if [ -e "deconstruct/parse/stats/${name}.stats_nb.txt" ]; then echo "${name} already done, skipping..."; else snp=$(grep -v '#' deconstruct/parse/${name}.snp.vcf | wc -l); ins=$(grep -v '#' deconstruct/parse/${name}.ins.vcf | wc -l); del=$(grep -v '#' deconstruct/parse/${name}.del.vcf | wc -l); mnp=$(grep -v '#' deconstruct/parse/${name}.mnp.vcf | wc -l); other=$(grep -v '#' deconstruct/parse/${name}.other.vcf | wc -l); tot=$(grep -v '#' deconstruct/parse/${name}.vcf | wc -l); full=$(cat deconstruct/parse/${name}.snp.vcf deconstruct/parse/${name}.ins.vcf deconstruct/parse/${name}.del.vcf deconstruct/parse/${name}.mnp.vcf deconstruct/parse/${name}.other.vcf | grep -v '#' | sort -u | wc -l); check=$(if [ "$full" == "$tot" ]; then echo "TRUE"; else echo "FALSE"; fi); echo -e "snp\t${snp}\nins\t${ins}\ndel\t${del}\nmnp\t${mnp}\nother\t${other}\nfull\t${full}\ntot\t${tot}\ncheck\t${check}" > deconstruct/parse/stats/${name}.stats_nb.txt; fi; done
Graph-inferred gene pangenome <- Previous Step | home | Next Step -> Annotate variants