Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

choose pileup call over full-alignment call if pileup call has 1) a way higher QUAL, 2) is an indel #318

Open
drtconway opened this issue Jul 2, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@drtconway
Copy link

Hi!

Thanks for making a good tool!

We have been doing testing and benchmarking with (clinical) trio data, and have found a corner case where the behaviour of the merge between the pileup calls and the full alignment call results in a real variant being discarded.

The pileup VCF file contains the row

chr12   132836482   .   A   ACTCACAGTGACAGGCTCCCAGCAGGGCGCACGGCACTCACAGTGACAGGCTCCCAGCACGGCGCACGGCACTCACAGTGACAGGCTCCCAGCACGGCGCTCGGCC  28.66   PASS    P   GT:GQ:DP:AD:AF:PL   1/1:28:32:3,29:0.9062:56,48,0

And the full alignment has the row

chr12   132836482   .   A   .   0.00    RefCall F   GT:GQ:DP:AD:AF:PL   0/0:0:32:3:0.0938:990

The VCF produced by MergeVcf outputs the RefCall, which causes the insertion to be dropped.

It's a bit of a corner case, since the position of insertion variants in a VCF is the base before the inserted sequence, and that is ref, but in this case, we do want to report the insertion!

I don't fully understand how the positions of the RefCall variants are determined, so I can't tell if this is a freak event, or if there is a systematic process that means these collisions are likely to occur elsewhere. If you could explain how these RefCall events are generated, that would help us understand how the caller works.

Looking at the code of MergeVcf (e.g.

full_alignment_output_set.add((ctg_name, pos))
and
if (ctg_name, pos) in full_alignment_output_set:
), I think I understand how the merge code is dropping the insertion event. I can think of a few possible ways to fix the code, but I haven't thought through all the semantics in detail, to work out what the best way to fix the code would be.

Again, thanks for making a good tool, and I hope this report helps you make it even better.

Tom.

@aquaskyline
Copy link
Member

Many thanks Tom. I have some ideas how to fix it but can't tell which one does less harm to other variant candidates. Is it possible that you send me a minibam containing only the reads covering chr12:132836482? I gonna look into the details.

@drtconway
Copy link
Author

I'll have to consult - as I indicated, it's a clinical sample.

The smallest expansion would be to track RefCall entries, either by adding to the tuple in the set, or perhaps keeping a separate set. Then, if the pileup variant is an indel, it "overrides" the refcall. As noted, indels are kind of special in the sense that they are recorded at a ref position immediately before the actual variant. If the variant from the full alignment is not a RefCall then it is probably ok to drop the variant from the pileup.

If you're keen, I would suggest refactoring the code to do a "classic" merge - iterating over the two VCFs at the same time. That way the cases where things collide will appear as a natural clause in the inner loop, and you don't need memory proportional to the size of the full alignment VCF.

@drtconway
Copy link
Author

Unfortunately, I can't give you a BAM with the reads.

@aquaskyline aquaskyline added the enhancement New feature or request label Jul 23, 2024
@aquaskyline aquaskyline changed the title RefCall overwrites variant choose pileup call over full-alignment call if pileup call has a way higher QUAL Jul 23, 2024
@aquaskyline aquaskyline changed the title choose pileup call over full-alignment call if pileup call has a way higher QUAL choose pileup call over full-alignment call if pileup call has 1) a way higher QUAL, 2) is an indel Jul 23, 2024
@drtconway
Copy link
Author

Just to update - we have found more instances of this problem. Unfortunately in clinical samples that we can't share. It's a significant problem for us, because in trios, it is causing some variants to appear to be de novo in the proband, when in fact, they are inherited, but the call has been "overwritten" in the parent.

@aquaskyline
Copy link
Member

Are all your instances called with good QUAL in pileup but not called in full-alignment? If yes, could you share with us the pileup QUAL and full-alignment QUAL of these instances? We need them to determine when to trust pileup indel calls over full-alignment calls.

@drtconway
Copy link
Author

Actually, the recent cases we encountered were actually even simpler. My colleague Jimmy is currently working on trying to reproduce the problem in HG002 (i.e. some publicly available data). I'll keep you posted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants