Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions: Improving PROB_GERMLINE for better FDR filtering. #424

Open
bwubb opened this issue Apr 9, 2024 · 7 comments
Open

Questions: Improving PROB_GERMLINE for better FDR filtering. #424

bwubb opened this issue Apr 9, 2024 · 7 comments

Comments

@bwubb
Copy link

bwubb commented Apr 9, 2024

Greetings,

I have a paired somatic pipeline and two of the callers it uses are vardict and varscan2, which will output germline calls as well. I wanted to include them into my VLR candidates because I figure it would distinguish them as PROB_GERMLINE and I could select and filter for that even in a separate output bcf. Most of our tumor/normal pairs have some sort of known inherent variant and we want to observe its presence and allele frequency in the tumor variant calling (usually for loh purposes).

My scenario grammar looks like: (contamination fraction is filled accurately for each sample pair)

samples:
  normal:
    resolution: 0.1
    universe: "0.0 | 0.5 | 1.0 | ]0.0,0.5["
  tumor:
    resolution: 0.01
    universe: "[0.0,1.0]"
    contamination:
      by: normal
      fraction: 0.0

expressions:
  ffpe: '(C>T | G>A) & ((tumor:0.0 & normal:]0.0,0.1]) | (tumor:]0.0,0.1] & normal:0.0))'

events:
  ffpe_artifact:
    '$ffpe'
  germline:
    'normal:0.5 | normal:1.0'
  somatic_normal:
    'normal:]0.0,0.5['
  somatic_tumor:
    'normal:0.0 & tumor:]0.0,1.0] & !$ffpe'

When it comes to the FDR filtering value I need to set it very high (~0.2-0.3) in order to capture these known germline variants in the output for PROB_GERMLINE. Ive repeated this in two separate datasets/tumor types each with ~50 pairs at least. When I look at the probability values at these known variant sites, PROB_GERMLINE is the the lowest value among all the evaluated events, but the values themselves seems very high.

I guess because they are "known to be real" and have good coverage I was expecting the model to be more confident and I could drop the FDR value down to 0.05. I am trying to understand if my grammar could be improved? Is VLR less suited for making GERMLINE calls in tumor/normal pairs? Maybe I am misunderstanding something else. Thank you

-bwubb

@bwubb bwubb closed this as completed Apr 24, 2024
@bwubb bwubb reopened this Sep 10, 2024
@bwubb
Copy link
Author

bwubb commented Sep 10, 2024

Oh I did not realize I had closed this! I still have many questions trying to improve my scenario grammar. Was there some detailed documentation that I am missing?

I do not understand the strictness of "germline: 'normal:0.5 | normal:1.0'" and I think it is causing me grief. I am observing some known germline variants at a freq as low as 0.40 due to sequencing coverage. It feels like I should drop somatic_normal. I want to understand what difference something like germline: 'normal:[0.4,0.6] | normal:1.0 would' make.

Since this is a tumor/normal study I will have reads for germline variants in the tumor. Should I be including that in the event?

I suppose I could guess and test, but I was hoping to gain some insight. Thank you.

@bwubb
Copy link
Author

bwubb commented Sep 10, 2024

Is there a way I can provide some prior knowledge to improve the model and potentially yield a more accurate posterior?

I have some variants called by Mutect2 and passing its own filtering voodoo.
Its an AD of [72,5] and a G>A change and therefore could absolutely be an FFPE artifact and varlociraptor drops its AF to 0. But its a known clinically pathogenic variant.

I have some level of allele frequency information about it from the most recent ClinVar vcf, but Im not sure how I can use it. And I would want to do this for all pathogenic variants in my tumor type.

@johanneskoester
Copy link
Contributor

Very good point. We can indeed add functionality to Varlociraptor to consider a population based prior per variant. So far, it only supports global heterozygosity or somatic mutation rate.

@dlaehnemann
Copy link
Member

dlaehnemann commented Sep 11, 2024

Oh I did not realize I had closed this! I still have many questions trying to improve my scenario grammar. Was there some detailed documentation that I am missing?

As for some more detailed documentation, I just want to make sure you are aware of these two sources:

  1. The Generic variant calling section of the varlociraptor docs should explain all the details of the scenario. If you have particular questions about parts of it, or find that something is missing, this would be important feedback for improving this!
  2. There is a whole catalogue of example scenarios for variant calling in the docs. Maybe looking through those examples also helps clear some things up.

I do not understand the strictness of "germline: 'normal:0.5 | normal:1.0'" and I think it is causing me grief. I am observing some known germline variants at a freq as low as 0.40 due to sequencing coverage. It feels like I should drop somatic_normal. I want to understand what difference something like germline: 'normal:[0.4,0.6] | normal:1.0 would' make.

There are some things in your above scenario, where I think changes might improve things:

[ Point 2 edited and point 4 removed after a conversation with Johannes. Removal of 4, because I was actually mixing the concepts of the maximum likelihood allele frequency with the probability of events. Event definitions will only affect the latter, and not the AF reported. ]

  1. The contamination: fraction: 0.0 seems unusual. This implies that you have a pure tumor sample, so no contaminating normal cells. This should not really affect the allele frequency of germline variants, but probably makes sense adjusting, if you have at least a rough idea of contamination / purity.
  2. I think the universe: "0.0 | 0.5 | 1.0 | ]0.0,0.5[" was simply in the default config/scenario.yaml when you deploy the workflow, right? This so far excludes the ]0.5,1.0[ range without any clear justification. The rough justification for this exclusion is that getting a variant in this range in the normal sample is very unlikely (either a loss of heterozygosity towards the alternative allele in some of the normal cells, or a ref->alt mutation of the ref allele at a heterozygous germline site). Anyway, I would suggest to switch to universe: "[0.0,1.0]" for the normal sample and try to make less of a distinction between germline and somatic variants in the normal sample, as sampling variance will probably mess with the clear distinction between those cases, anyways. I also suggest as much in a PR adjusting the default scenario.yaml accordingly. Not sure how this might influence your specific germline variant, but this will change the overall distribution and might thus have an indirect effect. Also, we are not yet sure if there is a resolution for correctly working with the full universe that works across the board, as excluding the ]0.5,1.0[ can also reduce a number of false positives. But Johannes' suggestion of variant-specific priors instead of general heterozygosity priors might already be a big step forward for this, especially in clinical settings.
  3. One thing that should definitely change the allele frequency estimations is increasing the normal sample resolution. From resolution: 0.1 to resolution: 0.01. The advantage of a lower resolution is that the model has to evaluate fewer points and will run more quickly. But the downside is that it will not evaluate points in between the resolution grid, but will instead "concentrate" the respective probability mass at these grid points (not sure I'm using the correct statistics lingo here, I am glad to be corrected). So having more grid points can avoid the maximum likelihood estimation jumping to 0.4, if the "real maximumg" is 0.44, for example. So having a bit more resolution can be helpful to avoid funny behaviour, but I am also not sure if this would salvage your particular germline variant.

Since this is a tumor/normal study I will have reads for germline variants in the tumor. Should I be including that in the event?

Any germline variant should also be in the tumor, unless it is "mutated away", right? So if you have a germline variant of frequency 0.5, you also expect it to be at 0.5 in the tumor, and thus to be unaffected by purity / contamination values. Just trying to recap my thinking, to make sure we are on the same page.

I suppose I could guess and test, but I was hoping to gain some insight. Thank you.

Yeah, a bit of systematic testing is probably a good idea. If you just want to play around with the one specific germline variant you mentioned, you might want to create a testcase with just this variant. varlociraptor has built-in functionality to automate this testcase generation:
https://varlociraptor.github.io/docs/reporting-issues/

And then, you can just rerun the testcase lots of time, modifying the scenario.yaml step by step.

@johanneskoester
Copy link
Contributor

Here is a PR that starts to implement per-variant priors: #448

@bwubb
Copy link
Author

bwubb commented Sep 16, 2024

Thank you @dlaehnemann.
This is very helpful. I did have a lot of default or example scenario things copied into mine. Im starting to understand how they are affecting my output values.

The contamination is just a placeholder. It it opened and rewritten with the estimated values, per tumor/normal pair, during my pipeline execution. But thank you for the concern.

@bwubb
Copy link
Author

bwubb commented Sep 16, 2024

@johanneskoester Thank you I look forward to this feature. If Im understanding your comments correctly you are aiming for a specific INFO field? Would this require any change to the scenario grammar?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants