Questions: Improving PROB_GERMLINE for better FDR filtering. #424

bwubb · 2024-04-09T01:14:26Z

Greetings,

I have a paired somatic pipeline and two of the callers it uses are vardict and varscan2, which will output germline calls as well. I wanted to include them into my VLR candidates because I figure it would distinguish them as PROB_GERMLINE and I could select and filter for that even in a separate output bcf. Most of our tumor/normal pairs have some sort of known inherent variant and we want to observe its presence and allele frequency in the tumor variant calling (usually for loh purposes).

My scenario grammar looks like: (contamination fraction is filled accurately for each sample pair)

samples:
  normal:
    resolution: 0.1
    universe: "0.0 | 0.5 | 1.0 | ]0.0,0.5["
  tumor:
    resolution: 0.01
    universe: "[0.0,1.0]"
    contamination:
      by: normal
      fraction: 0.0

expressions:
  ffpe: '(C>T | G>A) & ((tumor:0.0 & normal:]0.0,0.1]) | (tumor:]0.0,0.1] & normal:0.0))'

events:
  ffpe_artifact:
    '$ffpe'
  germline:
    'normal:0.5 | normal:1.0'
  somatic_normal:
    'normal:]0.0,0.5['
  somatic_tumor:
    'normal:0.0 & tumor:]0.0,1.0] & !$ffpe'

When it comes to the FDR filtering value I need to set it very high (~0.2-0.3) in order to capture these known germline variants in the output for PROB_GERMLINE. Ive repeated this in two separate datasets/tumor types each with ~50 pairs at least. When I look at the probability values at these known variant sites, PROB_GERMLINE is the the lowest value among all the evaluated events, but the values themselves seems very high.

I guess because they are "known to be real" and have good coverage I was expecting the model to be more confident and I could drop the FDR value down to 0.05. I am trying to understand if my grammar could be improved? Is VLR less suited for making GERMLINE calls in tumor/normal pairs? Maybe I am misunderstanding something else. Thank you

-bwubb

The text was updated successfully, but these errors were encountered:

bwubb · 2024-09-10T13:25:28Z

Oh I did not realize I had closed this! I still have many questions trying to improve my scenario grammar. Was there some detailed documentation that I am missing?

I do not understand the strictness of "germline: 'normal:0.5 | normal:1.0'" and I think it is causing me grief. I am observing some known germline variants at a freq as low as 0.40 due to sequencing coverage. It feels like I should drop somatic_normal. I want to understand what difference something like germline: 'normal:[0.4,0.6] | normal:1.0 would' make.

Since this is a tumor/normal study I will have reads for germline variants in the tumor. Should I be including that in the event?

I suppose I could guess and test, but I was hoping to gain some insight. Thank you.

bwubb · 2024-09-10T14:15:22Z

Is there a way I can provide some prior knowledge to improve the model and potentially yield a more accurate posterior?

I have some variants called by Mutect2 and passing its own filtering voodoo.
Its an AD of [72,5] and a G>A change and therefore could absolutely be an FFPE artifact and varlociraptor drops its AF to 0. But its a known clinically pathogenic variant.

I have some level of allele frequency information about it from the most recent ClinVar vcf, but Im not sure how I can use it. And I would want to do this for all pathogenic variants in my tumor type.

johanneskoester · 2024-09-11T07:18:22Z

Very good point. We can indeed add functionality to Varlociraptor to consider a population based prior per variant. So far, it only supports global heterozygosity or somatic mutation rate.

dlaehnemann · 2024-09-11T09:41:24Z

Oh I did not realize I had closed this! I still have many questions trying to improve my scenario grammar. Was there some detailed documentation that I am missing?

As for some more detailed documentation, I just want to make sure you are aware of these two sources:

The Generic variant calling section of the varlociraptor docs should explain all the details of the scenario. If you have particular questions about parts of it, or find that something is missing, this would be important feedback for improving this!
There is a whole catalogue of example scenarios for variant calling in the docs. Maybe looking through those examples also helps clear some things up.

I do not understand the strictness of "germline: 'normal:0.5 | normal:1.0'" and I think it is causing me grief. I am observing some known germline variants at a freq as low as 0.40 due to sequencing coverage. It feels like I should drop somatic_normal. I want to understand what difference something like germline: 'normal:[0.4,0.6] | normal:1.0 would' make.

There are some things in your above scenario, where I think changes might improve things:

[ Point 2 edited and point 4 removed after a conversation with Johannes. Removal of 4, because I was actually mixing the concepts of the maximum likelihood allele frequency with the probability of events. Event definitions will only affect the latter, and not the AF reported. ]

The contamination: fraction: 0.0 seems unusual. This implies that you have a pure tumor sample, so no contaminating normal cells. This should not really affect the allele frequency of germline variants, but probably makes sense adjusting, if you have at least a rough idea of contamination / purity.
I think the universe: "0.0 | 0.5 | 1.0 | ]0.0,0.5[" was simply in the default config/scenario.yaml when you deploy the workflow, right? This so far excludes the ]0.5,1.0[ range without any clear justification. The rough justification for this exclusion is that getting a variant in this range in the normal sample is very unlikely (either a loss of heterozygosity towards the alternative allele in some of the normal cells, or a ref->alt mutation of the ref allele at a heterozygous germline site). Anyway, I would suggest to switch to universe: "[0.0,1.0]" for the normal sample and try to make less of a distinction between germline and somatic variants in the normal sample, as sampling variance will probably mess with the clear distinction between those cases, anyways. I also suggest as much in a PR adjusting the default scenario.yaml accordingly. Not sure how this might influence your specific germline variant, but this will change the overall distribution and might thus have an indirect effect. Also, we are not yet sure if there is a resolution for correctly working with the full universe that works across the board, as excluding the ]0.5,1.0[ can also reduce a number of false positives. But Johannes' suggestion of variant-specific priors instead of general heterozygosity priors might already be a big step forward for this, especially in clinical settings.
One thing that should definitely change the allele frequency estimations is increasing the normal sample resolution. From resolution: 0.1 to resolution: 0.01. The advantage of a lower resolution is that the model has to evaluate fewer points and will run more quickly. But the downside is that it will not evaluate points in between the resolution grid, but will instead "concentrate" the respective probability mass at these grid points (not sure I'm using the correct statistics lingo here, I am glad to be corrected). So having more grid points can avoid the maximum likelihood estimation jumping to 0.4, if the "real maximumg" is 0.44, for example. So having a bit more resolution can be helpful to avoid funny behaviour, but I am also not sure if this would salvage your particular germline variant.

Since this is a tumor/normal study I will have reads for germline variants in the tumor. Should I be including that in the event?

Any germline variant should also be in the tumor, unless it is "mutated away", right? So if you have a germline variant of frequency 0.5, you also expect it to be at 0.5 in the tumor, and thus to be unaffected by purity / contamination values. Just trying to recap my thinking, to make sure we are on the same page.

I suppose I could guess and test, but I was hoping to gain some insight. Thank you.

Yeah, a bit of systematic testing is probably a good idea. If you just want to play around with the one specific germline variant you mentioned, you might want to create a testcase with just this variant. varlociraptor has built-in functionality to automate this testcase generation:
https://varlociraptor.github.io/docs/reporting-issues/

And then, you can just rerun the testcase lots of time, modifying the scenario.yaml step by step.

johanneskoester · 2024-09-11T12:10:13Z

Here is a PR that starts to implement per-variant priors: #448

bwubb · 2024-09-16T19:51:01Z

Thank you @dlaehnemann.
This is very helpful. I did have a lot of default or example scenario things copied into mine. Im starting to understand how they are affecting my output values.

The contamination is just a placeholder. It it opened and rewritten with the estimated values, per tumor/normal pair, during my pipeline execution. But thank you for the concern.

bwubb · 2024-09-16T19:56:15Z

@johanneskoester Thank you I look forward to this feature. If Im understanding your comments correctly you are aiming for a specific INFO field? Would this require any change to the scenario grammar?

bwubb closed this as completed Apr 24, 2024

bwubb reopened this Sep 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Questions: Improving PROB_GERMLINE for better FDR filtering. #424

Questions: Improving PROB_GERMLINE for better FDR filtering. #424

bwubb commented Apr 9, 2024

bwubb commented Sep 10, 2024

bwubb commented Sep 10, 2024

johanneskoester commented Sep 11, 2024

dlaehnemann commented Sep 11, 2024 •

edited

Loading

johanneskoester commented Sep 11, 2024

bwubb commented Sep 16, 2024

bwubb commented Sep 16, 2024

Questions: Improving PROB_GERMLINE for better FDR filtering. #424

Questions: Improving PROB_GERMLINE for better FDR filtering. #424

Comments

bwubb commented Apr 9, 2024

bwubb commented Sep 10, 2024

bwubb commented Sep 10, 2024

johanneskoester commented Sep 11, 2024

dlaehnemann commented Sep 11, 2024 • edited Loading

johanneskoester commented Sep 11, 2024

bwubb commented Sep 16, 2024

bwubb commented Sep 16, 2024

dlaehnemann commented Sep 11, 2024 •

edited

Loading