Filtering reads with ambiguous content #213

standage · 2018-02-23T17:29:08Z

Our current handling of reads with ambiguous content is as follows.

For counting, kevlar uses khmer's default bulk loading behavior, which (I think) is to ignore all k-mers with ambiguous content. Or it might not actually "handle" ambiguous characters at all, since MurmurHash will happily hash any arbitrary input.
For finding novel k-mers, kevlar discards any read containing non-[ACGT] characters.
Now that mate sequences are retained along with a novel read, no checks for ambiguous content are made on mate sequences at any step.
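
To make the second point concrete, the check amounts to something like the sketch below. This is not kevlar's actual code; the name `has_ambiguous` is made up for illustration, and the match is assumed to be case-sensitive.

```python
import re

# Hypothetical helper for illustration only, not kevlar's implementation.
# Anything other than uppercase A, C, G, or T counts as ambiguous.
AMBIGUOUS = re.compile(r'[^ACGT]')


def has_ambiguous(sequence):
    """Return True if the sequence contains any non-[ACGT] character."""
    return AMBIGUOUS.search(sequence) is not None
```

Under this sketch, a read with a single N is dropped entirely, and nothing equivalent is applied to its mate.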

I'd suggest the following.

Write some tests to verify how reads/k-mers are handled in bulk loading.
Consider setting(s) that let a user specify a maximum number or proportion (or both) of ambiguous nucleotides per read. Reads within the limit would be split on ambiguous nucleotides, and we'd then look for interesting k-mers in the resulting fragments with length ≥ k.
Apply a similar setting (could be the same setting) to mate sequences: only retain mates that satisfy some count/proportion criteria for ambiguous nucleotides. We don't want to try to map reads with tons of Ns.
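
A minimal sketch of what the last two suggestions could look like, to help pin down the semantics. All names here (`passes_ambiguity_filter`, `unambiguous_fragments`, `max_count`, `max_prop`) are hypothetical and not part of kevlar or khmer, and treating every non-[ACGT] character as ambiguous is an assumption.

```python
import re


def count_ambiguous(sequence):
    """Count characters other than A, C, G, and T."""
    return sum(1 for base in sequence if base not in 'ACGT')


def passes_ambiguity_filter(sequence, max_count=None, max_prop=None):
    """Check a read (or mate) against optional count/proportion limits.

    Either limit may be None, in which case it is not enforced.
    Empty sequences are rejected.
    """
    if not sequence:
        return False
    n = count_ambiguous(sequence)
    if max_count is not None and n > max_count:
        return False
    if max_prop is not None and n / len(sequence) > max_prop:
        return False
    return True


def unambiguous_fragments(sequence, ksize):
    """Split on runs of ambiguous characters; keep fragments of length >= ksize."""
    fragments = re.split(r'[^ACGT]+', sequence)
    return [frag for frag in fragments if len(frag) >= ksize]


read = 'ACGTNNACGTACGTNAC'
if passes_ambiguity_filter(read, max_count=3):
    # Only fragments long enough to contain at least one k-mer survive.
    fragments = unambiguous_fragments(read, ksize=4)
```

The same `passes_ambiguity_filter` call could be applied to mates before retaining them, which would cover the "could be the same setting" case.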


standage added enhancement accuracy labels Feb 23, 2018
