Our current handling of reads with ambiguous content is as follows.
For counting, kevlar uses khmer's default bulk loading behavior, which I think is to ignore all k-mers with ambiguous content. Or it might not actually "handle" ambiguous characters at all, since MurmurHash will happily hash any arbitrary input.
For finding novel k-mers, kevlar discards any read containing non-[ACGT] characters.
Now that mate sequences are retained along with a novel read, no checks for ambiguous content are made on mate sequences at any step.
I'd suggest the following.
Write some tests to verify how reads/k-mers are handled in bulk loading.
Consider adding setting(s) that let a user specify a maximum number or proportion (or both) of ambiguous nucleotides in a read. Reads within the limit would be split on ambiguous nucleotides, and we'd then look for interesting k-mers in the resulting fragments with length ≥ k.
Apply a similar setting (could be the same setting) to mate sequences: only retain mates that satisfy some count/proportion criteria for ambiguous nucleotides. We don't want to try to map reads with tons of Ns.
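To make the last two suggestions concrete, here is a rough sketch of what the filter and the splitting step might look like. Everything in it is hypothetical: the function names and the `max_count` / `max_prop` parameters are placeholders for discussion, not existing kevlar or khmer APIs, and it assumes anything outside [ACGT] counts as ambiguous.

```python
import re

# Assumption: any character other than A, C, G, T is "ambiguous".
AMBIGUOUS_RUN = re.compile(r'[^ACGT]+')


def count_ambiguous(seq):
    """Count the characters in seq that are not A, C, G, or T."""
    return sum(1 for base in seq.upper() if base not in 'ACGT')


def passes_ambiguity_filter(seq, max_count=None, max_prop=None):
    """Check a read or mate against hypothetical count/proportion limits.

    A limit of None means that criterion is not applied.
    """
    n_ambig = count_ambiguous(seq)
    if max_count is not None and n_ambig > max_count:
        return False
    if max_prop is not None and seq and n_ambig / len(seq) > max_prop:
        return False
    return True


def unambiguous_fragments(seq, k):
    """Split seq on runs of ambiguous characters; keep fragments of length >= k."""
    return [frag for frag in AMBIGUOUS_RUN.split(seq.upper()) if len(frag) >= k]
```

With something like this, a read would first be checked with `passes_ambiguity_filter`, then split with `unambiguous_fragments` so that novel k-mer detection only sees clean fragments. The same filter (possibly with different thresholds) could be applied to mate sequences before they are retained.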