
Filtering reads with ambiguous content #213

Open
3 tasks
standage opened this issue Feb 23, 2018 · 0 comments

Our current handling of reads with ambiguous content is as follows.

  • For counting, kevlar uses khmer's default bulk loading behavior, which I believe ignores all k-mers with ambiguous content. It's also possible that ambiguous characters aren't "handled" at all, since MurmurHash will happily hash any arbitrary input.
  • For finding novel k-mers, kevlar discards any read containing non-[ACGT] characters.
  • Now that mate sequences are retained along with a novel read, no checks for ambiguous content are made on mate sequences at any step.
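As a rough illustration of the second point, here is a minimal sketch of an "all or nothing" read filter. This is not kevlar's actual code: the helper name `has_ambiguous` and the choice to uppercase the sequence before checking are assumptions made for the example.

```python
import re

# Anything outside A, C, G, T counts as ambiguous (N, IUPAC codes, etc.).
AMBIGUOUS = re.compile(r'[^ACGT]')


def has_ambiguous(sequence):
    """Return True if the sequence contains any non-[ACGT] character.

    Hypothetical helper: the sequence is uppercased first, which is an
    assumption and may not match how kevlar treats soft-masked bases.
    """
    return AMBIGUOUS.search(sequence.upper()) is not None


reads = ['ACGTACGT', 'ACGTNCGT', 'acgtacgt', 'ACGTRYGT']
# Discard the entire read if even one ambiguous character is present.
kept = [read for read in reads if not has_ambiguous(read)]
```

A single N is enough to throw away the whole read here, which is the behavior the suggestions below would relax.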

I'd suggest the following.

  • Write some tests to verify how reads/k-mers are handled in bulk loading.
  • Consider adding setting(s) that let a user specify a maximum number or proportion (or both) of ambiguous nucleotides in a read. Reads that pass would be split on ambiguous nucleotides, and kevlar would then look for interesting k-mers in the resulting fragments with length ≥ k.
  • Apply a similar setting (could be the same setting) to mate sequences: only retain mates that satisfy some count/proportion criteria for ambiguous nucleotides. We don't want to try to map reads with tons of Ns.
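
The last two suggestions could look something like the sketch below. None of this exists in kevlar: the function names `ambiguity_ok` and `unambiguous_fragments` and the parameters `max_count` and `max_prop` are hypothetical, and treating every non-[ACGT] character as ambiguous (after uppercasing) is an assumption.

```python
import re

AMBIGUOUS = re.compile(r'[^ACGT]')


def ambiguity_ok(sequence, max_count=None, max_prop=None):
    """Check a read or mate against ambiguity thresholds (hypothetical).

    max_count: maximum number of ambiguous characters allowed, or None.
    max_prop: maximum fraction of ambiguous characters allowed, or None.
    If both are given, the sequence must satisfy both.
    """
    n_ambig = len(AMBIGUOUS.findall(sequence.upper()))
    if max_count is not None and n_ambig > max_count:
        return False
    if max_prop is not None and sequence and n_ambig / len(sequence) > max_prop:
        return False
    return True


def unambiguous_fragments(sequence, k):
    """Split on runs of ambiguous characters; keep fragments of length >= k."""
    fragments = re.split(r'[^ACGT]+', sequence.upper())
    return [frag for frag in fragments if len(frag) >= k]


read = 'ACGTNACGTACNNAC'  # 15 bp, 3 ambiguous characters
if ambiguity_ok(read, max_count=3):
    # Only these fragments would be scanned for interesting k-mers.
    fragments = unambiguous_fragments(read, k=4)
```

The same `ambiguity_ok` check could be applied to mate sequences before retaining them, so that mates with lots of Ns never reach the mapping step.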