more generic repeat annotation retrieval #9

Open
dlaehnemann opened this issue Apr 12, 2023 · 1 comment

Comments

@dlaehnemann
Contributor

My current best and quickest solution for retrieval of repeat annotations is to make the RepeatMasker download link configurable via the config.yaml. However, this has several restrictions:

  1. It only works for the species and genome builds available on the RepeatMasker website, either through the species tree view or the species list view.
  2. It easily gets out of sync with the Ensembl reference species, build and release. These are also specified in the config.yaml file, right before the link spec. But who reads through those things in detail...

However, the download links for RepeatMasker do not seem systematic: species names are sometimes abbreviated (mm for Mus musculus, hg for Homo sapiens) and sometimes not (for example bosTau), and only certain species are available for certain releases of RepeatMasker and DFAM. So a somewhat systematic download rule with only meta-information provided in the config.yaml (and partly drawing on the Ensembl reference definitions) will not work.
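To illustrate why the names cannot be derived by rule, and how the sync problem from restriction 2 could at least be detected, here is a minimal sketch. The lookup table, the function name and the idea of checking the file name prefix are all assumptions for illustration and are not part of the workflow. The table would have to be maintained by hand, which is exactly the limitation described above.

```python
from typing import Optional

# Hypothetical, hand-maintained lookup from (Ensembl species, build) to the
# abbreviated assembly name that prefixes the annotation file name.
ASSEMBLY_NAMES = {
    ("homo_sapiens", "GRCh38"): "hg38",
    ("mus_musculus", "GRCm39"): "mm39",
    ("bos_taurus", "ARS-UCD1.2"): "bosTau9",
}


def link_matches_reference(species: str, build: str, link: str) -> Optional[bool]:
    """Check whether a configured link fits the configured Ensembl reference.

    Returns True or False when the reference is in the lookup table, and
    None when it is unknown and no statement can be made.
    """
    expected = ASSEMBLY_NAMES.get((species, build))
    if expected is None:
        return None
    # Only the file name is inspected, for example "hg38.fa.out.gz".
    file_name = link.rstrip("/").rsplit("/", 1)[-1]
    return file_name.startswith(expected + ".")


if __name__ == "__main__":
    # example.org is a placeholder, not a real download location.
    link = "https://example.org/genomes/hg38/hg38.fa.out.gz"
    print(link_matches_reference("homo_sapiens", "GRCh38", link))
    print(link_matches_reference("mus_musculus", "GRCm39", link))
```

A check like this could run at workflow start and warn when the link and the Ensembl settings disagree, but it does not remove the need to look up each link manually.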

An alternative would be to have a little RepeatMasker workflow with rules that:

  1. Download a specified version (for example 3.7) of the necessary DFAM transposable element specification (for example the Dfam_curatedonly.h5.gz).
  2. Run RepeatMasker on the workflow's Ensembl reference genome using this DFAM resource and generate the species.fa.out.gz files.
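The two steps above could be sketched as plain Python helpers that a Snakemake rule might call. This is only a rough sketch under assumptions: the URL pattern is my reading of the Dfam release layout and should be verified, the function names are hypothetical, and the command assumes the downloaded Dfam file has already been decompressed and installed where RepeatMasker expects its libraries. Compressing the resulting `.out` file to `.out.gz` would be a separate step.

```python
from typing import List


def dfam_url(version: str, file_name: str = "Dfam_curatedonly.h5.gz") -> str:
    """Build a download URL for one Dfam release file.

    The URL pattern is an assumption and should be checked against the
    Dfam releases page before use.
    """
    return f"https://www.dfam.org/releases/Dfam_{version}/families/{file_name}"


def repeatmasker_command(
    genome_fasta: str, species: str, threads: int = 1, out_dir: str = "."
) -> List[str]:
    """Assemble a RepeatMasker call as an argument list for subprocess.run.

    -species selects the repeat families for a clade, -pa sets the number
    of parallel jobs and -dir sets the output directory.
    """
    return [
        "RepeatMasker",
        "-species", species,
        "-pa", str(threads),
        "-dir", out_dir,
        genome_fasta,
    ]


if __name__ == "__main__":
    print(dfam_url("3.7"))
    print(" ".join(repeatmasker_command("genome.fa", "mouse", threads=4, out_dir="results")))
```

Returning an argument list instead of a shell string avoids quoting problems with species names that contain spaces.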

However, this seems like a slightly excessive amount of downloading and work, especially if one does not want to restrict the annotation to the curated set (the full dfam.h5.gz of version 3.7 is almost 90 GB). It would probably warrant something like a snakemake meta-wrapper. So I'll leave this as possible future work, in case this workflow really gets applied more often and on non-standard species.

dlaehnemann added a commit that referenced this issue Apr 12, 2023
…tional (however, this is not fully automated -- for a solution to this, ideas are in #9)
@dlaehnemann
Contributor Author

The current "best solution" is in:

4c44163
