The download of reference genomes #84

Open
lipumpkin opened this issue Mar 22, 2022 · 2 comments

lipumpkin commented Mar 22, 2022

Hi Professor fasnicar,
I have a question about the -g option of phylophlan_get_reference.
I downloaded reference genomes for the genus Acinetobacter with this command (phylophlan_get_reference -g g__Acinetobacter -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log) and ended up with 227 genomes for this genus. The file assembly_summary_genbank.txt shows that over 10,000 species belong to the genus Acinetobacter. I then tried the same command with -n 300 and ended up with 806 genomes.
On what basis were these 227 or 806 genomes selected? And do they include all child taxa (species) of the genus with validly published names?
Thanks

@fasnicar fasnicar self-assigned this Mar 29, 2022
@fasnicar
Collaborator

Hi, the -n parameter is an upper limit ("up to") applied to each species separately. As an example, assume you run a variant of the command you reported above with -n 5:

phylophlan_get_reference -g g__Acinetobacter -o input_genomes/ -n 5

Then up to 5 genomes will be downloaded for each species listed under g__Acinetobacter.
Again for the sake of the example, assume there are only 3 species, each listed with its number of available genomes:

g__Acinetobacter|s__species_1    3
g__Acinetobacter|s__species_2    15
g__Acinetobacter|s__species_3    6

In total there are 24 genomes, but you end up downloading only 13: all 3 from s__species_1 (it has fewer than 5) and 5 each from s__species_2 and s__species_3.
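The "up to -n per species" rule from this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not PhyloPhlAn's actual code; the function name expected_downloads and the dictionary layout are made up for the example.

```python
def expected_downloads(genomes_per_species, n):
    """Total genomes downloaded when each species is capped at n genomes."""
    return sum(min(count, n) for count in genomes_per_species.values())


# Hypothetical species and counts, taken from the example above.
available = {"s__species_1": 3, "s__species_2": 15, "s__species_3": 6}

print(sum(available.values()))           # 24 genomes available in total
print(expected_downloads(available, 5))  # 13 genomes downloaded with -n 5
```

With n=1 the same function returns 3, one genome per species, which is the pattern behind the 227 genomes obtained with -n 1.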

Now, if you check phylophlan_get_reference -l | grep "g__Acinetobacter" | less -S you'll find:

k__Bacteria|p__Proteobacteria|[..]|f__Moraxellaceae|g__Acinetobacter       227     2984

The above means that there are 227 species listed under g__Acinetobacter and that 2984 genomes can be retrieved in total. So it makes sense that you downloaded 227 genomes with -n 1 (one per species) and 806 with -n 300: s__Acinetobacter_baumannii alone has 2478 genomes and is capped at 300, while the other 226 species account for the remaining 506 genomes (2984 - 2478), giving 300 + 506 = 806.
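As a quick sanity check, the two totals can be recomputed from the listing line shown above. The sketch below assumes the line is whitespace-separated with the species count and genome count as the last two columns (as displayed here), and that no species other than s__Acinetobacter_baumannii has more than 300 genomes; parse_listing_line is a hypothetical helper, not part of PhyloPhlAn.

```python
def parse_listing_line(line):
    """Split one listing line into (taxonomy, n_species, n_genomes)."""
    taxonomy, n_species, n_genomes = line.split()
    return taxonomy, int(n_species), int(n_genomes)


line = ("k__Bacteria|p__Proteobacteria|[..]|f__Moraxellaceae|g__Acinetobacter"
        "       227     2984")
taxonomy, n_species, n_genomes = parse_listing_line(line)

baumannii = 2478                 # genomes of s__Acinetobacter_baumannii
others = n_genomes - baumannii   # 506 genomes spread over the other 226 species

# -n 1: one genome per species.
with_n1 = n_species
# -n 300: baumannii is capped at 300; assumption: every other species has
# at most 300 genomes, so all of their genomes are downloaded.
with_n300 = min(baumannii, 300) + others

print(with_n1, with_n300)  # 227 806
```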

I hope this helps.

Thanks,
Francesco

@lipumpkin
Author

Hi, thank you very much.

I now fully understand the meaning of the -n parameter.
Your answer definitely helped me understand this code better.

Thanks,
Zikun
