uniprot database differing number of markers #93

Open
pavlo888 opened this issue Jun 20, 2022 · 4 comments

@pavlo888

Dear @fasnicar

Thanks for this amazing tool! I have already run it once and it went smoothly and the results were very informative.

However, I am now trying to run the analysis again and am unsure how to proceed. In the first run, I used 228 genomes, and downloading the UniProt reference database retrieved 2,142 markers. Now that I am re-running the analysis with 218 genomes, the UniProt reference database download retrieved only 1,991 markers.

Is there any reason for the differences in the number of markers? And also, which set would you recommend using? I would assume the one with more markers, since it would be more informative.

Thanks a lot for your help!

Best regards,
Pablo

@fasnicar fasnicar self-assigned this Jun 30, 2022
@fasnicar
Collaborator

Dear Pablo, many thanks for using PhyloPhlAn.

So, I believe you downloaded the same set of UniRef90s twice for the same taxonomic label, correct?
If that's the case and you see ~200 fewer proteins, it could be because UniRef IDs are not permanent and change continuously. The phylophlan_setup_database script tries to resolve old IDs into new ones, but for some of them the UniRef API may not provide a new ID. In that case, phylophlan_setup_database writes the <SPECIES_LABEL>_core_proteins_not_mapped.txt file listing the missing IDs.
If you did download the same set of UniRef90 proteins twice, then I would say you can use the older one, as there should be no real difference or advantage between the two databases.

Many thanks,
Francesco
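
To see exactly which markers differ between two downloads, the two ID lists can be compared directly. The following is only a minimal sketch: it assumes each input is a plain-text file with one ID per line (the actual layout of <SPECIES_LABEL>_core_proteins_not_mapped.txt is not described in this thread), and `read_ids` and `compare_marker_sets` are hypothetical helper names, not part of PhyloPhlAn.

```python
from pathlib import Path


def read_ids(path):
    """Return the set of non-empty, stripped lines in a text file.

    Assumption: the file holds one ID per line.
    """
    text = Path(path).read_text()
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_marker_sets(old_ids, new_ids):
    """Summarise which marker IDs are shared, lost, or gained."""
    old_ids, new_ids = set(old_ids), set(new_ids)
    return {
        "shared": old_ids & new_ids,
        "only_in_old": old_ids - new_ids,
        "only_in_new": new_ids - old_ids,
    }
```

If the IDs in `only_in_old` also appear in the not-mapped file of the newer run, that would be consistent with the explanation above that some old UniRef IDs could not be resolved to new ones.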

@alexhbnr

alexhbnr commented Aug 1, 2022

Dear Francesco,

I have a follow-up question. I had a similar issue, where the script phylophlan_setup_database could retrieve only 1,181 of the 3,369 proteins listed for "s__Yersinia_pestis". When I looked into why many of these failed, I found the case you described above, where a protein has been deprecated and moved from UniProt to UniParc. However, I also found that the protein sequences can be requested from https://rest.uniprot.org/uniprotkb/{}.fasta instead of http://www.uniprot.org/uniref/UniRef90_{}.fasta. When using the latter download URL with the same list of protein IDs from your database file taxa2core_cpa201901_up201901.txt.bz2, I was able to download all 3,369 protein sequences, compared with 1,181 sequences using the former.

Do you by chance know if the download API has changed or is this in fact a different set of protein sequences?
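
For anyone wanting to repeat this comparison, the two URL templates quoted above can be filled in per protein ID as in the sketch below. The templates are copied from this comment; `candidate_urls` is a hypothetical helper and the ID `P12345` is only a placeholder. Nothing is downloaded here.

```python
# URL templates as quoted in this thread; "{}" is replaced by the protein ID.
UNIREF_URL = "http://www.uniprot.org/uniref/UniRef90_{}.fasta"
UNIPROTKB_URL = "https://rest.uniprot.org/uniprotkb/{}.fasta"


def candidate_urls(protein_id):
    """Return the UniRef90 and UniProtKB download URLs for one protein ID."""
    return {
        "uniref90": UNIREF_URL.format(protein_id),
        "uniprotkb": UNIPROTKB_URL.format(protein_id),
    }


# Placeholder ID, for illustration only.
urls = candidate_urls("P12345")
print(urls["uniref90"])
print(urls["uniprotkb"])
```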

@fasnicar
Collaborator

Dear @alexhbnr, thanks for following up on this. The UniProt APIs changed recently and I've updated PhyloPhlAn accordingly, as described in #98. Did you try this latest version, or are you still using the previous one? If you're using the latest, I'll be happy to take another look. Otherwise, I would kindly ask you to get the latest version from the GitHub repo and check whether you still have the same issue.

Many thanks,
Francesco

@alexhbnr

Hi @fasnicar,
I finally had a chance to check the latest commit that you mentioned in #98. It solves the issue for me, too. With http://www.uniprot.org/uniref/UniRef90_{}.fasta, a number of sequences can no longer be downloaded, but the alternative URL https://rest.uniprot.org/uniprotkb/{}.fasta that I posted previously doesn't solve this issue either. @maxibor pointed out to me that although it doesn't generate an error, it just creates empty FastA files.

After your latest update, I could download 2,592 genes regardless of which of the two URLs I used. So I consider the problem solved.

Thanks again for the help!
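
Because a download can succeed without an error and still produce an empty FastA file, counting files alone can overstate how many sequences were retrieved. The sketch below checks the content instead; it assumes the response bodies have already been collected into a dictionary keyed by protein ID, and `looks_like_fasta` and `count_valid` are hypothetical helpers, not PhyloPhlAn functions.

```python
def looks_like_fasta(text):
    """Return True if text has a '>' header followed by sequence data."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        return False
    # At least one non-header line must follow the first header.
    return any(not line.startswith(">") for line in lines[1:])


def count_valid(downloads):
    """Return (number of valid records, sorted IDs of invalid ones).

    `downloads` maps a protein ID to the text returned for it.
    """
    failed = sorted(pid for pid, body in downloads.items()
                    if not looks_like_fasta(body))
    return len(downloads) - len(failed), failed
```

Running `count_valid` on the results from each URL would separate sequences that were really retrieved from requests that returned nothing.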
