Use of own sequences for splitting #76
Comments
Thanks for using it! You can exclude biounits that contain chains similar to a chain from the PDB by using the So if you already have a dataset that is split into subsets, you can run There's also the The excluded files will be moved to the I have actually just pushed the latest version of the package (1.4.0) that deals with those files a bit better and updates the split dictionaries accordingly. |
I mean, right now I have 11 sets of different sequences. They are public
now, but needn’t be. I’m not sure how what you have shown would let me run
the split on them. I have FASTA sequences and no PDB IDs, since some of
these don’t have crystal structures, so AF models are used instead. Those
models are also more complete structures (i.e. no missing loops, missing
density, etc.).
If that’s beyond scope, I understand, but it would certainly be useful as
an input: just a FASTA file with a set of sequences. Many times this is
what we have for training.
On Fri, Jun 23, 2023 at 1:25 PM Liza Kozlova wrote:
Closed #76 as completed.
|
The current workaround is to look up your sequences in the PDB and find the closest homolog to each one (hopefully there is something with >90% sequence similarity to at least some of your sequences). Then you can use the homolog's PDB IDs with the |
I'll reopen the issue so that we keep it in mind. |
Thanks for the workaround, keeping it open, and the info about the future!
On Tue, Jun 27, 2023 at 12:44 PM Liza Kozlova wrote:
Reopened #76.
|
So, this doesn't fix the issue. It just excludes sequences, whereas the requested functionality is about using our own sequences/structures, especially ones from AF. Many of these do not have PDB IDs. |
Not everyone wants to use this on PDB structures. That’s the major problem.
Some need to use it with AlphaFold structures. Some projects don’t need
structures at all…
And there is currently no way to address either of these cases…
On Wed, Oct 11, 2023 at 7:25 AM Liza Kozlova wrote:
With the new --exclude_chains_file option (#113) it's possible to
exclude custom sequences (just put them in a text file, one line = one
amino acid sequence, no PDB id required). Is there something else we should
add here @jadolfbr?
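For anyone starting from a FASTA file, the one-sequence-per-line text file described above can be produced with a few lines of Python. This is only a minimal sketch, not part of the package: the function names and the file names `my_sequences.fasta` and `excluded.txt` are made up for illustration, and it assumes the file should contain nothing but the raw sequences (headers are dropped, wrapped lines are joined, exact duplicates are written once).

```python
from pathlib import Path


def read_fasta_sequences(fasta_text):
    """Return the sequences found in FASTA-formatted text.

    Header lines (starting with '>') are discarded and sequences that are
    wrapped over several lines are joined into a single string.
    """
    sequences = []
    current = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def fasta_to_exclude_file(fasta_path, out_path):
    """Write one sequence per line to out_path (hypothetical helper).

    Returns the number of distinct sequences written.
    """
    sequences = read_fasta_sequences(Path(fasta_path).read_text())
    # Drop exact duplicates while preserving the original order.
    unique = list(dict.fromkeys(sequences))
    Path(out_path).write_text("\n".join(unique) + "\n")
    return len(unique)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    if Path("my_sequences.fasta").exists():
        fasta_to_exclude_file("my_sequences.fasta", "excluded.txt")
```

The resulting text file could then be passed with the `--exclude_chains_file` option mentioned in this thread.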
|
Alright, so just to make it very clear, right now structures are not required if you want to exclude files based on sequences. Any custom sequence can be added to the text file specified with --exclude_chains_file, no PDB format or structure of any kind is needed. If you want to exclude proteins that have sequences similar to "AAAAVWFAAA" or "DDDDDDRKRKRK", just create a text file (e.g. excluded.txt) that contains those two lines and pass this file as an option when splitting the data (--exclude_chains_file excluded.txt). We do not have support for generating new datasets using something other than SAbDab or PDB as the source, however, if that is what you mean. That kind of generalisation is harder to do. |
Yes, that is what I mean. Not every UniProt sequence has a PDB entry, so
being able to give a custom dataset in FASTA format or something like that
would be ideal. In addition, those sequences may be designs or sets
of sequences/variants from experiments…
|
Really nice package! One thing I feel is missing is being able to split based on a set of sequences, for example, sequences that may have some biophysical properties one is trying to predict using ML methods.
I did not find a way to do this if it already exists.