Use of own sequences for splitting #76

jadolfbr · 2023-06-23T16:13:18Z

Really nice package! One thing I feel is missing is being able to split based on a set of sequences, for example, sequences that may have some biophysical properties one is trying to predict using ML methods.

If this already exists, I could not find a way to do it.

elkoz · 2023-06-23T17:25:17Z

Thanks for using it!

You can exclude biounits that contain chains similar to a chain from the PDB by using the --exclude_chains and --exclude_threshold arguments of proteinflow split.

So if you already have a dataset that is split into subsets, you can run proteinflow unsplit --tag {tag}, or if you want to download / generate a new one, run proteinflow download or proteinflow generate with the --skip_splitting flag.
And then run e.g. proteinflow split --tag {tag} --exclude_chains 7kgk-A --exclude_threshold 0.7 --ignore_existing to split the dataset again and exclude all biounits that contain chains that are more than 70% similar to 7kgk-A.
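For anyone scripting this step, here is a minimal sketch that assembles the split command quoted above as an argument list. It uses only the flags mentioned in this comment; the tag `my_dataset` and the helper name `build_split_command` are placeholders for illustration, not part of proteinflow.

```python
import shlex


def build_split_command(tag: str, chain: str, threshold: float) -> list[str]:
    """Assemble the `proteinflow split` call described above as an argv list."""
    return [
        "proteinflow", "split",
        "--tag", tag,
        "--exclude_chains", chain,              # e.g. 7kgk-A (PDB ID and chain)
        "--exclude_threshold", str(threshold),  # 0.7 means "more than 70% similar"
        "--ignore_existing",
    ]


# "my_dataset" is a hypothetical tag; substitute your own.
command = build_split_command("my_dataset", "7kgk-A", 0.7)
print(shlex.join(command))
# proteinflow split --tag my_dataset --exclude_chains 7kgk-A --exclude_threshold 0.7 --ignore_existing
```

With proteinflow installed, the list can be passed to `subprocess.run(command, check=True)` to actually run the split.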

There's also the --exclude_clusters option to exclude whole clusters if one of those biounits belongs there and --exclude_based_on_cdr to only exclude particular CDR clusters (for SAbDab datasets).

The excluded files will be moved to the excluded folder at the same level as train, test and valid.

I have actually just pushed the latest version of the package (1.4.0) that deals with those files a bit better and updates the split dictionaries accordingly.

jadolfbr · 2023-06-23T18:25:04Z

I mean, right now I have 11 sets of different sequences. They happen to be public, but they needn’t be. I’m not sure how what you have shown would let me split on them: I have FASTA sequences and no PDB IDs, because some of these proteins don’t have crystal structures, so AF models are used instead. Those models are also more complete structures (i.e. no missing loops, missing density, etc.). If that’s beyond scope, I understand, but it would certainly be useful as an input: just a FASTA file with a set of sequences. Many times this is what we have for training.

danielnzg85 · 2023-06-26T07:46:55Z

The current workaround is to look up your sequences in the PDB and find the closest homolog of each (hopefully there is something with >90% sequence similarity to some of your sequences). Then you can pass the homologs' PDB IDs to the --exclude_chains option to get a similar splitting outcome. This is not perfect and might not work in your case, but if you need this ASAP you can do it this way. We plan to add support for excluding proteins by sequence in upcoming releases.

elkoz · 2023-06-27T16:44:17Z

I'll reopen the issue so that we keep it in mind.

jadolfbr · 2023-06-27T18:02:33Z

Thanks for the workaround, keeping it open, and the info about the future!

jadolfbr · 2023-09-28T19:07:34Z

So, this doesn't fix the issue. This just excludes sequences, whereas the functionality I'm asking for is about using our own sequences/structures, especially from AF, i.e. many of these do not have PDB IDs.

elkoz · 2023-10-11T11:25:00Z

With the new --exclude_chains_file option (#113) it's possible to exclude custom sequences (just put them in a text file, one line = one amino acid sequence, no PDB ID required). Is there something else we should add here, @jadolfbr?

jadolfbr · 2023-10-11T16:26:07Z

Not everyone wants to use this on PDB structures. That’s the major problem. Some need it for AlphaFold structures. Some projects don’t need structures at all… And there is currently no way to address either of these…

elkoz · 2023-10-12T09:05:53Z

Alright, so just to make it very clear, right now structures are not required if you want to exclude files based on sequences. Any custom sequence can be added to the text file specified with --exclude_chains_file, no PDB format or structure of any kind is needed. If you want to exclude proteins that have sequences similar to "AAAAVWFAAA" or "DDDDDDRKRKRK", just create a text file (e.g. excluded.txt) that contains those two lines and pass this file as an option when splitting the data (--exclude_chains_file excluded.txt).
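Since the sequences in this thread start out as FASTA, here is a minimal sketch of converting a FASTA file into the one-sequence-per-line text file described above. The function names and the file names in the usage note are made up for illustration and are not part of proteinflow; the sketch assumes plain FASTA with `>` header lines and drops the headers, because the text file holds sequences only.

```python
from pathlib import Path


def fasta_to_sequences(fasta_text: str) -> list[str]:
    """Return the sequences in a FASTA string, one string per record."""
    sequences: list[str] = []
    current: list[str] = []
    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header starts a new record, so flush the previous one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            # Sequence lines of one record may be wrapped; join them.
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def write_exclude_file(fasta_path: str, out_path: str) -> int:
    """Write one sequence per line to out_path; return how many were written."""
    sequences = fasta_to_sequences(Path(fasta_path).read_text())
    Path(out_path).write_text("".join(seq + "\n" for seq in sequences))
    return len(sequences)


example = """>design_1
AAAAVW
FAAA
>design_2
DDDDDDRKRKRK
"""
print(fasta_to_sequences(example))  # ['AAAAVWFAAA', 'DDDDDDRKRKRK']
```

For example, `write_exclude_file("my_sequences.fasta", "excluded.txt")` would produce a file that can then be passed as --exclude_chains_file excluded.txt when splitting.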

However, if what you mean is generating new datasets from a source other than SAbDab or the PDB, we do not support that. That kind of generalisation is harder to do.

jadolfbr · 2023-10-12T14:33:14Z

Yes, that is what I mean. Not every UniProt sequence has a PDB entry, so being able to provide a custom dataset in FASTA format or something like that would be ideal. In addition, those sequences might be designs, or sets of sequences/variants from experiments…

elkoz closed this as completed Jun 23, 2023

elkoz reopened this Jun 27, 2023

elkoz mentioned this issue Sep 13, 2023 in "Add the option to exclude custom sequences" #113 (merged)

elkoz closed this as completed in #113 Sep 14, 2023
