Use of own sequences for splitting #76

Closed
jadolfbr opened this issue Jun 23, 2023 · 10 comments · Fixed by #113

Comments

@jadolfbr

Really nice package! One thing I feel is missing is the ability to split based on a user-supplied set of sequences, for example sequences with biophysical properties that one is trying to predict using ML methods.

If this already exists, I could not find a way to do it.

@elkoz
Collaborator

elkoz commented Jun 23, 2023

Thanks for using it!

You can exclude biounits that contain chains similar to a chain from the PDB by using the --exclude_chains and --exclude_threshold arguments of proteinflow split.

So if you already have a dataset that is split into subsets, you can run proteinflow unsplit --tag {tag}. If you want to download or generate a new one instead, run proteinflow download or proteinflow generate with the --skip_splitting flag.
Then run e.g. proteinflow split --tag {tag} --exclude_chains 7kgk-A --exclude_threshold 0.7 --ignore_existing to split the dataset again and exclude all biounits that contain chains that are more than 70% similar to 7kgk-A.
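
For anyone scripting this, here is a minimal Python sketch that assembles the two commands described above for an already-split dataset. The helper name build_resplit_commands and the tag "my_dataset" are hypothetical; the subcommands and flags are copied from this comment and have not been checked against a particular proteinflow version. The script only prints the commands and does not run them.

```python
import shlex


def build_resplit_commands(tag, chain, threshold):
    """Return the unsplit and split invocations as argv lists.

    Hypothetical helper: the flags mirror the ones quoted in this thread.
    """
    unsplit = ["proteinflow", "unsplit", "--tag", tag]
    split = [
        "proteinflow", "split",
        "--tag", tag,
        "--exclude_chains", chain,
        "--exclude_threshold", str(threshold),
        "--ignore_existing",
    ]
    return [unsplit, split]


if __name__ == "__main__":
    # "my_dataset" is a placeholder tag; 7kgk-A and 0.7 come from the example above.
    for argv in build_resplit_commands("my_dataset", "7kgk-A", 0.7):
        print(shlex.join(argv))
```

To actually execute the commands, each argv list could be passed to subprocess.run(argv, check=True), assuming proteinflow is installed and on the PATH.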

There's also the --exclude_clusters option to exclude a whole cluster if one of those biounits belongs to it, and --exclude_based_on_cdr to only exclude particular CDR clusters (for SAbDab datasets).

The excluded files will be moved to the excluded folder at the same level as train, test and valid.

I have actually just pushed the latest version of the package (1.4.0) that deals with those files a bit better and updates the split dictionaries accordingly.

@elkoz elkoz closed this as completed Jun 23, 2023
@jadolfbr
Author

jadolfbr commented Jun 23, 2023 via email

@danielnzg85
Contributor

danielnzg85 commented Jun 26, 2023

The current workaround is to look up your sequences in the PDB and find the closest homolog to each one (hopefully there is something with >90% sequence similarity to some of your sequences). Then you can pass the homologs' PDB IDs to the --exclude_chains option to get a similar splitting outcome. This is not perfect and might not work in your case, but if you need it right away you can do it this way. We plan to add support for excluding proteins by sequence in upcoming releases.

@elkoz
Collaborator

elkoz commented Jun 27, 2023

I'll reopen the issue so that we keep it in mind.

@elkoz elkoz reopened this Jun 27, 2023
@jadolfbr
Author

jadolfbr commented Jun 27, 2023 via email

@jadolfbr
Author

So, this doesn't fix the issue. This just excludes sequences. The functionality I'm asking for is more about our own sequences/structures, especially from AF, i.e. many of these do not have PDB IDs.

@elkoz
Collaborator

elkoz commented Oct 11, 2023

With the new --exclude_chains_file option (#113) it's possible to exclude custom sequences (just put them in a text file, one line = one amino acid sequence, no PDB ID required). Is there something else we should add here, @jadolfbr?

@jadolfbr
Author

jadolfbr commented Oct 11, 2023 via email

@elkoz
Collaborator

elkoz commented Oct 12, 2023

Alright, so just to make it very clear, right now structures are not required if you want to exclude files based on sequences. Any custom sequence can be added to the text file specified with --exclude_chains_file, no PDB format or structure of any kind is needed. If you want to exclude proteins that have sequences similar to "AAAAVWFAAA" or "DDDDDDRKRKRK", just create a text file (e.g. excluded.txt) that contains those two lines and pass this file as an option when splitting the data (--exclude_chains_file excluded.txt).
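
If your sequences live in a FASTA file rather than a plain list, a small script can produce the one-sequence-per-line file described above. This is only a sketch: FASTA input is an assumption (the thread does not say how the sequences are stored), and read_fasta_sequences and write_exclusion_file are hypothetical helpers, not part of proteinflow.

```python
from pathlib import Path


def read_fasta_sequences(fasta_text):
    """Collect sequences from FASTA-formatted text, joining wrapped lines."""
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line ends the previous record, if there was one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def write_exclusion_file(sequences, path):
    """Write one amino acid sequence per line, with no headers or IDs."""
    Path(path).write_text("\n".join(sequences) + "\n")


if __name__ == "__main__":
    # The two example sequences from this comment; the second is wrapped
    # over two lines to show that wrapped records are joined.
    fasta = ">seq1\nAAAAVWFAAA\n>seq2\nDDDDDD\nRKRKRK\n"
    write_exclusion_file(read_fasta_sequences(fasta), "excluded.txt")
    # The resulting file is then passed when splitting, e.g.:
    print("proteinflow split --tag {tag} --exclude_chains_file excluded.txt")
```

The headers are dropped on purpose, since the file format only expects raw sequences and no identifiers.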

However, if what you mean is generating new datasets from a source other than SAbDab or PDB, we do not support that. That kind of generalisation is harder to do.

@jadolfbr
Author

jadolfbr commented Oct 12, 2023 via email
