Highly similar structures are clustered into separate groups or result in error #371

NatureGeorge · 2024-10-22T07:57:02Z

Expected Behavior

Given a directory containing the PDB files with the following PDB IDs:

8G2V,7UZI

The ten chains of 8G2V (A, B, C, D, E, F, G, H, I, J) have nearly identical structures and should therefore be clustered into the same group.

Current Behavior

Each chain of 8G2V ends up in its own independent cluster.

Steps to Reproduce (for bugs)

foldseek easy-cluster 8G2V.cif.gz 7UZI.cif.gz result tmp --tmscore-threshold 0.5
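To check how the chains were grouped, the cluster table can be summarized with a few lines of Python. This is a minimal sketch, assuming the run writes a two-column, tab-separated file (representative, member) such as result_cluster.tsv; the file name and layout are assumptions here, and the chain identifiers in the example string are hypothetical, not output from this run.

```python
import csv
import io
from collections import defaultdict


def group_clusters(tsv_text):
    """Map each cluster representative to the list of its members.

    Assumes two tab-separated columns per line: representative, member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical table content for illustration only.
example = "chainA\tchainA\nchainA\tchainB\nchainC\tchainC\n"
clusters = group_clusters(example)
for rep, members in clusters.items():
    # One line per cluster: representative, size, members
    print(rep, len(members), ",".join(members))
```

With a real run, the text of the cluster table (for example, the contents of result_cluster.tsv, if that is the file produced) would be passed to group_clusters instead of the example string. One cluster per chain in the summary corresponds to the behavior reported here.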

Context

A separate problem appears when clustering 8G2V on its own:

foldseek easy-cluster 8G2V.cif.gz result tmp --tmscore-threshold 0.5

which fails with the following output:

easy-cluster 8G2V.cif.gz result tmp --tmscore-threshold 0.5

MMseqs Version:                         9.427df8a
Substitution matrix                     aa:3di.out,nucl:3di.out
Seed substitution matrix                aa:3di.out,nucl:3di.out
Sensitivity                             4
k-mer length                            0
Target search mode                      0
k-score                                 seq:2147483647,prof:2147483647
Max sequence length                     65535
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Coverage threshold                      0
Coverage mode                           0
Compositional bias                      1
Compositional bias                      1
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                1
Minimum diagonal score                  30
Selected taxa
Spaced k-mers                           1
Preload mode                            0
Spaced k-mer pattern
Local temporary path
Threads                                 20
Compressed                              0
Verbosity                               3
TMscore threshold                       0.5
LDDT threshold                          0
Sort by structure bit score             1
Alignment type                          2
Exact TMscore                           0
Add backtrace                           false
Alignment mode                          0
Alignment mode                          0
E-value threshold                       10
Seq. id. threshold                      0
Min alignment length                    0
Seq. id. mode                           0
Alternative alignments                  0
Max reject                              2147483647
Max accept                              2147483647
Gap open cost                           aa:10,nucl:10
Gap extension cost                      aa:1,nucl:1
TMalign hit order                       0
TMalign fast                            1
Cluster mode                            0
Max connected component depth           1000
Similarity type                         2
Weight file name
Cluster Weight threshold                0.9
Single step clustering                  false
Cascaded clustering steps               3
Cluster reassign                        false
Remove temporary files                  true
Force restart with latest tmp           false
MPI runner
k-mers per sequence                     21
Scale k-mers per sequence               aa:0.000,nucl:0.200
Adjust k-mer length                     false
Shift hash                              67
Include only extendable                 false
Skip repeating k-mers                   false
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Path to ProstT5
Chain name mode                         0
Write mapping file                      0
Mask b-factor threshold                 0
Coord store mode                        2
Write lookup file                       1
Input format                            0
File Inclusion Regex                    .*
File Exclusion Regex                    ^$

cluster tmp/7126666531623036926/input tmp/7126666531623036926/clu tmp/7126666531623036926/clu_tmp --tmscore-threshold 0.5 --remove-tmp-files 1

Set cluster sensitivity to -s 8.000000
Set cluster mode SET COVER
Set cluster iterations to 3
tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ca exists and will be overwritten
createsubdb tmp/7126666531623036926/clu_tmp/4050237725070610072/clu_redundancy tmp/7126666531623036926/input_ca tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ca -v 3 --subdb-mode 1

Time for merging to input_step_redundancy_ca: 0h 0m 0s 0ms
Time for processing: 0h 0m 0s 1ms
prefilter tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ss tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ss tmp/7126666531623036926/clu_tmp/4050237725070610072/pref_step0 --sub-mat 'aa:3di.out,nucl:3di.out' --seed-sub-mat 'aa:3di.out,nucl:3di.out' -s 1 -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 100 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 0 --comp-bias-corr 0 --comp-bias-corr-scale 1 --diag-score 0 --exact-kmer-matching 0 --mask 0 --mask-prob 0.9 --mask-lower-case 1 --min-ungapped-score 0 --add-self-matches 1 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 20 --compressed 0 -v 3

Query database size: 10 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 10 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6
Index table: counting k-mers
[=================================================================] 100.00% 10 0s 0ms
Index table: Masked residues: 0
No k-mer could be extracted for the database tmp/7126666531623036926/clu_tmp/4050237725070610072/input_step_redundancy_ss.
Maybe the sequences length is less than 14 residues.
Error: Prefilter step 0 died
Error: Search died

Your Environment

Which foldseek version was used (Statically-compiled, self-compiled, Conda, etc.): conda 9.427df8a
