
Upgrade antiSMASH 7.1.0 #339

Merged
merged 20 commits into from
Apr 22, 2024

20 commits
0485bff
chore: handle patch versioning in antiSMASH
matinnuhamunada Mar 13, 2024
096a925
fix: improve symlink generation for downstream BGC prep and sanitize …
matinnuhamunada Apr 3, 2024
d50f6a9
fix: handle incompatible json schema from older antiSMASH version
matinnuhamunada Apr 3, 2024
42533e9
feat: upgrade antiSMASH version 7.1.0
matinnuhamunada Apr 3, 2024
5cf8be8
feat: upgrade BiG-SCAPE version 1.1.9
matinnuhamunada Apr 3, 2024
90b6d33
fix: update example BGC project to antismash 7.1.0
matinnuhamunada Apr 17, 2024
fe13583
chore: apply string formatting for project columns
matinnuhamunada Apr 17, 2024
a38fb1f
fix: handle input when both input_file and gbk_path in config sample
matinnuhamunada Apr 17, 2024
6bb8d07
feat: upgrade duckdb and metabase
matinnuhamunada Apr 17, 2024
9750419
fix: handle empty values
matinnuhamunada Apr 17, 2024
5b02478
chore: add getphylo example in config
matinnuhamunada Apr 17, 2024
3a78df6
chore: turn off getphylo as it is still in experimental phase
matinnuhamunada Apr 17, 2024
138cc5f
chore: pin version 0.2.1 for getphylo
matinnuhamunada Apr 17, 2024
dbe5c6c
feat: upgrade checkm
matinnuhamunada Apr 18, 2024
bf77b2b
feat: upgrade seqfu
matinnuhamunada Apr 18, 2024
d911b29
feat: upgrade clinker
matinnuhamunada Apr 18, 2024
a1f6da0
feat: upgrade cblaster
matinnuhamunada Apr 18, 2024
ef70724
fix: correct log path for patric download
matinnuhamunada Apr 18, 2024
29290e6
feat: add script to upload database to motherduck
matinnuhamunada Apr 18, 2024
c587e0c
chore: bump version to 0.9.0
matinnuhamunada Apr 18, 2024
4 changes: 2 additions & 2 deletions .examples/_config_example.yaml
@@ -76,5 +76,5 @@ rule_parameters:
 utility_parameters:
   METABASE_MIN_MEMORY: "2g"
   METABASE_MAX_MEMORY: "8g"
-  METABASE_VERSION: "v0.47.0"
-  METABASE_DUCKDB_PLUGIN_VERSION: "0.2.2"
+  METABASE_VERSION: "v0.49.6"
+  METABASE_DUCKDB_PLUGIN_VERSION: "0.2.6"

This file was deleted.

@@ -0,0 +1,5 @@
+bgc_id,genome_id,region,accession,start_pos,end_pos,contig_edge,product,region_length,most_similar_known_cluster_id,most_similar_known_cluster_description,most_similar_known_cluster_type,similarity,source,gbk_path
+CR954253.1.region001,GCA_000056065.1,1.1,CR954253.1,17407,39909,False,['lanthipeptide-class-iii'],22502,,,,,bgcflow,data/interim/antismash/7.1.0/GCA_000056065.1/CR954253.1.region001.gbk
+CR954253.1.region003,GCA_000056065.1,1.3,CR954253.1,1745672,1767868,False,['lanthipeptide-class-iv'],22196,,,,,bgcflow,data/interim/antismash/7.1.0/GCA_000056065.1/CR954253.1.region003.gbk
+CP000156.1.region002,GCA_000191165.1,1.2,CP000156.1,1767251,1789447,False,['lanthipeptide-class-iv'],22196,,,,,bgcflow,data/interim/antismash/7.1.0/GCA_000191165.1/CP000156.1.region002.gbk
+CP000412.1.region001,GCA_000014405.1,1.1,CP000412.1,17283,39785,False,['lanthipeptide-class-iii'],22502,,,,,bgcflow,data/interim/antismash/7.1.0/GCA_000014405.1/CP000412.1.region001.gbk
7 changes: 4 additions & 3 deletions .examples/lanthipeptide_lactobacillus/project_config.yaml
@@ -1,12 +1,13 @@
 name: lanthipeptide_lactobacillus
 pep_version: 2.1.0
 description: 'A selection of lanthipeptides from Lactobacillus delbrueckii'
-sample_table: df_regions_antismash_7.0.0.csv
+sample_table: df_regions_antismash_7.1.0.csv

 rules:
   bigslice: TRUE
   bigscape: TRUE
-  query-bigslice: TRUE
+  query-bigslice: FALSE
   clinker: TRUE
-  interproscan: TRUE
+  interproscan: FALSE
   mmseqs2: TRUE
+  getphylo: FALSE
@@ -10,3 +10,4 @@ rules:
   clinker: TRUE
   interproscan: TRUE
   mmseqs2: TRUE
+  getphylo: TRUE
9 changes: 8 additions & 1 deletion workflow/BGC
@@ -61,12 +61,19 @@ def extract_bgc_project_information(config, project_variable="projects", sample_
print(f" - Processing project {pep_file}", file=sys.stderr)
p = peppy.Project(pep_file, sample_table_index=sample_table_index)


# make sure each project has unique names
assert (
not p.name in df_projects["name"].unique()
), f"Project name [{p.name}] in [{pep_file}] has been used. Please use different name for each project."

+    # assign column types as string
+    for col in ["name", "samples", "rules"]:
+        if not col in df_projects.columns:
+            df_projects[col] = pd.Series(dtype=str)
+        else:
+            df_projects[col] = df_projects[col].astype(str)

# add values
df_projects.loc[p.name, "name"] = p.name
df_projects.loc[p.name, "samples"] = p.config_file
df_projects.loc[p.name, "rules"] = p.config_file
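The column-typing loop added to `workflow/BGC` above can be exercised on its own. A minimal sketch (the `proj_a` row and its paths are hypothetical; only the column names come from the diff), showing that pre-creating string-dtype columns keeps later `.loc` assignments as plain strings:

```python
import pandas as pd

df_projects = pd.DataFrame()

# Pre-create the columns as string dtype so later row-wise .loc
# assignments do not trigger dtype inference on an empty frame.
for col in ["name", "samples", "rules"]:
    if col not in df_projects.columns:
        df_projects[col] = pd.Series(dtype=str)
    else:
        df_projects[col] = df_projects[col].astype(str)

# Row-wise assignment, as done once per PEP project in the workflow.
df_projects.loc["proj_a", "name"] = "proj_a"
df_projects.loc["proj_a", "samples"] = "config/proj_a/project_config.yaml"

print(df_projects.loc["proj_a", "name"])  # → proj_a
```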
4 changes: 2 additions & 2 deletions workflow/Metabase
@@ -156,8 +156,8 @@ def setup_metabase(token, api_url, metabase_config):
     metabase_config = {
         "METABASE_MIN_MEMORY": "2g",
         "METABASE_MAX_MEMORY": "8g",
-        "METABASE_VERSION": "v0.47.0",
-        "METABASE_DUCKDB_PLUGIN_VERSION": "0.2.2",
+        "METABASE_VERSION": "v0.49.6",
+        "METABASE_DUCKDB_PLUGIN_VERSION": "0.2.6",
         "DMB_SETUP_TOKEN": "ad0fb086-351b-4fa5-a17e-76282d2c9753",
         "METABASE_HTTP": "http://localhost:3000",
         "MB_IS_METABOT_ENABLED" : "true"
119 changes: 91 additions & 28 deletions workflow/bgcflow/bgcflow/data/bgc_downstream_prep_selection.py
@@ -10,34 +10,26 @@
logging.basicConfig(format=log_format, datefmt=date_format, level=logging.DEBUG)


-def bgc_downstream_prep(input_dir, output_dir, selected_bgcs=False):
+def generate_symlink(path, genome_id, output_dir, selected_bgcs=False):
"""
Given an antiSMASH directory, check for changed name
"""
logging.info(f"Reading input directory: {input_dir}")
path = Path(input_dir)
if not path.is_dir():
raise FileNotFoundError(f"No such file or directory: {path}")

genome_id = path.name
outpath = Path(output_dir) / genome_id
outpath.mkdir(parents=True, exist_ok=True)
logging.debug(f"Deducting genome id as {genome_id}")

change_log = {genome_id: {}}
ctr = 0
-    matches = [Path(i).stem for i in selected_bgcs.split()]
+    matches = selected_bgcs.stem
for gbk in path.glob("*.gbk"):
if gbk.stem in matches:
-            logging.debug(f"MATCH: {gbk.stem}")
+            logging.debug(f"Found match: {gbk.stem}")
filename = gbk.name
ctr = ctr + 1
logging.info(f"Parsing file: {gbk.name}")
region = SeqIO.parse(str(gbk), "genbank")
for record in region:
-                logging.info(f"{gbk} {record.id}")
+                logging.debug(f"Processing: {gbk.name}: {record.id}")
record_log = {}
-                if "comment" in record.annotations:
+                if "structured_comment" in record.annotations:
try:
original_id = record.annotations["structured_comment"][
"antiSMASH-Data"
@@ -47,7 +39,35 @@ def bgc_downstream_prep(input_dir, output_dir, selected_bgcs=False):
logging.warning(
f"Found shortened record.id: {record.id} <- {original_id}."
)
else:
raise ValueError(f"No Structured Comments in record: {gbk.name}")

if (":" in str(record.description)) or (":" in original_id):
logging.warning(
f"Illegal character ':' found in genbank description, removing: {record.description}"
)
# Remove colon from description
record.description = record.description.replace(":", "")
original_id = original_id.replace(":", "")

# Rename antiSMASH comment
if "structured_comment" in record.annotations:
if (
"Original ID"
in record.annotations["structured_comment"][
"antiSMASH-Data"
]
):
record.annotations["structured_comment"]["antiSMASH-Data"][
"Original ID"
] = original_id

# Write new GenBank file
new_filename = filename.replace(record.id, original_id)
with open(outpath / new_filename, "w") as output_handle:
SeqIO.write(record, output_handle, "genbank")
link = outpath / new_filename
else:
# generate symlink
new_filename = filename.replace(record.id, original_id)
target_path = Path.cwd() / gbk # target for symlink
@@ -64,23 +84,66 @@
link.unlink()
link.symlink_to(target_path)

record_log["record_id"] = record.id
record_log["original_id"] = original_id
record_log["target_path"] = str(gbk)
record_log["symlink_path"] = str(link)
else:
logging.warning(f"No Comments in record: {gbk.name}")
# Assert that the symlink was correctly generated
assert link.is_symlink(), f"Failed to create symlink: {link}"
assert (
link.resolve() == target_path
), f"Symlink {link} does not point to the correct target: {target_path}"

record_log["record_id"] = record.id
record_log["original_id"] = original_id
record_log["target_path"] = str(gbk)
record_log["symlink_path"] = str(link)

change_log = {filename: record_log}
return change_log


def bgc_downstream_prep(input_file, output_dir):
logging.info(f"Reading input file: {input_file}")
with open(input_file, "r") as file:
file_paths = [Path(f) for f in file.read().splitlines()]
change_log_containers = {}
for num, selected_bgcs in enumerate(file_paths):
input_dir = selected_bgcs.parent
logging.info(f"Reading input directory: {input_dir}")
path = Path(input_dir)
if not path.is_dir():
raise FileNotFoundError(f"No such file or directory: {path}")

# check if it has complete antiSMASH results
if (path / f"{path.name}.json").is_file():
logging.info("Found full antiSMASH record")
genome_id = path.name
else:
logging.warning("No full antiSMASH record found, unknown genome id")
genome_id = "unknown_genome_id"

change_log[genome_id][filename] = record_log
# assert 1+1==3
with open(
outpath / f"{genome_id}-change_log.json", "w", encoding="utf8"
) as json_file:
json.dump(change_log, json_file, indent=4)
assert selected_bgcs.exists(), f"File does not exist: {selected_bgcs}"
region_change_log = generate_symlink(path, genome_id, output_dir, selected_bgcs)
change_log_containers[num] = {
"genome_id": genome_id,
"value": region_change_log,
}
change_logs = {}
genome_ids = set(v["genome_id"] for v in change_log_containers.values())
for genome_id in genome_ids:
change_log = {}
for v in change_log_containers.values():
if v["genome_id"] == genome_id:
entry_name = list(v["value"].keys())[0]
change_log[entry_name] = v["value"][entry_name]
change_logs[genome_id] = change_log
logging.debug(change_logs)

logging.info(f"{genome_id}: Job done!\n")
return
for genome_id in change_logs.keys():
outpath = Path(output_dir) / genome_id
with open(
outpath / f"{genome_id}-change_log.json", "w", encoding="utf8"
) as json_file:
json.dump({genome_id: change_logs[genome_id]}, json_file, indent=4)
logging.info(f"{genome_id}: Job done!\n")


if __name__ == "__main__":
-    bgc_downstream_prep(sys.argv[1], sys.argv[2], sys.argv[3])
+    bgc_downstream_prep(sys.argv[1], sys.argv[2])
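The symlink handling rewritten in `generate_symlink` above unlinks any stale link, recreates it, and then asserts the result. A self-contained sketch of that pattern in a scratch directory (the GenBank file names are invented; note that `symlink_to` raises `OSError` on filesystems without symlink support):

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
target_path = workdir / "NC_000001.1.region001.gbk"  # hypothetical region file
target_path.write_text("LOCUS       NC_000001\n")

# Replace a stale link instead of failing with FileExistsError.
link = workdir / "NC_000001.1.region001.renamed.gbk"
if link.is_symlink():
    link.unlink()
link.symlink_to(target_path)

# Same sanity checks as the diff: the path is a symlink and it
# resolves to the intended target.
assert link.is_symlink(), f"Failed to create symlink: {link}"
assert link.resolve() == target_path.resolve(), f"Symlink {link} points elsewhere"
```

Asserting `link.resolve()` rather than comparing raw strings is what catches a symlink that was created relative to the wrong working directory.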
10 changes: 9 additions & 1 deletion workflow/bgcflow/bgcflow/data/get_dependencies.py
@@ -1,5 +1,6 @@
 import json
 import logging
+import re
 import sys

 import yaml
@@ -45,9 +46,16 @@ def get_dependency_version(dep, dep_key, antismash_version="7"):
if dep_key in p:
if p.startswith("git+"):
                 result = p.split("@")[-1]
-                result = result.replace("-", ".")
+                if dep_key == "antismash" and "-" in result:
+                    result = re.sub(r"\-", ".", result, count=2).split("-")[0]
+                else:
+                    result = result.replace("-", ".")
else:
result = p.split("=")[-1]

logging.debug(f"Version of {dep_key} is: {result}")
return str(result)


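The antiSMASH branch added in `get_dependencies.py` above converts only the first two hyphens of a git tag into dots, so a packaging suffix like the trailing `-1` in `7-1-0-1` is dropped. A standalone sketch of that logic (the `tag_to_version` helper is illustrative, not part of the codebase):

```python
import re

def tag_to_version(tag, dep_key):
    # antiSMASH tags may carry an extra package revision (e.g. "7-1-0-1"):
    # turn only the first two hyphens into dots and drop the remainder.
    if dep_key == "antismash" and "-" in tag:
        return re.sub(r"\-", ".", tag, count=2).split("-")[0]
    # Other pinned tags are plain versions with hyphens for dots.
    return tag.replace("-", ".")

print(tag_to_version("7-1-0-1", "antismash"))  # → 7.1.0
print(tag_to_version("7-0-0", "antismash"))    # → 7.0.0
```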
2 changes: 1 addition & 1 deletion workflow/envs/antismash.yaml
@@ -40,4 +40,4 @@ dependencies:
   - yaml
   - pip
   - pip:
-    - git+https://github.com/antismash/antismash.git@7-0-0
+    - git+https://github.com/antismash/antismash.git@7-1-0-1
2 changes: 1 addition & 1 deletion workflow/envs/bigscape.yaml
@@ -6,7 +6,7 @@ dependencies:
   - python=3.6
   - dataclasses
   - hmmer
-  - biopython
+  - biopython=1.70
   - fasttree
   - numpy
   - scipy
5 changes: 3 additions & 2 deletions workflow/envs/cblaster.yaml
@@ -4,7 +4,8 @@ channels:
   - default
   - bioconda
 dependencies:
-  - diamond==2.0.15
+  - diamond==2.1.9
+  - python=3.8
   - pip
   - pip:
-    - cblaster==1.3.12
+    - cblaster==1.3.18
2 changes: 1 addition & 1 deletion workflow/envs/checkm.yaml
@@ -4,6 +4,6 @@ channels:
   - bioconda
   - defaults
 dependencies:
-  - checkm-genome=1.1.3
+  - checkm-genome==1.2.2
   - wget
   - tar
2 changes: 1 addition & 1 deletion workflow/envs/clinker.yaml
@@ -6,4 +6,4 @@ channels:
 dependencies:
   - pip
   - pip:
-    - clinker
+    - clinker==0.0.28
6 changes: 3 additions & 3 deletions workflow/envs/dbt-duckdb.yaml
@@ -5,9 +5,9 @@ channels:
   - defaults
 dependencies:
   - python==3.11
-  - python-duckdb==0.8.1
+  - python-duckdb==0.9.2
   - unzip
   - pip
   - pip:
-    - dbt-duckdb==1.6.0
-    - dbt-metabase==0.9.15
+    - dbt-duckdb==1.7.4
+    - dbt-metabase==1.3.0
2 changes: 1 addition & 1 deletion workflow/envs/getphylo.yaml
@@ -9,4 +9,4 @@ dependencies:
   - fasttree
   - pip
   - pip:
-    - getphylo
+    - getphylo==0.2.1
2 changes: 1 addition & 1 deletion workflow/envs/seqfu.yaml
@@ -4,4 +4,4 @@ channels:
   - bioconda
   - defaults
 dependencies:
-  - seqfu=1.15.3
+  - seqfu=1.20.3
2 changes: 1 addition & 1 deletion workflow/notebook/automlst-wrapper.rpy.ipynb
@@ -565,7 +565,7 @@
 " df_antismash[\"complete_bgcs\"] = df_antismash[\"bgcs_count\"] - df_antismash[\"bgcs_on_contig_edge\"]\n",
 " \n",
 " # Select the 'complete_bgcs' and 'bgcs_on_contig_edge' columns and convert them to integers\n",
-" df_antismash_completeness = df_antismash.loc[:, [\"complete_bgcs\", \"bgcs_on_contig_edge\"]].astype(int)\n",
+" df_antismash_completeness = df_antismash.loc[:, [\"complete_bgcs\", \"bgcs_on_contig_edge\"]].fillna(0).astype(int)\n",
 " \n",
 " # Define the output file path\n",
 " outfile = Path(f\"assets/iTOL_annotation/iTOL_antismash_{antismash_version}_completeness.txt\")\n",
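The `.fillna(0)` inserted before `.astype(int)` in the notebook diff guards against genomes with missing counts. A small sketch of the failure mode it fixes (column names come from the notebook; the values are made up):

```python
import numpy as np
import pandas as pd

# A NaN forces the column to float dtype, and casting non-finite values
# to int raises a ValueError (IntCastingNaNError in recent pandas).
df = pd.DataFrame({"complete_bgcs": [3.0, np.nan], "bgcs_on_contig_edge": [1, 0]})
try:
    df.loc[:, ["complete_bgcs", "bgcs_on_contig_edge"]].astype(int)
except ValueError as err:
    print(f"direct cast fails: {err}")

# Filling missing counts with 0 first makes the cast safe.
safe = df.loc[:, ["complete_bgcs", "bgcs_on_contig_edge"]].fillna(0).astype(int)
print(safe["complete_bgcs"].tolist())  # → [3, 0]
```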
6 changes: 4 additions & 2 deletions workflow/rules/antismash.smk
@@ -135,8 +135,10 @@ elif antismash_major_version >= 7:
                 --database {params.antismash_db_path} \
                 --cb-general --cb-subclusters --cb-knownclusters -c {threads} $antismash_input --logfile {log} 2>> {log}

-            # Check if the run failed due to changed detection results
-            if grep -q "ValueError: Detection results have changed. No results can be reused" {log}; then
+            # Check if the run failed due to changed detection results or changed protocluster types
+            if grep -q -e "ValueError: Detection results have changed. No results can be reused" \
+                -e "RuntimeError: Protocluster types supported by HMM detection have changed, all results invalid" {log}
+            then
                 # Use genbank input instead
                 echo "Previous JSON result is invalid, starting AntiSMASH from scratch..." >> {log}
                 antismash --genefinding-tool {params.genefinding} --output-dir {params.folder} \
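The widened error check in `antismash.smk` above can be exercised in isolation. This sketch writes one of the two fatal messages into a scratch file standing in for the Snakemake `{log}` and shows `grep -q -e … -e …` matching either message:

```shell
#!/usr/bin/env bash
set -euo pipefail

log="$(mktemp)"  # stand-in for the Snakemake {log} file
printf '%s\n' "RuntimeError: Protocluster types supported by HMM detection have changed, all results invalid" > "$log"

# grep -q exits 0 if either -e pattern matches, triggering the
# from-scratch rerun branch of the rule.
if grep -q \
    -e "ValueError: Detection results have changed. No results can be reused" \
    -e "RuntimeError: Protocluster types supported by HMM detection have changed, all results invalid" \
    "$log"
then
    echo "Previous JSON result is invalid, starting antiSMASH from scratch..."
fi

rm -f "$log"
```

Using two `-e` patterns keeps the check a single `grep` invocation instead of chaining two greps with `||`.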