Commit

Merge pull request #707 from dialvarezs/dev-checkm2
Bin QC Improvements
dialvarezs authored Dec 13, 2024
2 parents 62d6f3d + 61c23fe commit c096f9a
Showing 59 changed files with 1,846 additions and 566 deletions.
33 changes: 0 additions & 33 deletions .github/workflows/ci.yml
Expand Up @@ -130,36 +130,3 @@ jobs:
- name: Run pipeline with ${{ matrix.test_name }} test profile
run: |
nextflow run ${GITHUB_WORKSPACE} -profile ${{ matrix.test_name }},docker --outdir ./results
checkm:
name: Run single test to checkm due to database download
# Only run on push if this is the nf-core dev branch (merged PRs)
if: ${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/mag') }}
runs-on: ubuntu-latest

steps:
- name: Free some space
run: |
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
- name: Check out pipeline code
uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4

- name: Install Nextflow
run: |
wget -qO- get.nextflow.io | bash
sudo mv nextflow /usr/local/bin/
- name: Clean up Disk space
uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

- name: Download and prepare CheckM database
run: |
mkdir -p databases/checkm
wget https://zenodo.org/records/7401545/files/checkm_data_2015_01_16.tar.gz -P databases/checkm
tar xzvf databases/checkm/checkm_data_2015_01_16.tar.gz -C databases/checkm/
- name: Run pipeline with ${{ matrix.profile }} test profile
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results --binqc_tool checkm --checkm_db databases/checkm
6 changes: 6 additions & 0 deletions CHANGELOG.md
Expand Up @@ -9,12 +9,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- [#692](https://github.com/nf-core/mag/pull/692) - Added Nanoq as optional longread filtering tool (added by @muabnezor)
- [#692](https://github.com/nf-core/mag/pull/692) - Added chopper as optional longread filtering tool and/or phage lambda removal tool (added by @muabnezor)
- [#707](https://github.com/nf-core/mag/pull/707) - Make Bin QC a subworkflow (added by @dialvarezs)
- [#707](https://github.com/nf-core/mag/pull/707) - Added CheckM2 as an alternative bin completeness and QC tool (added by @dialvarezs)
- [#708](https://github.com/nf-core/mag/pull/708) - Added `--exclude_unbins_from_postbinning` parameter to exclude unbinned contigs from post-binning processes, speeding up Prokka in some cases (added by @dialvarezs)

### `Changed`

### `Fixed`

- [#707](https://github.com/nf-core/mag/pull/708) - Fixed channel passed as GUNC input (added by @dialvarezs)
- [#724](https://github.com/nf-core/mag/pull/724) - Fix quoting in `utils_nfcore_mag_pipeline/main.nf` (added by @dialvarezs)
- [#716](https://github.com/nf-core/mag/pull/692) - Make short read processing a subworkflow (added by @muabnezor)
- [#708](https://github.com/nf-core/mag/pull/708) - Fixed channel passed as GUNC input (added by @dialvarezs)
Expand All @@ -23,7 +26,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

| Tool | Previous version | New version |
| ------- | ---------------- | ----------- |
| CheckM | 1.2.1 | 1.2.3 |
| CheckM2 | | 1.0.2 |
| chopper | | 0.9.0 |
| GUNC | 1.0.5 | 1.0.6 |
| nanoq | | 0.10.0 |

### `Deprecated`
Expand Down
4 changes: 4 additions & 0 deletions CITATIONS.md
Expand Up @@ -40,6 +40,10 @@

> Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7), 1043–1055. doi: 10.1101/gr.186072.114
- [CheckM2](https://doi.org/10.1038/s41592-023-01940-w)

> Chklovski, A., Parks, D. H., Woodcroft, B. J., & Tyson, G. W. (2023). CheckM2: a rapid, scalable and accurate tool for assessing microbial genome quality using machine learning. Nature Methods, 20(8), 1203-1212. doi: https://doi.org/10.1038/s41592-023-01940-w
- [Chopper](https://doi.org/10.1093/bioinformatics/bty149)

> De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018 Aug 1;34(15):2666-2669. doi: 10.1093/bioinformatics/bty149
Expand Down
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -35,7 +35,7 @@ The pipeline then:
- performs assembly using [MEGAHIT](https://github.com/voutcn/megahit) and [SPAdes](http://cab.spbu.ru/software/spades/), and checks their quality using [Quast](http://quast.sourceforge.net/quast)
- (optionally) performs ancient DNA assembly validation using [PyDamage](https://github.com/maxibor/pydamage) and contig consensus sequence recalling with [Freebayes](https://github.com/freebayes/freebayes) and [BCFtools](http://samtools.github.io/bcftools/bcftools.html)
- predicts protein-coding genes for the assemblies using [Prodigal](https://github.com/hyattpd/Prodigal), and bins with [Prokka](https://github.com/tseemann/prokka) and optionally [MetaEuk](https://www.google.com/search?channel=fs&client=ubuntu-sn&q=MetaEuk)
- performs metagenome binning using [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/), [MaxBin2](https://sourceforge.net/projects/maxbin2/), and/or with [CONCOCT](https://github.com/BinPro/CONCOCT), and checks the quality of the genome bins using [Busco](https://busco.ezlab.org/), or [CheckM](https://ecogenomics.github.io/CheckM/), and optionally [GUNC](https://grp-bork.embl-community.io/gunc/).
- performs metagenome binning using [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/), [MaxBin2](https://sourceforge.net/projects/maxbin2/), and/or with [CONCOCT](https://github.com/BinPro/CONCOCT), and checks the quality of the genome bins using [Busco](https://busco.ezlab.org/), [CheckM](https://ecogenomics.github.io/CheckM/), or [CheckM2](https://github.com/chklovski/CheckM2) and optionally [GUNC](https://grp-bork.embl-community.io/gunc/).
- Performs ancient DNA validation and repair with [pyDamage](https://github.com/maxibor/pydamage) and [freebayes](https://github.com/freebayes/freebayes)
- optionally refines bins with [DAS Tool](https://github.com/cmks/DAS_Tool)
- assigns taxonomy to bins using [GTDB-Tk](https://github.com/Ecogenomics/GTDBTk) and/or [CAT](https://github.com/dutilh/CAT) and optionally identifies viruses in assemblies using [geNomad](https://github.com/apcamargo/genomad), or Eukaryotes with [Tiara](https://github.com/ibe-uw/tiara)
Expand Down
82 changes: 43 additions & 39 deletions bin/combine_tables.py
Expand Up @@ -3,10 +3,9 @@
## Originally written by Daniel Straub and Sabrina Krakau and released under the MIT license.
## See git repository (https://github.com/nf-core/mag) for full license text.


import sys
import argparse
import os.path
import sys

import pandas as pd


Expand All @@ -19,19 +18,14 @@ def parse_args(args=None):
metavar="FILE",
help="Bin depths summary file.",
)
parser.add_argument("-b", "--binqc_summary", metavar="FILE", help="BUSCO summary file.")
parser.add_argument("-q", "--quast_summary", metavar="FILE", help="QUAST BINS summary file.")
parser.add_argument("-g", "--gtdbtk_summary", metavar="FILE", help="GTDB-Tk summary file.")
parser.add_argument("-a", "--cat_summary", metavar="FILE", help="CAT table file.")
parser.add_argument(
"-b", "--busco_summary", metavar="FILE", help="BUSCO summary file."
)
parser.add_argument(
"-c", "--checkm_summary", metavar="FILE", help="CheckM summary file."
)
parser.add_argument(
"-q", "--quast_summary", metavar="FILE", help="QUAST BINS summary file."
)
parser.add_argument(
"-g", "--gtdbtk_summary", metavar="FILE", help="GTDB-Tk summary file."
"-t", "--binqc_tool", help="Bin QC tool used", choices=["busco", "checkm", "checkm2"]
)
parser.add_argument("-a", "--cat_summary", metavar="FILE", help="CAT table file.")

parser.add_argument(
"-o",
"--out",
Expand Down Expand Up @@ -81,9 +75,7 @@ def parse_cat_table(cat_table):
)
# merge all rank columns into a single column
df["CAT_rank"] = (
df.filter(regex="rank_\d+")
.apply(lambda x: ";".join(x.dropna()), axis=1)
.str.lstrip()
df.filter(regex="rank_\d+").apply(lambda x: ";".join(x.dropna()), axis=1).str.lstrip()
)
# remove rank_* columns
df.drop(df.filter(regex="rank_\d+").columns, axis=1, inplace=True)
Expand All @@ -95,39 +87,36 @@ def main(args=None):
args = parse_args(args)

if (
not args.busco_summary
and not args.checkm_summary
not args.binqc_summary
and not args.quast_summary
and not args.gtdbtk_summary
):
sys.exit(
"No summary specified! Please specify at least BUSCO, CheckM or QUAST summary."
"No summary specified! "
"Please specify at least BUSCO, CheckM, CheckM2 or QUAST summary."
)

# GTDB-Tk can only be run in combination with BUSCO or CheckM
if args.gtdbtk_summary and not (args.busco_summary or args.checkm_summary):
# GTDB-Tk can only be run in combination with BUSCO, CheckM or CheckM2
if args.gtdbtk_summary and not args.binqc_summary:
sys.exit(
"Invalid parameter combination: GTDB-TK summary specified, but no BUSCO or CheckM summary!"
"Invalid parameter combination: "
"GTDB-TK summary specified, but no BUSCO, CheckM or CheckM2 summary!"
)

# handle bin depths
results = pd.read_csv(args.depths_summary, sep="\t")
results.columns = [
"Depth " + str(col) if col != "bin" else col for col in results.columns
]
results.columns = ["Depth " + str(col) if col != "bin" else col for col in results.columns]
bins = results["bin"].sort_values().reset_index(drop=True)

if args.busco_summary:
busco_results = pd.read_csv(args.busco_summary, sep="\t")
if not bins.equals(
busco_results["GenomeBin"].sort_values().reset_index(drop=True)
):
if args.binqc_summary and args.binqc_tool == "busco":
busco_results = pd.read_csv(args.binqc_summary, sep="\t")
if not bins.equals(busco_results["GenomeBin"].sort_values().reset_index(drop=True)):
sys.exit("Bins in BUSCO summary do not match bins in bin depths summary!")
results = pd.merge(
results, busco_results, left_on="bin", right_on="GenomeBin", how="outer"
) # assuming depths for all bins are given

if args.checkm_summary:
if args.binqc_summary and args.binqc_tool == "checkm":
use_columns = [
"Bin Id",
"Marker lineage",
Expand All @@ -147,22 +136,37 @@ def main(args=None):
"4",
"5+",
]
checkm_results = pd.read_csv(args.checkm_summary, usecols=use_columns, sep="\t")
checkm_results = pd.read_csv(args.binqc_summary, usecols=use_columns, sep="\t")
checkm_results["Bin Id"] = checkm_results["Bin Id"] + ".fa"
if not bins.equals(
checkm_results["Bin Id"].sort_values().reset_index(drop=True)
):
if not set(checkm_results["Bin Id"]).issubset(set(bins)):
sys.exit("Bins in CheckM summary do not match bins in bin depths summary!")
results = pd.merge(
results, checkm_results, left_on="bin", right_on="Bin Id", how="outer"
) # assuming depths for all bins are given
results["Bin Id"] = results["Bin Id"].str.removesuffix(".fa")

if args.binqc_summary and args.binqc_tool == "checkm2":
use_columns = [
"Name",
"Completeness",
"Contamination",
"Completeness_Model_Used",
"Coding_Density",
"Translation_Table_Used",
"Total_Coding_Sequences",
]
checkm2_results = pd.read_csv(args.binqc_summary, usecols=use_columns, sep="\t")
checkm2_results["Name"] = checkm2_results["Name"] + ".fa"
if not set(checkm2_results["Name"]).issubset(set(bins)):
sys.exit("Bins in CheckM2 summary do not match bins in bin depths summary!")
results = pd.merge(
results, checkm2_results, left_on="bin", right_on="Name", how="outer"
) # assuming depths for all bins are given
results["Name"] = results["Name"].str.removesuffix(".fa")

if args.quast_summary:
quast_results = pd.read_csv(args.quast_summary, sep="\t")
if not bins.equals(
quast_results["Assembly"].sort_values().reset_index(drop=True)
):
if not bins.equals(quast_results["Assembly"].sort_values().reset_index(drop=True)):
sys.exit("Bins in QUAST summary do not match bins in bin depths summary!")
results = pd.merge(
results, quast_results, left_on="bin", right_on="Assembly", how="outer"
Expand Down
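
The CheckM and CheckM2 branches above swap the strict `bins.equals(...)` comparison for a subset check followed by an outer merge. A minimal sketch of that pattern on toy data follows. The helper name `merge_binqc`, the bin names and all values are hypothetical and not part of the commit; only the column names `bin`, `Name`, `Completeness` and `Contamination` come from the diff.

```python
import pandas as pd


def merge_binqc(depths: pd.DataFrame, qc: pd.DataFrame, name_col: str) -> pd.DataFrame:
    """Merge a bin QC table into the depths table (illustrative helper, not from the commit)."""
    qc = qc.copy()
    # The QC tools report bin names without the FASTA extension, so add it for matching.
    qc[name_col] = qc[name_col] + ".fa"
    # Subset check: every QC bin must be a known bin, but some bins may lack QC results.
    if not set(qc[name_col]).issubset(set(depths["bin"])):
        raise SystemExit("Bins in QC summary do not match bins in bin depths summary!")
    # Outer merge keeps bins without QC results; their QC columns become NaN.
    merged = pd.merge(depths, qc, left_on="bin", right_on=name_col, how="outer")
    merged[name_col] = merged[name_col].str.removesuffix(".fa")
    return merged


# Hypothetical toy inputs: bin2 has no CheckM2 result.
depths = pd.DataFrame(
    {"bin": ["bin1.fa", "bin2.fa", "bin3.fa"], "Depth sample1": [10.0, 4.5, 7.2]}
)
checkm2 = pd.DataFrame(
    {"Name": ["bin1", "bin3"], "Completeness": [95.1, 60.3], "Contamination": [1.2, 8.4]}
)

merged = merge_binqc(depths, checkm2, "Name")
print(merged)
```

With the subset check, a bin that has no QC row no longer aborts the script. It stays in the combined table with empty QC columns, which appears to be the intent given that `CHECKM2_PREDICT` failures with exit code 1 are ignored in `conf/base.config`.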
6 changes: 4 additions & 2 deletions conf/base.config
Expand Up @@ -160,12 +160,14 @@ process {
cpus = { 8 * task.attempt }
memory = { 20.GB * task.attempt }
}

withName: MAXBIN2 {
errorStrategy = { task.exitStatus in [1, 255] ? 'ignore' : 'retry' }
}

withName: DASTOOL_DASTOOL {
errorStrategy = { task.exitStatus in ((130..145) + 104) ? 'retry' : task.exitStatus == 1 ? 'ignore' : 'finish' }
}
//CheckM2 returns exit code 1 when Diamond doesn't find any hits
withName: CHECKM2_PREDICT {
errorStrategy = { task.exitStatus in (130..145) ? 'retry' : task.exitStatus == 1 ? 'ignore' : 'finish' }
}
}
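
To make the exit-code handling above easier to follow, here is a small Python model of the two `errorStrategy` closures. It is only an illustration of the decision logic, not how Nextflow evaluates it. The function and constant names are hypothetical, and the Groovy range `130..145` is taken to be inclusive.

```python
def error_strategy(exit_status: int, retry_codes: frozenset) -> str:
    """Return the strategy a task with this exit status would get (illustrative model)."""
    if exit_status in retry_codes:
        return "retry"
    if exit_status == 1:
        # For CHECKM2_PREDICT, exit code 1 means Diamond found no hits, so the task is skipped.
        return "ignore"
    return "finish"


# Groovy's 130..145 includes both ends, hence range(130, 146).
CHECKM2_RETRY = frozenset(range(130, 146))
DASTOOL_RETRY = frozenset(range(130, 146)) | {104}

for status in (1, 104, 137, 2):
    print(status, error_strategy(status, CHECKM2_RETRY), error_strategy(status, DASTOOL_RETRY))
```

The only difference between the two process selectors is exit code 104, which is retried for `DASTOOL_DASTOOL` but ends the run gracefully (`finish`) for `CHECKM2_PREDICT`.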
34 changes: 30 additions & 4 deletions conf/modules.config
Expand Up @@ -405,7 +405,11 @@ process {
withName: CHECKM_LINEAGEWF {
tag = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}_wf" }
publishDir = [path: { "${params.outdir}/GenomeBinning/QC/CheckM" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename }]
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: CHECKM_QA {
Expand All @@ -418,9 +422,31 @@ process {
]
}

withName: COMBINE_CHECKM_TSV {
ext.prefix = { "checkm_summary" }
publishDir = [path: { "${params.outdir}/GenomeBinning/QC" }, mode: params.publish_dir_mode, saveAs: { filename -> filename.equals('versions.yml') ? null : filename }]
withName: COMBINE_BINQC_TSV {
ext.prefix = { "${params.binqc_tool}_summary" }
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: CHECKM2_DATABASEDOWNLOAD {
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM2/checkm2_downloads" },
mode: params.publish_dir_mode, overwrite: false,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
enabled: params.save_checkm2_data
]
}

withName: CHECKM2_PREDICT {
ext.prefix = { "${meta.assembler}-${meta.binner}-${meta.domain}-${meta.refinement}-${meta.id}" }
publishDir = [
path: { "${params.outdir}/GenomeBinning/QC/CheckM2" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: GUNC_DOWNLOADDB {
Expand Down
2 changes: 1 addition & 1 deletion conf/test_adapterremoval.config
Expand Up @@ -31,7 +31,7 @@ params {
skip_krona = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2024-01-08.tar.gz"
binqc_tool = 'checkm'
skip_gtdbtk = true
gtdbtk_min_completeness = 0.01
clip_tool = 'adapterremoval'
Expand Down
3 changes: 1 addition & 2 deletions conf/test_bbnorm.config
Expand Up @@ -36,8 +36,7 @@ params {
skip_krona = true
min_length_unbinned_contigs = 1
max_unbinned_contigs = 2
busco_db = "https://busco-data.ezlab.org/v5/data/lineages/bacteria_odb10.2024-01-08.tar.gz"
busco_clean = true
binqc_tool = 'checkm2'
skip_gtdbtk = true
gtdbtk_min_completeness = 0.01
bbnorm = true
Expand Down
30 changes: 28 additions & 2 deletions docs/output.md
Expand Up @@ -554,7 +554,7 @@ Besides the reference files or output files created by BUSCO, the following summ

#### CheckM

[CheckM](https://ecogenomics.github.io/CheckM/) CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage
[CheckM](https://ecogenomics.github.io/CheckM/) provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.

By default, nf-core/mag runs CheckM with the `check_lineage` workflow that places genome bins on a reference tree to define lineage-marker sets, to check for completeness and contamination based on lineage-specific marker genes, and then runs `qa` to generate the summary files.

Expand All @@ -564,7 +564,8 @@ By default, nf-core/mag runs CheckM with the `check_lineage` workflow that place
- `GenomeBinning/QC/CheckM/`
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_qa.txt`: Detailed statistics about bins informing completeness and contamination scores (output of `checkm qa`). This should normally be your main file to use to evaluate your results.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]_wf.tsv`: Overall summary file for completeness and contamination (output of `checkm lineage_wf`).
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: Intermediate files for CheckM results, including CheckM generated annotations, log, lineage markers etc.
- `GenomeBinning/QC/`
- `checkm_summary.tsv`: A summary table of the CheckM results for all bins (output of `checkm qa`).

</details>
Expand All @@ -580,6 +581,31 @@ If the parameter `--save_checkm_reference` is set, additionally the used the Che

</details>

#### CheckM2

[CheckM2](https://github.com/chklovski/CheckM2) is a tool for assessing the quality of metagenome-derived genomes. It uses a machine learning approach to predict the completeness and contamination of a genome regardless of its taxonomic lineage.

<details markdown="1">
<summary>Output files</summary>

- `GenomeBinning/QC/CheckM2/`
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/quality_report.tsv`: Detailed statistics about bins informing completeness and contamination scores. This should normally be your main file to use to evaluate your results.
- `[assembler]-[binner]-[domain]-[refinement]-[sample/group]/`: Intermediate files for CheckM2 results, including CheckM2 generated annotations, log, and DIAMOND alignment results.
- `GenomeBinning/QC/`
- `checkm2_summary.tsv`: A summary table of the CheckM2 results for all bins.

</details>

If the parameter `--save_checkm2_data` is set, the CheckM2 reference datasets will be stored in the output directory.

<details markdown="1">
<summary>Output files</summary>

- `GenomeBinning/QC/CheckM2/`
- `checkm2_downloads/CheckM2_database/*.dmnd`: Diamond database used by CheckM2.

</details>
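
As a rough illustration of how the CheckM2 summary could be used downstream, the sketch below filters bins by completeness and contamination with pandas. It assumes the summary table carries the `Name`, `Completeness` and `Contamination` columns that `bin/combine_tables.py` reads. The thresholds (90 and 5), the helper name `select_bins` and the toy rows are hypothetical; in practice you would read `GenomeBinning/QC/checkm2_summary.tsv` instead of the inline string.

```python
import io

import pandas as pd


def select_bins(summary, min_completeness=90.0, max_contamination=5.0):
    """Keep bins meeting both thresholds (illustrative defaults, not pipeline settings)."""
    keep = (summary["Completeness"] >= min_completeness) & (
        summary["Contamination"] <= max_contamination
    )
    return summary.loc[keep].reset_index(drop=True)


# Hypothetical stand-in for checkm2_summary.tsv.
toy_tsv = (
    "Name\tCompleteness\tContamination\n"
    "binA\t97.5\t1.1\n"
    "binB\t62.0\t0.4\n"
    "binC\t93.2\t12.8\n"
)
summary = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

selected = select_bins(summary)
print(selected)
```

Only `binA` passes both cut-offs here: `binB` is too incomplete and `binC` too contaminated.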

#### GUNC

[Genome UNClutterer (GUNC)](https://grp-bork.embl-community.io/gunc/index.html) is a tool for detection of chimerism and contamination in prokaryotic genomes resulting from mis-binning of genomic contigs from unrelated lineages. It does so by applying an entropy-based score on taxonomic assignment and contig location of all genes in a genome. It is generally considered an additional complement to CheckM results.
Expand Down