1. Deployment

Philip Reiner Kensche edited this page Oct 25, 2021 · 5 revisions

The workflow is based on the workflow management system Roddy. In order to run the workflow you need a working installation of Roddy. If you have never worked with Roddy before, please read about Roddy and its installation in the Roddy documentation, in particular about how to resolve plugin dependencies.

Roddy Version and Dependent Plugin Versions

The specific Roddy and COWorkflowsBasePlugin versions needed for the workflow are listed in the buildinfo.txt file associated with the workflow version that you want to install. You should use a tagged version of the workflow to ensure the information in that file is up to date.
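As an illustration, the relevant entries could be pulled out of buildinfo.txt with a small shell function. This is only a sketch: it assumes that the file consists of simple key=value lines and uses keys such as dependson and RoddyAPIVersion, which is not confirmed here, so check the file of your workflow version first. The function name show_required_versions is hypothetical.

```shell
# Hypothetical helper: print the version-related lines of a buildinfo.txt.
# Assumption: the file consists of key=value lines and uses the keys
# "dependson" and "RoddyAPIVersion"; adjust the pattern if your file differs.
show_required_versions() {
    buildinfo_file="$1"
    grep -E '^(dependson|RoddyAPIVersion)=' "$buildinfo_file"
}
```

Assuming the file sits in the top level of the plugin directory, it could be called as show_required_versions "$PATH_TO_PLUGIN_DIRECTORY/buildinfo.txt".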

Conda

The workflow contains a description of a Conda environment as a Conda YAML file. A number of Conda packages from BioConda are required. You should set up the Conda environment in a central location that is accessible from all compute hosts.

First configure the BioConda channels:

conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels bioconda-legacy
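The order of the commands above matters, because conda config --add prepends a channel to the channel list, so the channel added last ends up with the highest priority. The following sketch computes the resulting priority order without touching any Conda configuration. It assumes that none of the channels was configured before; the function name priority_order is hypothetical.

```shell
# Channels in the order in which they are added above.
channels="r defaults conda-forge bioconda bioconda-legacy"

# Hypothetical helper: simulate the prepend behaviour of
# `conda config --add channels` and print the channels from
# highest to lowest priority.
priority_order() {
    result=""
    for channel in $1; do
        # Each newly added channel goes in front of the previous ones.
        result="$channel${result:+ $result}"
    done
    echo "$result"
}

priority_order "$channels"
```

To compare this with the list that is actually configured on your system, run conda config --show channels.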

Then create the environment with a command like

conda env create -n AlignmentAndQCWorkflows -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/qcAnalysis/environments/conda.yml

The name of the Conda environment is arbitrary but needs to be consistent with the condaEnvironmentName variable in the configuration. The default for that variable is set in resources/configurationFiles/qcAnalysis.xml.

We successfully tested the Conda environment, created as described above, on WGS data with the parameters useBioBamBamSort=false, markDuplicatesVariant=sambamba, workflowEnvironmentScript=workflowEnvironment_conda and condaEnvironmentName=AlignmentAndQCWorkflows.
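One way to pass these parameters is on the Roddy command line. The sketch below only assembles the parameter string. It assumes that your Roddy version accepts configuration values as a comma-separated list of name:value pairs via a --cvalues option; please verify this against the Roddy documentation. The variable names are illustrative.

```shell
# Parameter set from the tested configuration, as name:value pairs.
params="useBioBamBamSort:false markDuplicatesVariant:sambamba workflowEnvironmentScript:workflowEnvironment_conda condaEnvironmentName:AlignmentAndQCWorkflows"

# Join the pairs with commas (assumed --cvalues syntax, see the note above).
cvalues="$(echo "$params" | tr ' ' ',')"

# Print the option as it would be appended to a Roddy call.
echo "--cvalues=\"$cvalues\""
```

Alternatively, the same values can be set in your configuration XML, which keeps them under version control together with the rest of the project configuration.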

A Note on the Conda Environment

For the AlignmentAndQCWorkflows plugin, there are the following differences between the DKFZ-ODCF software stack, which is reflected in the resources/analysisTools/qcPipeline/environments/tbi-lsf-cluster.sh environment file and in the XML configurations with the _VERSION variables, and the Conda environment:

Package | DKFZ version | Conda version | Comment
biobambam | 0.0.148 | 2.0.79 | As long as you do not select markDuplicatesVariant=biobambam this won't be a problem, as biobambam is only used for sorting BAMs. Note further, we did not manage to get bamsort 2 from Conda to run on a CentOS 7 VM. You can also use useBioBamBamSort=false to sort with samtools.
picard | 1.125 | 1.126 | Probably no big deal.
bwa | patched 0.7.8 | 0.7.8 | For the WGBS workflow we currently use a patched version of BWA that does not check for the "/1" and "/2" first and second read marks. This version is not available in BioConda and thus the WGBS workflow won't work with the Conda environment.
R | 3.4.0 | 3.4.1 | Probably no big deal.
trimmomatic | 0.30 | 0.33 |

Note further that the Conda environment is probably outdated. Its packages may not be compatible with recent Conda versions or may even have been removed from the referenced channels. It might be possible to fix this by including the bioconda-legacy channel. Because of these and other problems that make Conda (alone) almost unusable for ensuring reproducibility, we cannot provide much support for the environments.

WGBS Data Processing and methylCtools

The current implementation of the WGBS workflow uses methylCtools, which requires a patched BWA version. Note that the methylCtools version in this repository differs from the one in the official GitHub repository.

Recompiling the D-based Components

Two programs in this repository -- genomeCoverage.d and coverageQc.d -- were written in the programming language D and are provided as binaries and source code. If the need arises to recompile them, you can find the build instructions in resources/analysisTools. For the compilation you will need

  • the D compiler LDC 0.12.1
  • and BioD master branch (commit 8b633de)

mbuffer

The program mbuffer is used as a more powerful alternative to tee and to buffer large amounts of data against temporary I/O slowdowns. Unfortunately, in particular in old versions of the workflow, some mbuffer calls are part of chains of piped commands, and errors associated with mbuffer are not caught correctly. You may encounter the following situation:

The "alignAndPairSlim" or "mergeAndMarkDuplicatesSlim" job does not finish, and the processes still running in the job are blocked (no I/O, no CPU). The actual alignment with bwa and flags_isizes_PEaberrations.pl have finished without errors. Other processes are still running, in particular genomeCoverage, while filter_readbins.pl has ended with an error message saying that there is no input ("from 0 lines, kept 0 with selected chromosomes"). The problem may be that the mbuffer between the latter two processes never started to connect genomeCoverage and filter_readbins.pl, because it could not allocate memory due to a "blocksize" that is too large. Newer versions of mbuffer choose this value dynamically.

To fix this problem, set the block size in ~/.mbuffer.rc for the user that executes the workflow on the cluster:

blocksize = 4096
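The setting above could be applied with a small shell function that can safely be run more than once. This is a minimal sketch for a POSIX shell with grep and sed; the function name ensure_mbuffer_blocksize and its optional file argument are hypothetical and not part of the workflow.

```shell
# Hypothetical helper: make sure the mbuffer configuration file
# (default: ~/.mbuffer.rc) contains the line "blocksize = 4096".
ensure_mbuffer_blocksize() {
    rc_file="${1:-$HOME/.mbuffer.rc}"
    touch "$rc_file"
    if grep -Eq '^[[:space:]]*blocksize[[:space:]]*=' "$rc_file"; then
        # Replace an existing blocksize line and keep all other settings.
        tmp_file="$(mktemp)"
        sed -E 's/^[[:space:]]*blocksize[[:space:]]*=.*/blocksize = 4096/' "$rc_file" > "$tmp_file"
        # Write back through redirection to keep the file's permissions.
        cat "$tmp_file" > "$rc_file"
        rm -f "$tmp_file"
    else
        # No blocksize setting yet: append it.
        echo 'blocksize = 4096' >> "$rc_file"
    fi
}
```

Calling ensure_mbuffer_blocksize without arguments, as the user that runs the workflow, updates ~/.mbuffer.rc. The file must be visible on the compute nodes, which is the case if the home directory is shared across the cluster.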