Skip to content

Clair3-RNA - a long-read small variant caller for RNA sequencing data

License

Notifications You must be signed in to change notification settings

HKU-BAL/Clair3-RNA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

63 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Clair3-RNA - A deep learning-based small variant caller for long-read RNA sequencing data

License

Contact: Ruibang Luo, Zhenxian Zheng
Email: {rbluo,zxzheng}@cs.hku.hk


Introduction

Clair3-RNA is a small variant caller for long-read RNA sequencing (lrRNA-seq) data. Clair3-RNA supports ONT complementary DNA sequencing (cDNA) and direct RNA sequencing (dRNA). dRNA sequencing support the ONT latest SQK-RNA004 kit data for variant calling. Clair3-RNA also supports PacBio Sequel and PacBio MAS-Seq RNA sequencing data. Clair3-RNA reached a ~95% F1-score for ONT dRNA using SQK-RNA004 kit and ~96% F1-score using PacBio Iso-Seq and MAS-Seq, respectively, with at least ten supporting reads and disregarding the zygosity. With read phased, the performance reached ~97% for ONT and ~98% for PacBio.

A preprint describing Clair3-RNA's algorithms and results is at bioRxiv.

For germline small variant calling, please use Clair3.

For somatic small variant calling using a tumor-normal pair, please try ClairS.

For somatic small variant calling using tumor sample only, please try ClairS-TO.


Contents


Latest Updates

v0.2.1 (Dec 2, 2024): 1. Fixed a bug that misses some variants when --print_ref_calls is enabled (#6). 2. Added the --enable_variant_calling_at_sequence_head_and_tail option to enable variant calling at the head and tail 16bp of each sequence. Use with caution because alignments are less reliable in the regions, and there would be insufficient context to be fed to the neural network for reliable calling.

v0.2.0 (Nov 18, 2024) : 1. Added a new pileup phasing model (enable by using --enable_phasing_model opiton) for ONT dRNA004(ont_dorado_drna004), PacBio Iso-Seq(hifi_sequel2_minimap2), and PacBio MAS-Seq(hifi_mas_minimap2), the SNP performance improved by ~2% and Indel performance improved by ~6%. 2. Fixed some formatting issues in the calling workflow.

v0.1.0 (Aug 15, 2024) : 1. Added a new ONT dRNA004 direct RNA sequencing model (ont_dorado_drna004) for SQK-RNA004 kit. 2. Added new PacBio Sequel (hifi_sequel2_minimap2) and Revio (hifi_mas_minimap2) model to support minimap2 alignment. 3. Enhance model training techniques to boost performance by incorporating strategies such as managing low-coverage sites, verifying variant zygosity, filtering RNA editing sites, etc. 4. Renamed all ONT and PacBio model names, check here for more details.

v0.0.1 (Nov 27, 2023): Initial release for early access.


Quick Demo

Quick start

After following installation, you can run Clair3-RNA with one command:

./run_clair3_rna -B input.bam -R ref.fa -o output -t 8 -p ont_dorado_drna004
## Final output file: output/output.vcf.gz

Check Usage for more options.



Pre-trained Models

Clair-RNA was trained using GIAB RNA sequencing data. All models were trained with chr20 excluded (including only chr1-19, 21, 22).

Platform Chemistry/Kit/Instruments Basecaller Aligner Option (-p/--platform) Reference Training samples
ONT SQK-RNA004 kit, direct RNA sequencing Dorado minimap2 ont_dorado_drna004 GRCh38 HG002
ONT SQK-RNA002 kit, direct RNA sequencing Guppy minimap2 ont_guppy_drna002 GRCh38 HG002
ONT R9.4.1, complementary DNA sequencing Guppy minimap2 ont_guppy_cdna GRCh38 HG002
PacBio HiFi Sequel with Iso-Seq kit - pbmm2/minimap2 hifi_sequel2_pbmm2, hifi_sequel2_minimap2 GRCh38 HG002
PacBio HiFi Revio with MAS-Seq kit - pbmm2/minimap2 hifi_mas_pbmm2, hifi_mas_minimap2 GRCh38 HG002

Installation

Option 1. Docker pre-built image

A pre-built docker image is available at DockerHub.

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR in docker.

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clair3-rna:latest \
  /opt/bin/run_clair3_rna \
  --bam_fn ${INPUT_DIR}/input.bam \      ## use your input bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \         ## use your reference file name here
  --threads ${THREADS} \                 ## maximum threads to be used
  --platform ${PLATFORM} \               ## options: {ont_dorado_drna004, ont_guppy_drna002, ont_guppy_cdna, hifi_sequel2_pbmm2, hifi_sequel2_minimap2, hifi_mas_pbmm2, hifi_sequel2_minimap2}
  --tag_variant_using_readiportal \      ## optional, tag variants using REDIportal dataset
  --enable_phasing_model \               ## optional, enable calling using phasing model
  --output_dir ${OUTPUT_DIR}             ## output path prefix 

Check Usage for more options.

Option 2. Singularity

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR in singularity.

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
mkdir -p ${OUTPUT_DIR}

conda config --add channels defaults
conda create -n singularity-env -c conda-forge singularity -y
conda activate singularity-env

# singularity pull docker pre-built image
singularity pull docker://hkubal/clair3-rna:latest

# run the sandbox like this afterward
singularity exec \
  -B ${INPUT_DIR},${OUTPUT_DIR} \
  clair3-rna_latest.sif \
  /opt/bin/run_clair3_rna \
  --bam_fn ${INPUT_DIR}/input.bam \            ## use your input bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \               ## use your reference file name here
  --threads ${THREADS} \                       ## maximum threads to be used
  --platform ${PLATFORM} \                     ## options: {ont_dorado_drna004, ont_guppy_drna002, ont_guppy_cdna, hifi_sequel2_pbmm2, hifi_sequel2_minimap2, hifi_mas_pbmm2, hifi_sequel2_minimap2}
  --tag_variant_using_readiportal \            ## optional, tag variants using REDIportal dataset
  --enable_phasing_model \                     ## optional, enable calling using phasing model 
  --output_dir ${OUTPUT_DIR} \                 ## output path prefix
  --conda_prefix /opt/conda/envs/clair3_rna

Option 3. Build an anaconda virtual environment

Anaconda install:

Please install anaconda using the official guide or using the commands below:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x ./Miniconda3-latest-Linux-x86_64.sh 
./Miniconda3-latest-Linux-x86_64.sh

Install Clair3-RNA using anaconda step by step:

# create and activate an environment named clair3_rna
# install pypy and packages in the environment
conda create -n clair3_rna -c conda-forge -c bioconda clair3 mosdepth bedtools -y
source activate clair3_rna

git clone https://github.com/HKU-BAL/Clair3-RNA.git
cd Clair3-RNA

# make sure in conda environment
# download pre-trained models
echo ${CONDA_PREFIX}
mkdir -p ${CONDA_PREFIX}/bin/clair3_rna_models
wget http://www.bio8.cs.hku.hk/clair3_rna/models/clair3_rna_models.tar.gz
tar -zxvf clair3_rna_models.tar.gz -C ${CONDA_PREFIX}/bin/clair3_rna_models/

./run_clair3_rna --help

Option 4. Docker Dockerfile

This is the same as option 1 except that you are building a docker image yourself. Please refer to option 1 for usage.

git clone https://github.com/HKU-BAL/Clair3-RNA.git
cd Clair-RNA

# build a docker image named hkubal/clairs:latest
# might require docker authentication to build docker image
docker build -f ./Dockerfile -t hkubal/clair3-rna:latest .

# run the docker image like option 1
docker run -it hkubal/clair3-rna:latest /opt/bin/clair3_rna --help

Usage

General Usage

./run_clair3_rna \
  --bam_fn ${INPUT_DIR}/input.bam \          ## use your input bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \             ## use your reference file name here
  --threads ${THREADS} \                     ## maximum threads to be used
  --platform ${PLATFORM} \                   ## options: {ont_dorado_drna004, ont_guppy_drna002, ont_guppy_cdna, hifi_sequel2_pbmm2, hifi_sequel2_minimap2, hifi_mas_pbmm2, hifi_sequel2_minimap2}
  --tag_variant_using_readiportal \          ## optional, tag variants using REDIportal dataset
  --enable_phasing_model \                   ## optional, enable calling using phasing model
  --output_dir ${OUTPUT_DIR}                 ## output path prefix
 
## Final output file: ${OUTPUT_DIR}/output.vcf.gz

Options

Required parameters:

  -B BAM_FN, --bam_fn BAM_FN
                        RNA BAM file input. The input file must be samtools indexed.
  -R REF_FN, --ref_fn REF_FN
                        FASTA reference file input. The input file must be samtools indexed.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        VCF output directory.
  -t THREADS, --threads THREADS
                        Max #threads to be used.
  -p PLATFORM, --platform PLATFORM
                        Select the sequencing platform of the input. Possible options: {ont_dorado_drna004, ont_guppy_drna002, ont_guppy_cdna, hifi_sequel2_pbmm2, hifi_sequel2_minimap2, hifi_mas_pbmm2, hifi_sequel2_minimap2}.

Miscellaneous parameters:

  -P PILEUP_MODEL_PATH, --pileup_model_path PILEUP_MODEL_PATH
                        Specify the path prefix to your own pileup model. Including ${pileup_model_path}.data-00000-of-00001, ${pileup_model_path}.index.
  -c CTG_NAME, --ctg_name CTG_NAME
                        The name of the contigs to be processed. Split by ',' for multiple contigs. Default: call in chr{1..22} and {1..22}.
  -r REGION, --region REGION
                        A region to be processed. Format: `ctg_name:start-end` (start is 1-based).
  -b BED_FN, --bed_fn BED_FN
                        Path to a BED file. Call variants only in the provided BED regions.
  -G GENOTYPING_MODE_VCF_FN, --genotyping_mode_vcf_fn GENOTYPING_MODE_VCF_FN
                        VCF file input containing candidate sites to be genotyped. Variants will only be called at the sites in the VCF file if provided.
  -q QUAL, --qual QUAL  If set, variants with >QUAL will be marked as PASS, or LowQual otherwise.
  --enable_phasing_model
                        Enable phasing with whatshap or longphase. Usually leads to performance improvement when coverage is sufficient. Default: False.  
  --snp_min_af SNP_MIN_AF
                        Minimal SNP AF required for a variant to be called. Decrease SNP_MIN_AF might increase a bit of sensitivity, but in trade of precision, speed and accuracy. Default: 0.08.
  --indel_min_af INDEL_MIN_AF
                        Minimum Indel AF required for a candidate variant to be called. Default: ont:0.15,hifi:0.08.
  --min_coverage MIN_COVERAGE
                        Minimal coverage required for a variant to be called. Default: 4.
  --tag_variant_using_readiportal
                        Tag variants uisng REDIportal dataset, If set, called variants that are also in the readiportal dataset will be marked as "RNAEditing". Default: disable.
  --readiportal_database_filter_tag READIPORTAL_DATABASE_FILTER_TAG
                        Use only editing sites with these tags in the readiportal dataset, split by ":" for multiple tags. Default: using sites supported by two or more sources - "A,D:A,R:A,R,D".
  --readiportal_reference_genome_version READIPORTAL_REFERENCE_GENOME_VERSION
                        Select the reference genome version in the readiportal dataset. Possible options: {grch38, grch37}. Default: "grch38".
  --chunk_size CHUNK_SIZE
                        The size of each chuck for parallel processing. Default: 5000000.
  -s SAMPLE_NAME, --sample_name SAMPLE_NAME
                        Define the sample name to be shown in the VCF file. Default: SAMPLE.
  --output_prefix OUTPUT_PREFIX
                        Prefix for output VCF filename. Default: output.
  --remove_intermediate_dir
                        Remove intermediate directory before finishing to save disk space.
  --include_all_ctgs    Call variants on all contigs, otherwise call in chr{1..22} and {1..22}.
  --print_ref_calls     Show reference calls (0/0) in VCF file.
  -d, --dry_run         Print the commands that will be ran.
  --min_mq MIN_MQ       Minimal mapping quality required for an alignment to be considered. Default: 5.
  --phased_pileup_model_path PHASED_PILEUP_MODEL_PATH
                        Specify the path prefix to your own pileup phasing model. Including ${phased_pileup_model_path}.data-00000-of-00001, ${phased_pileup_model_path}.index.
  --python PYTHON       Absolute path of python, python3 >= 3.9 is required.
  --pypy PYPY           Absolute path of pypy3, pypy3 >= 3.6 is required.
  --samtools SAMTOOLS   Absolute path of samtools, samtools version >= 1.10 is required.
  --parallel PARALLEL   Absolute path of parallel, parallel >= 20191122 is required.
  --longphase LONGPHASE
                        Absolute path of longphase, longphase >= 1.7 is required.
  --whatshap WHATSHAP   Absolute path of whatshap, whatshap >= 1.0 is required.
  --use_longphase_for_intermediate_phasing USE_LONGPHASE_FOR_INTERMEDIATE_PHASING
                        Use longphase for intermediate phasing. Default:False.
  --use_longphase_for_intermediate_haplotagging USE_LONGPHASE_FOR_INTERMEDIATE_HAPLOTAGGING
                        Use longphase for intermediate haplotagging. Default:False.
  --enable_variant_calling_at_sequence_head_and_tail
                        EXPERIMENTAL: Enable variant calling at the head and tail 16bp of each sequence. Default: disable.

Call variants in one or multiple chromosomes using the -C/--ctg_name parameter

./run_clair3_rna -B input.bam -R ref.fa -o output -t 8 -p ont_dorado_drna004 -C chr21,chr22

Call variants in one specific region using the -r/--region parameter

./run_clair3_rna -B input.bam -R ref.fa -o output -t 8 -p ont_dorado_drna004 -r chr20:1000000-2000000

Call variants at interested variant sites (genotyping) using the -G/--genotyping_mode_vcf_fn parameter

./run_clair3_rna -B input.bam -R ref.fa -o output -t 8 -p ont_dorado_drna004 -G input.vcf

Call variants in the BED regions using the -b/--bed_fn parameter

We highly recommended using BED file to define multiple regions of interest like:

./run_clair3_rna -B input.bam -R ref.fa -o output -t 8 -p ont_dorado_drna004 -b input.bed

Tag RNA editing site using REDIportal dataset

RNA undergoes editing by ADAR (adenosine deaminases acting on RNA), resulting in Adenosine-to-inosine (A-to-I) changes. These A-to-I changes can be observed in RNA-seq datasets as A-to-G and T-to-C changes, which do not represent genuine RNA variants. To address this, we provide users with the option to utilize external datasets such as REDIportal to annotate RNA editing sites. In Clair3-RNA's VCF output, variants that are also RNA editing sites reported in REDIportal can be tagged. These sites will be marked as RNAEditing instead of PASS in the FILTER column when the --tag_variant_using_readiportal option is enabled.

Caution: --tag_variant_using_readiportal option currently works for GRCh38 and GRCh37 reference genome only, use can specify the reference genome version by using option --readiportal_reference_genome_version={grch38, grch37}.