Compactors

Adam Gudyś edited this page Nov 6, 2024 · 2 revisions

Compactors is a new statistical approach to local seed-based assembly. It is part of the SPLASH package and is particularly suited to assembling regions that are diverse across samples (see figure below). However, it can also be used as an independent assembler on any type of seeds provided by the user.

[Figure: compactors-idea-v2]

Installation

The software is distributed as part of the SPLASH package, as a separate binary named compactors. To install compactors, follow the SPLASH installation instructions.

Toy examples

Short reads

Generate compactors from the samples listed in fastq.list, seeded at the anchors contained in anchors.tsv, and store them in compactors.tsv. Compactors are extended by segments of 2 * 27 = 54 nucleotides (num_kmers * kmer_len). Assuming the anchors are also 27 nucleotides long, this configuration is suitable for short reads of at least 81 bases (27 + 54). The maximum compactor length is set to 2000 bases by default.

compactors fastq.list anchors.tsv compactors.tsv
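The length arithmetic above can be sketched in a few lines of Python. The helper names are illustrative and are not part of the compactors tool; the formula only restates the description above (anchor length plus num_kmers * kmer_len).

```python
def segment_length(num_kmers: int = 2, kmer_len: int = 27) -> int:
    """Number of nucleotides added by one compactor segment."""
    return num_kmers * kmer_len


def min_read_length(anchor_len: int = 27, num_kmers: int = 2,
                    kmer_len: int = 27) -> int:
    """Shortest read that can hold an anchor followed by one full segment."""
    return anchor_len + segment_length(num_kmers, kmer_len)


if __name__ == "__main__":
    print(segment_length())   # 54 with the default settings
    print(min_read_length())  # 81 with 27-base anchors
```

The same helper can be used to check whether a chosen --num_kmers and --kmer_len combination fits the read lengths in a dataset before running the tool.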

Increased sensitivity

In the following example, several parameters have been altered to increase the sensitivity of compactors at the cost of a higher false positive rate.

compactors fastq.list anchors.tsv compactors.tsv --epsilon 0.01 --beta 1 --lower_bound 2

More extensions

The stringency of compactor extension can be loosened by decreasing the specificity requirement --min_extender_specificity. Additionally, by default the algorithm considers only the last k-mer as a potential extender. However, the k-mers preceding it can also be considered by redefining the --num_extenders and --extenders_shift parameters, at the cost of increased computation time. The procedure operates from the end of a compactor and stops at the first extender fulfilling the requirement. In the following example, the 27-mers at shifts of 0, 5, and 10 nucleotides from the current compactor's end are checked against the specificity criterion (which has also been loosened slightly).

compactors fastq.list anchors.tsv compactors.tsv --min_extender_specificity 0.8 --num_extenders 3 --extenders_shift 5
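To make the candidate positions concrete, here is a minimal Python sketch of how the extender candidates could be enumerated. It is an illustration of the description above, not the tool's actual implementation: the function names are hypothetical, and the specificity measure is passed in as a caller-supplied function because its definition is not given here.

```python
from typing import Callable, Iterator, Optional, Tuple


def extender_candidates(compactor: str, kmer_len: int = 27,
                        num_extenders: int = 1,
                        extenders_shift: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (shift, k-mer) pairs, starting from the very end of the compactor."""
    for i in range(num_extenders):
        shift = i * extenders_shift
        end = len(compactor) - shift
        start = end - kmer_len
        if start < 0:  # compactor too short for this shift
            break
        yield shift, compactor[start:end]


def first_specific_extender(compactor: str,
                            specificity: Callable[[str], float],
                            min_extender_specificity: float = 0.9,
                            **kwargs) -> Optional[Tuple[int, str]]:
    """Return the first candidate whose specificity exceeds the threshold.

    `specificity` is a hypothetical stand-in for the tool's own measure.
    """
    for shift, kmer in extender_candidates(compactor, **kwargs):
        if specificity(kmer) > min_extender_specificity:
            return shift, kmer
    return None
```

With num_extenders=3 and extenders_shift=5, the generator yields the 27-mers ending 0, 5, and 10 bases before the compactor's end, and the search stops at the first one that passes.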

Long reads

When analyzing long-read data, the length of a compactor segment can be increased significantly. In the example below, it is set to 2700 bases (num_kmers * kmer_len = 100 * 27), while the maximum length of a compactor is increased to 100 000 bases.

compactors fastq.list anchors.tsv compactors.tsv --num_kmers 100 --kmer_len 27 --max_length 100000

Parameters

A short help message is printed when the executable is run without arguments.

compactors [options] <fastq_list> <anchors_tsv> <compactors_tsv>

Positional parameters:

  • fastq_list - input file with a list of FASTQ/FASTA files to be queried for anchors (one per line)
  • anchors_tsv - input tsv file with anchors (seeds)
  • compactors_tsv - output tsv with compactors
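The fastq_list file described above is a plain text file with one path per line. As a small sketch, it could be generated from Python as follows; the function name and the example file names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Union


def write_input_list(paths: Iterable[Union[str, Path]],
                     list_path: Union[str, Path] = "fastq.list") -> Path:
    """Write one input path per line, as expected for the fastq_list argument."""
    out = Path(list_path)
    out.write_text("".join(f"{Path(p)}\n" for p in paths))
    return out


if __name__ == "__main__":
    # Hypothetical sample files; replace with real FASTQ/FASTA paths.
    write_input_list(["sample_1.fastq", "sample_2.fastq"])
```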

Options:

  • --input_format <fastq|fasta> - input format (default: fastq)

  • --num_kmers <int> - number of kmers in a compactor segment (default: 2, does not include an anchor/seed)

  • --kmer_len <int> - length of kmers in a compactor segment (default: 27, max: 31)

  • --epsilon <real> - sequencing error (default: 0.05)

  • --beta <real> - beta parameter for active set generation, lower values increase sensitivity (default: 5)

  • --lower_bound <int> - minimum kmer abundance to add it to an active set, lower values increase sensitivity (default: 10)

  • --max_mismatch <int> - maximum mismatch count for compactor candidates (default: 4)

  • --all_anchors - find all anchors' occurrences in a read, not just the first one (default: off)

  • --no_extension - disable recursive extension (default: enabled)

  • --max_length <int> - maximum compactor length in bases (used only with recursion; default: 2000)

  • --min_extender_specificity <real> - minimum extender specificity for current anchor to allow extensions (default: 0.9)

  • --num_extenders <int> - number of extender candidates to be verified starting from the very end of the compactor (default: 1)

  • --extenders_shift <int> - shift in bases between extender candidates to be verified (default: 1)

  • --max_anchor_compactors <int> - maximum number of compactors that can originate from an anchor (default: 1000)

  • --max_child_compactors <int> - maximum number of child compactors produced at each extension step (default: 20)

  • --extend_all - if multiple compactors for a given anchor end with the same extender, all are extended (default: off)

  • --out_fasta <name> - name of the optional compactor FASTA (not generated by default)

  • --no_subcompactors - do not include subcompactors in the output TSV (default: off)

  • --cumulated_stats - include columns with cumulated stats in the output TSV (default: off)

  • --independent_outputs - run compactors independently on input FASTQ files (default: off); output file names are preceded by the name of input FASTQs

  • --num_threads <int> - number of threads

  • --reads_buffer_gb <int> - size of the read buffer in GB (default: 24)

  • --keep_temp - keep temporary files after the analysis and reuse previously generated temporary files (if they exist) during the analysis (default: off)
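When compactors is driven from a script, the options above can be assembled into an argument list. The following is a minimal sketch, assuming the option names listed above; build_command is a hypothetical helper, and the only validation it performs is the --kmer_len limit of 31 stated above.

```python
from typing import List

MAX_KMER_LEN = 31  # upper limit stated for --kmer_len


def build_command(fastq_list: str, anchors_tsv: str, compactors_tsv: str,
                  **options) -> List[str]:
    """Build an argument list: positional parameters first, then options.

    Boolean True adds a bare flag (e.g. all_anchors=True -> --all_anchors);
    None or False skips the option; other values are passed as strings.
    """
    kmer_len = options.get("kmer_len")
    if kmer_len is not None and not 1 <= int(kmer_len) <= MAX_KMER_LEN:
        raise ValueError(f"kmer_len must be between 1 and {MAX_KMER_LEN}")
    cmd = ["compactors", fastq_list, anchors_tsv, compactors_tsv]
    for name, value in options.items():
        if value is None or value is False:
            continue
        cmd.append(f"--{name}")
        if value is not True:
            cmd.append(str(value))
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("fastq.list", "anchors.tsv", "compactors.tsv",
                                 num_kmers=100, kmer_len=27, max_length=100000)))
```

The resulting list can be handed to subprocess.run(cmd, check=True), which avoids shell quoting issues with file names.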

The output of the program is a TSV table with compactors ordered by increasing number of segments, then alphabetically by anchor, and then by decreasing support. The table contains the following columns:

  • anchor - anchor sequence,
  • compactor - compactor sequence,
  • id - compactor numerical identifier,
  • parent_id - identifier of compactor's parent (-1 if no parent exists),
  • exact_support - how many times the compactor was observed in reads exactly (if the compactor was extended, it concerns the part of the compactor added in the last extension),
  • support - how many times the compactor was observed in reads up to max_mismatch errors (if the compactor was extended, it concerns the part of the compactor added in the last extension),
  • extender_specificity - how specific the last investigated extender is for the seed of the current compactor's segment (if specificity > min_extender_specificity, an extension is performed),
  • extender_shift - shift in bases of the last investigated extender from the compactor end,
  • total_length - total compactor length in bases,
  • num_extended - how many times the compactor was extended,
  • expected_read_count - expected number of compactor occurrences in the reads,
  • cumulated_id - comma-separated list of identifiers of all compactor segments (only with --cumulated_stats flag),
  • cumulated_exact_support - comma-separated list of exact_support of all compactor segments (only with --cumulated_stats flag),
  • cumulated_extender_specificity - comma-separated list of extender_specificity of all compactor segments (only with --cumulated_stats flag).

An optional FASTA file can also be created with the --out_fasta option.
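The output table can be post-processed with standard tools. The sketch below loads it with Python's csv module and follows parent_id links to recover the chain of compactors leading to a given one. It rests on two assumptions that are not confirmed here: that the TSV starts with a header row using the column names listed above, and that id values are unique within the file. The function names are hypothetical.

```python
import csv
from typing import Dict, List, TextIO

# Columns described above as counts or identifiers.
INT_COLUMNS = ("id", "parent_id", "exact_support", "support",
               "total_length", "num_extended")


def read_compactors(handle: TextIO) -> List[Dict]:
    """Read the TSV (assumed to have a header row) and convert integer columns."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in INT_COLUMNS:
            if row.get(col) not in (None, ""):
                row[col] = int(row[col])
        rows.append(row)
    return rows


def lineage(rows: List[Dict], compactor_id: int) -> List[int]:
    """Return ids from the root compactor down to compactor_id.

    Assumes ids are unique; stops at parent_id == -1 or an unknown id.
    """
    by_id = {row["id"]: row for row in rows}
    chain: List[int] = []
    current = compactor_id
    while current in by_id and current not in chain:  # guard against cycles
        chain.append(current)
        current = by_id[current]["parent_id"]
    return chain[::-1]


if __name__ == "__main__":
    with open("compactors.tsv", newline="") as handle:
        table = read_compactors(handle)
    print(len(table), "compactors loaded")
```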