🔴🟢🔵⚫️ monte barcode

Generating sets of random DNA sequences optimized for use in high-throughput sequencing.

Installation

The easy way

Install the pre-compiled version from PyPI:

pip install monte-barcode

From source

Clone the repository, then cd into it. Then run:

pip install -e .

Usage

monte barcode provides command-line utilities to generate completely random or peptide-encoding barcodes that conform to custom constraints, such as minimum edit distance within the set, GC content, and color balance for Illumina chemistry.

Barcode sets and individual barcodes are deterministically given an adjective-noun mnemonic (generated by nemony) for easy reference.

Each utility writes its commentary (parameters, rejection statistics, and a summary) to stderr, while the barcodes themselves go to stdout by default so they can be piped.

Command line

Generate random barcodes of a particular length.

$ monte barcode --length 6 -n 5
Generating barcodes with the following parameters:
       ...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 16 barcodes, rejected 11, accepted 5; rejection rate is 0.69

Rejection reasons:
        gc_content: 0.62
        homopolymer: 0.25
        restriction_sites: 0.06
mighty_orchid:l6-n5-d3:x0:fresh_prague  TGAGGT
mighty_orchid:l6-n5-d3:x1:flexible_forest       AGTTCG
mighty_orchid:l6-n5-d3:x2:fun_baby      GACATC
mighty_orchid:l6-n5-d3:x3:woolly_podium TGTCCT
mighty_orchid:l6-n5-d3:x4:strong_factor GAACCA
Wrote barcode set called mighty_orchid, with minimum Hamming distance 3 and maximum Hamming distance 6.
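
The rejection reasons in the commentary above can be pictured as a simple rejection-sampling loop. The following is a minimal sketch, not the package's own implementation: the function names are hypothetical, the thresholds are taken from the defaults listed under Details (GC content between 0.4 and 0.6, homopolymer runs of at most 3), and only two of the reported checks are modelled.

```python
import random
from collections import Counter
from itertools import groupby


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def rejection_reasons(seq, gc_min=0.4, gc_max=0.6, max_run=3):
    """Names of every (hypothetical) check that this sequence fails."""
    reasons = []
    if not gc_min <= gc_content(seq) <= gc_max:
        reasons.append("gc_content")
    if longest_homopolymer(seq) > max_run:
        reasons.append("homopolymer")
    return reasons


def sample_barcodes(length, n, seed=0, max_tries=10_000):
    """Draw uniform random sequences until n of them pass every check."""
    rng = random.Random(seed)
    accepted, tally, tries = [], Counter(), 0
    while len(accepted) < n and tries < max_tries:
        tries += 1
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        reasons = rejection_reasons(candidate)
        tally.update(reasons)  # one candidate may fail several checks
        if not reasons and candidate not in accepted:
            accepted.append(candidate)
    return accepted, tally, tries


print(4 ** 6)  # number of possible length-6 sequences: 4096
accepted, tally, tries = sample_barcodes(length=6, n=5)
print(accepted, dict(tally), tries)
```

Because a candidate can fail more than one check, the per-reason rates need not sum to the overall rejection rate.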

Or generate barcodes encoding a peptide.

$ monte barcode --amino-acid HELP -n 5
Generating barcodes with the following parameters:
        ...
Using amino acid sequence HELP with length 12 and 96 possible combinations.
> Tried 7 barcodes, rejected 2, accepted 5; rejection rate is 0.29

Rejection reasons:
        gc_content: 0.14
        homopolymer: 0.14
basic_hamlet:l12-n5-d2:x0:volatile_lesson       CATGAGCTGCCT
basic_hamlet:l12-n5-d2:x1:pricy_scuba   CACGAACTGCCT
basic_hamlet:l12-n5-d2:x2:good_race     CACGAATTGCCA
basic_hamlet:l12-n5-d2:x3:demanding_bruno       CATGAATTACCG
basic_hamlet:l12-n5-d2:x4:pawky_plaster CATGAGTTACCT
Wrote barcode set called basic_hamlet, with minimum Hamming distance 2 and maximum Hamming distance 4.
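
The figure of 96 possible combinations for HELP follows from the degeneracy of the standard genetic code: 2 codons for H, 2 for E, 6 for L, and 4 for P. The sketch below only illustrates that arithmetic; the table is hand-written for these four residues and the function names are hypothetical, not part of the package.

```python
from itertools import product
from math import prod

# Standard genetic code, restricted to the four residues in "HELP".
CODONS = {
    "H": ["CAT", "CAC"],
    "E": ["GAA", "GAG"],
    "L": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
}


def count_encodings(peptide):
    """Number of distinct DNA sequences that encode the peptide."""
    return prod(len(CODONS[residue]) for residue in peptide)


def all_encodings(peptide):
    """Yield every DNA sequence that encodes the peptide."""
    for codons in product(*(CODONS[residue] for residue in peptide)):
        yield "".join(codons)


print(count_encodings("HELP"))  # 2 * 2 * 6 * 4 = 96
print("CATGAGCTGCCT" in set(all_encodings("HELP")))  # True
```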

Insist on a minimum edit distance.

$ monte barcode --length 6 -n 10 -d 3
Generating barcodes with the following parameters:
       ...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 39 barcodes, rejected 29, accepted 10; rejection rate is 0.74

Rejection reasons:
        gc_content: 0.67
        distance: 0.13
        homopolymer: 0.05
scenic_blast:l6-n10-d3:x0:acidic_turtle TGTGTG
scenic_blast:l6-n10-d3:x1:rowdy_grace   ACCATC
scenic_blast:l6-n10-d3:x2:rich_export   CGTTAG
scenic_blast:l6-n10-d3:x3:unique_break  GGAATC
scenic_blast:l6-n10-d3:x4:careful_fuji  GCAAGT
scenic_blast:l6-n10-d3:x5:whimsical_derby       CGGAAT
scenic_blast:l6-n10-d3:x6:pricy_aloha   TTCTCC
scenic_blast:l6-n10-d3:x7:zestful_ricardo       AGAGCT
scenic_blast:l6-n10-d3:x8:terse_cobra   AAGTCC
scenic_blast:l6-n10-d3:x9:zany_chamber  TTACGG
Wrote barcode set called scenic_blast, with minimum Hamming distance 3 and maximum Hamming distance 6.
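
One way to picture the distance rejections is a greedy filter that keeps a candidate only if it is at least the requested distance from every barcode accepted so far. This is an illustrative sketch with hypothetical function names, using Hamming distance on equal-length sequences; it is not taken from the package source.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def filter_by_distance(candidates, min_distance):
    """Greedily keep candidates that are far enough from all kept so far."""
    accepted = []
    for candidate in candidates:
        if all(hamming(candidate, kept) >= min_distance for kept in accepted):
            accepted.append(candidate)
    return accepted


candidates = ["TGAGGT", "TGAGGA", "AGTTCG", "TGTCCT"]
# "TGAGGA" differs from "TGAGGT" at one position only, so it is dropped.
print(filter_by_distance(candidates, min_distance=3))
# ['TGAGGT', 'AGTTCG', 'TGTCCT']
```

Note that a greedy filter depends on the order of the candidates, so it does not necessarily find the largest possible set.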

Or insist on ideal color balance for Illumina chemistry.

$ monte barcode --length 6 -n 10 -d 3 --color
Generating barcodes with the following parameters:
        ...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 151 barcodes, rejected 141, accepted 10; rejection rate is 0.93

Rejection reasons:
        gc_content: 0.65
        homopolymer: 0.21
        color_balance: 0.72
        distance: 0.17
        palindrome: 0.02
bright_cliff:l6-n10-d3:x0:ultimate_spray        AGCGAT
bright_cliff:l6-n10-d3:x1:bulky_drama   AGTTGC
bright_cliff:l6-n10-d3:x2:tropical_pinball      TTCACG
bright_cliff:l6-n10-d3:x3:unique_info   GTACGT
bright_cliff:l6-n10-d3:x4:chilly_sahara CCTCTT
bright_cliff:l6-n10-d3:x5:novel_wisdom  GACCTA
bright_cliff:l6-n10-d3:x6:oceanic_plume AGACTG
bright_cliff:l6-n10-d3:x7:wanted_jessica        TCTCGA
bright_cliff:l6-n10-d3:x8:incise_radical        TCTGTC
bright_cliff:l6-n10-d3:x9:rebel_option  TAGGAC
Wrote barcode set called bright_cliff, with minimum Hamming distance 3 and maximum Hamming distance 6.

You can bias sampling toward a set of existing sequences. In this mode, each base is drawn conditional on the base that precedes it.

$ monte barcode -n 5 --amino-acid HELP | monte sample --field 2 --distance 2 -n 5
Generating barcodes with the following parameters:
        ...
Requested barcodes with length 12, and 16777216 possible combinations.
> Tried 12 barcodes, rejected 7, accepted 5; rejection rate is 0.58

Rejection reasons:
        distance: 0.58
        gc_content: 0.08
ritzy_parker:l12-n5-d2:x0:good_race     CACGAATTGCCA
ritzy_parker:l12-n5-d2:x1:wiry_cairo    CATGAACTACCA
ritzy_parker:l12-n5-d2:x2:pricy_scuba   CACGAACTGCCT
ritzy_parker:l12-n5-d2:x3:brisk_neptune CATGAATTGCCG
ritzy_parker:l12-n5-d2:x4:dextrous_frame        CACGAATTACCG
Wrote barcode set called ritzy_parker, with minimum Hamming distance 2 and maximum Hamming distance 3.
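
Conditioning each base on the previous one resembles a first-order Markov chain fitted to the input sequences. The sketch below shows that general idea under stated assumptions: the function names are hypothetical, the transition counts are pooled over all positions, and unseen contexts fall back to a uniform draw. Whether monte sample is position-specific or handles unseen contexts this way is not stated here.

```python
import random
from collections import Counter, defaultdict


def fit_first_order(sequences):
    """Count starting bases and base-to-next-base transitions."""
    start = Counter(seq[0] for seq in sequences)
    transitions = defaultdict(Counter)
    for seq in sequences:
        for previous, following in zip(seq, seq[1:]):
            transitions[previous][following] += 1
    return start, transitions


def sample_sequence(start, transitions, length, rng):
    """Draw one sequence, choosing each base given the one before it."""
    seq = rng.choices(list(start), weights=list(start.values()))[0]
    while len(seq) < length:
        counts = transitions.get(seq[-1])
        if counts:
            options, weights = zip(*counts.items())
            seq += rng.choices(options, weights=weights)[0]
        else:
            seq += rng.choice("ACGT")  # unseen context: uniform fallback
    return seq


training = ["CATGAGCTGCCT", "CACGAACTGCCT", "CACGAATTGCCA"]
start, transitions = fit_first_order(training)
rng = random.Random(0)
print([sample_sequence(start, transitions, 12, rng) for _ in range(3)])
```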

You can also check and filter previously generated sets.

$ monte barcode --length 6 -n 10 -d 3 2> /dev/null | monte check --color --field 2
Checking barcodes with the following parameters:
        ...
> Tried 10 barcodes, rejected 6, accepted 4; rejection rate is 0.60
Rejection reasons:
        color_balance: 0.60
Could only generate 4 barcodes, but 10 were requested. You might need to try different settings.
thorough_adam:l6-n4-d4:x0:savvy_ruby    TCCTGA
thorough_adam:l6-n4-d4:x1:elfin_rufus   AGCTTC
thorough_adam:l6-n4-d4:x2:damaged_atlas AAGGCA
thorough_adam:l6-n4-d4:x3:faded_elite   GCACTA
Wrote barcode set called thorough_adam, with minimum Hamming distance 4 and maximum Hamming distance 5.

Or use a previous set as a starting point for generating more, possibly with different parameters.

$ monte barcode -n10 --distance 4 --length 10  --append <(monte barcode -n 5 -a HELP) --append_field 2
Generating barcodes with the following parameters:
...
> Tried 32 barcodes, rejected 22, accepted 10; rejection rate is 0.69

Rejection reasons:
        gc_content: 0.44
        homopolymer: 0.47
        distance: 0.03
        palindrome: 0.03
elegant_triton:l12-n15-d1:x0:vocal_stand        CACGAACTTCCT
elegant_triton:l12-n15-d1:x1:real_clinic        CATGAATTGCCT
elegant_triton:l12-n15-d1:x2:dextrous_frame     CACGAATTACCG
elegant_triton:l12-n15-d1:x3:dizzy_record       CACGAATTACCT
elegant_triton:l12-n15-d1:x4:prudent_jester     CACGAGCTACCA
elegant_triton:l10-n15-d1:x5:useful_cabinet     ACGCGACACT
elegant_triton:l10-n15-d1:x6:deafening_sphere   TAATACGCGC
elegant_triton:l10-n15-d1:x7:old_program        ATCCTAAGCC
elegant_triton:l10-n15-d1:x8:eager_doctor       TTGGCCACTG
elegant_triton:l10-n15-d1:x9:dopey_limbo        ATCCGTCGTA
elegant_triton:l10-n15-d1:x10:plain_lunar       ACGAGAATTC
elegant_triton:l10-n15-d1:x11:discreet_ford     CTAACGTAGC
elegant_triton:l10-n15-d1:x12:proud_jet CTTCAGTGTC
elegant_triton:l10-n15-d1:x13:wry_insect        CAGACTGGAG
elegant_triton:l10-n15-d1:x14:lofty_shave       TTCGTAACTC
Wrote barcode set called elegant_triton, with minimum Hamming distance 1 and maximum Hamming distance 10.

Or sort a set by ideal color balance for Illumina chemistries, which is useful if you want to use subsets.

$ monte barcode --length 6 -n 15 -d 1 2> /dev/null | monte sort --field 2
Sorting barcodes with the following parameters:
        ...
round_mono:l6-n15-d2:x0:shady_soda      AGTCCT
round_mono:l6-n15-d2:x1:vogue_cosmos    TGAGTC
round_mono:l6-n15-d2:x2:upbeat_baboon   AACGGA
round_mono:l6-n15-d2:x3:sweet_octavia   CATCCT
round_mono:l6-n15-d2:x4:clean_copper    CCTTAG
round_mono:l6-n15-d2:x5:fabulous_partner        TCCTAG
round_mono:l6-n15-d2:x6:defiant_charlie GAACGA
round_mono:l6-n15-d2:x7:misty_miguel    GCATGA
round_mono:l6-n15-d2:x8:urgent_rodeo    ACTGTG
round_mono:l6-n15-d2:x9:injured_news    GAAGGT
round_mono:l6-n15-d2:x10:clear_public   TGAGAG
round_mono:l6-n15-d2:x11:seemly_satire  GATTGG
round_mono:l6-n15-d2:x12:exemplary_robert       TTCAGC
round_mono:l6-n15-d2:x13:nuclear_choice CATCAC
round_mono:l6-n15-d2:x14:discreet_shake GCATTG
Wrote barcode set called round_mono, with minimum Hamming distance 2 and maximum Hamming distance 6.

Details

usage: monte [-h] {barcode,check,sort,sample} ...

Generate random DNA barcodes conforming to contraints, or check sets of barcodes for their conformance.

optional arguments:
  -h, --help            show this help message and exit

Sub-commands:
  {barcode,check,sort,sample}
                        Use these commands to specify the action you want.
    barcode             Generate random barcodes.
    check               Check barcode list.
    sort                Sort barcode list for optimal color balance.
    sample              Generate barcode list by sampling nucleotides from an existing list of sequences.
usage: monte barcode [-h] [--length LENGTH] [--amino-acid AMINO_ACID] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE]
                     [--homopolymer HOMOPOLYMER] [--levenshtein] [--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --length LENGTH, -l LENGTH
                        Barcode length. Default: 12
  --amino-acid AMINO_ACID, -a AMINO_ACID
                        Generate barcodes encoding this amino acid sequence. Default: do not use.
  --number NUMBER, -n NUMBER
                        Number of barcodes to generate. Required.
  --rejection-rate REJECTION_RATE, -r REJECTION_RATE
                        Rate of rejection before aborting. Default: 0.85
  --append APPEND       File to take a list of barcodes to extend. Default: do not use
  --append_field APPEND_FIELD
                        Column name or number to take barcodes from for appending. Default: 1
  --distance DISTANCE, -d DISTANCE
                        Minimum distance between barcodes. Default: 1
  --homopolymer HOMOPOLYMER, -p HOMOPOLYMER
                        Maximum homopolymer length. Default: 3
  --levenshtein, -e     Use Levenshtein distance. Otherwise using Hamming diatnce. Default: False
  --color, -c           Check optimal Illumina color balance. Default: False
  --gc_min GC_MIN, -g GC_MIN
                        Minimum GC content. Default: 0.4
  --gc_max GC_MAX, -j GC_MAX
                        Maximum GC content. Default: 0.6
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT
usage: monte sample [-h] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein] [--color]
                    [--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]
                    [input]

positional arguments:
  input                 Input file. Default: STDIN.

optional arguments:
  -h, --help            show this help message and exit
  --number NUMBER, -n NUMBER
                        Number of barcodes to generate. Required.
  --rejection-rate REJECTION_RATE, -r REJECTION_RATE
                        Rate of rejection before aborting. Default: 0.85
  --append APPEND       File to take a list of barcodes to extend. Default: do not use
  --append_field APPEND_FIELD
                        Column name or number to take barcodes from for appending. Default: 1
  --distance DISTANCE, -d DISTANCE
                        Minimum distance between barcodes. Default: 1
  --homopolymer HOMOPOLYMER, -p HOMOPOLYMER
                        Maximum homopolymer length. Default: 3
  --levenshtein, -e     Use Levenshtein distance. Otherwise using Hamming diatnce. Default: False
  --color, -c           Check optimal Illumina color balance. Default: False
  --gc_min GC_MIN, -g GC_MIN
                        Minimum GC content. Default: 0.4
  --gc_max GC_MAX, -j GC_MAX
                        Maximum GC content. Default: 0.6
  --field FIELD, -f FIELD
                        Column name or number for barcode sequences. Default: 1
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT
usage: monte check [-h] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein]
                   [--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]
                   [input]

positional arguments:
  input                 Input file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --distance DISTANCE, -d DISTANCE
                        Minimum distance between barcodes. Default: 1
  --homopolymer HOMOPOLYMER, -p HOMOPOLYMER
                        Maximum homopolymer length. Default: 3
  --levenshtein, -e     Use Levenshtein distance. Otherwise using Hamming diatnce. Default: False
  --color, -c           Check optimal Illumina color balance. Default: False
  --gc_min GC_MIN, -g GC_MIN
                        Minimum GC content. Default: 0.4
  --gc_max GC_MAX, -j GC_MAX
                        Maximum GC content. Default: 0.6
  --field FIELD, -f FIELD
                        Column number for barcode sequences. Default: 1
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT
usage: monte sort [-h] [--field FIELD] [--output OUTPUT] [input]

positional arguments:
  input                 Input file. Default: STDIN.

options:
  -h, --help            show this help message and exit
  --field FIELD, -f FIELD
                        Column number for barcode sequences. Default: 1
  --output OUTPUT, -o OUTPUT
                        Output file. Default: STDOUT

Python API

monte-barcode can be imported into Python to generate and check barcodes in your own programs.

import montebarcode as mb

Generate random DNA sequences.

>>> for bc in mb.infinite_barcodes(length=20, check_used=False): 
...     print(bc)
...     break
... 
ATCAGTCGTCACACTAGTTA

Or peptide-encoding sequences.

>>> list(mb.codon_barcodes("L", ordered=True)) 
['CTT', 'CTC', 'CTA', 'CTG', 'TTA', 'TTG']

You can check the minimum and maximum distances among a set.

>>> mb.minmax_distance(['AAA', 'AAA'])
(0, 0)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAT'])
(1, 3)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAAT'], use_levenshtein=False)
(0, 3)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAAT'])
(1, 4)
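
The difference between the last two results comes down to how unequal lengths are treated. The sketch below reproduces the numbers above with two hand-written metrics; it assumes that the Hamming comparison simply stops at the end of the shorter sequence, which is consistent with the (0, 3) result but is an assumption rather than documented behaviour. The function names are hypothetical.

```python
from itertools import combinations


def hamming_truncated(a, b):
    """Mismatches over the overlapping prefix; extra bases are ignored."""
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute or match
            ))
        previous = current
    return previous[-1]


def minmax(sequences, metric):
    """Smallest and largest pairwise distance within a set."""
    distances = [metric(a, b) for a, b in combinations(sequences, 2)]
    return min(distances), max(distances)


barcodes = ["AAA", "TCG", "AAAT"]
print(minmax(barcodes, hamming_truncated))  # (0, 3)
print(minmax(barcodes, levenshtein))        # (1, 4)
```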

And get usage of each base at each position.

>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[0]['A']
0.25
>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[1]['G']
0
>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[2]['A']
0.5
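
For intuition, per-position base usage is just the fraction of barcodes carrying each base in each column. A minimal stand-alone sketch follows; the function name is hypothetical, and it returns floats throughout, so an unused base shows as 0.0.

```python
from collections import Counter


def positional_base_fractions(barcodes):
    """For each position, the fraction of barcodes carrying each base."""
    usage = []
    for column in zip(*barcodes):  # one tuple of bases per position
        counts = Counter(column)
        usage.append({base: counts[base] / len(column) for base in "ACGT"})
    return usage


usage = positional_base_fractions(["AAA", "TTT", "GCT", "CCA"])
print(usage[0]["A"])  # 0.25
print(usage[1]["G"])  # 0.0
print(usage[2]["A"])  # 0.5
```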

You can see whether adding a barcode to a set would throw off the Illumina color balance.

>>> mb.IlluminaColorBalance()('AAAT', ['TCGC', 'ACAG', 'TGGC', 'ATCG'])
True
>>> mb.IlluminaColorBalance()('AAAT', ['TCGC', 'CCAG', 'TGGC', 'ATCG'])
False

And run a suite of checks against a set of barcodes (or infinite stream), retrieving failure reasons, number of tries, and conforming barcode set.

>>> checks = [mb.Homopolymer(), mb.Palindrome()]
>>> mb.make_checks(['AAAAT', 'CCCGGG', 'ATCGCG', 'GCCGAT'], n=4, checks=checks, quiet=True)
(Counter({'homopolymer': 1, 'palindrome': 1}), 4, ['ATCGCG', 'GCCGAT'])
>>> mb.make_checks(['AAAAT', 'CCCGGG', 'ATCGCG', 'GCCGAT'], n=1, checks=checks, quiet=True)
(Counter({'homopolymer': 1, 'palindrome': 1}), 3, ['ATCGCG'])
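
The returned triple can be read as (failure counts, number of candidates tried, accepted barcodes). The sketch below reproduces the two results above with hand-written checks. It assumes that a homopolymer failure means a run longer than 3 bases and that "palindrome" means a sequence equal to its own reverse complement (CCCGGG is not a literal palindrome, but it is its own reverse complement). All names are hypothetical and this is not the package's code.

```python
from collections import Counter
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def has_long_homopolymer(seq, max_run=3):
    """True if any single base repeats more than max_run times in a row."""
    return max(len(list(run)) for _, run in groupby(seq)) > max_run


def is_revcomp_palindrome(seq):
    """True if the sequence equals its own reverse complement."""
    return seq == seq.translate(COMPLEMENT)[::-1]


# Each entry maps a reason name to a function returning True on failure.
CHECKS = {
    "homopolymer": has_long_homopolymer,
    "palindrome": is_revcomp_palindrome,
}


def run_checks(candidates, n):
    """Test candidates in order until n pass or the stream runs out."""
    failures, accepted, tries = Counter(), [], 0
    for seq in candidates:
        if len(accepted) >= n:
            break
        tries += 1
        failed = [name for name, fails in CHECKS.items() if fails(seq)]
        failures.update(failed)
        if not failed:
            accepted.append(seq)
    return failures, tries, accepted


candidates = ["AAAAT", "CCCGGG", "ATCGCG", "GCCGAT"]
print(run_checks(candidates, n=4))
print(run_checks(candidates, n=1))
```

With n=1 the loop stops as soon as the first conforming barcode is found, which is why only three candidates are tried.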

Documentation

Full API documentation is at ReadTheDocs.