Generating sets of random DNA sequences optimized for use in high-throughput sequencing.
Install the pre-compiled version from PyPI:
pip install monte-barcode
Alternatively, to install from source, clone the repository, cd into it, and run:
pip install -e .
monte-barcode provides command-line utilities to generate completely random or peptide-encoding barcodes that conform to custom constraints, such as minimum edit distance within the set, GC content, and color balance for Illumina chemistry.
Barcode sets and individual barcodes are deterministically given an adjective-noun mnemonic (generated by nemony) for easy reference.
Each utility writes a lot of commentary to stderr, but the barcodes go to stdout by default so they can be piped.
Generate random barcodes of a particular length.
$ monte barcode --length 6 -n 5
Generating barcodes with the following parameters:
...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 16 barcodes, rejected 11, accepted 5; rejection rate is 0.69
Rejection reasons:
gc_content: 0.62
homopolymer: 0.25
restriction_sites: 0.06
mighty_orchid:l6-n5-d3:x0:fresh_prague TGAGGT
mighty_orchid:l6-n5-d3:x1:flexible_forest AGTTCG
mighty_orchid:l6-n5-d3:x2:fun_baby GACATC
mighty_orchid:l6-n5-d3:x3:woolly_podium TGTCCT
mighty_orchid:l6-n5-d3:x4:strong_factor GAACCA
Wrote barcode set called mighty_orchid, with minimum Hamming distance 3 and maximum Hamming distance 6.
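The "Tried / rejected / accepted" summary above suggests a rejection-sampling loop: draw a random candidate, test it against each constraint, and keep it only if it passes. The following is a minimal sketch of that idea in plain Python, not monte-barcode's actual implementation. The function names, the order of the checks, and the choice to record only the first failing check per candidate are all assumptions made for illustration (the real tool appears to report every failing check, since its rejection fractions can sum to more than 1).

```python
import random
from collections import Counter
from itertools import groupby


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq):
    """Length of the longest homopolymer run, e.g. 3 for 'ATTTG'."""
    return max(len(list(group)) for _, group in groupby(seq))


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def sample_barcodes(n, length, min_distance=1, gc_range=(0.4, 0.6),
                    max_run=3, max_tries=10_000, seed=0):
    """Hypothetical rejection sampler (illustration only)."""
    rng = random.Random(seed)
    accepted, reasons, tries = [], Counter(), 0
    while len(accepted) < n and tries < max_tries:
        tries += 1
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        # Record only the first failing check for each candidate.
        if not gc_range[0] <= gc_content(candidate) <= gc_range[1]:
            reasons["gc_content"] += 1
        elif longest_run(candidate) > max_run:
            reasons["homopolymer"] += 1
        elif any(hamming(candidate, other) < min_distance for other in accepted):
            reasons["distance"] += 1
        else:
            accepted.append(candidate)
    return accepted, reasons, tries


barcodes, reasons, tries = sample_barcodes(n=5, length=6, min_distance=3)
print(f"Tried {tries}, rejected {sum(reasons.values())}, accepted {len(barcodes)}")
print(barcodes)
```

With length 6 and a GC window of 0.4 to 0.6, only sequences with exactly three G or C bases pass, which is 20 of every 64 random candidates. That is in line with GC content being the most common rejection reason in the output above.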
Or generate barcodes that encode a peptide.
$ monte barcode --amino-acid HELP -n 5
Generating barcodes with the following parameters:
...
Using amino acid sequence HELP with length 12 and 96 possible combinations.
> Tried 7 barcodes, rejected 2, accepted 5; rejection rate is 0.29
Rejection reasons:
gc_content: 0.14
homopolymer: 0.14
basic_hamlet:l12-n5-d2:x0:volatile_lesson CATGAGCTGCCT
basic_hamlet:l12-n5-d2:x1:pricy_scuba CACGAACTGCCT
basic_hamlet:l12-n5-d2:x2:good_race CACGAATTGCCA
basic_hamlet:l12-n5-d2:x3:demanding_bruno CATGAATTACCG
basic_hamlet:l12-n5-d2:x4:pawky_plaster CATGAGTTACCT
Wrote barcode set called basic_hamlet, with minimum Hamming distance 2 and maximum Hamming distance 4.
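The "96 possible combinations" reported for HELP follows from the degeneracy of the standard genetic code: H and E have two codons each, L has six, and P has four, so 2 × 2 × 6 × 4 = 96. The sketch below enumerates them in plain Python. It is an illustration only; the `CODONS` table and `encodings` helper are names made up for this example and are not part of monte-barcode.

```python
from itertools import product

# Standard genetic code, restricted to the four residues in "HELP".
CODONS = {
    "H": ["CAT", "CAC"],
    "E": ["GAA", "GAG"],
    "L": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
}


def encodings(peptide):
    """Yield every DNA sequence that translates to `peptide`."""
    for codons in product(*(CODONS[residue] for residue in peptide)):
        yield "".join(codons)


all_help = list(encodings("HELP"))
print(len(all_help))                # 96
print("CATGAGCTGCCT" in all_help)   # True (first barcode in the output above)
```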
Insist on a minimum edit distance.
$ monte barcode --length 6 -n 10 -d 3
Generating barcodes with the following parameters:
...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 39 barcodes, rejected 29, accepted 10; rejection rate is 0.74
Rejection reasons:
gc_content: 0.67
distance: 0.13
homopolymer: 0.05
scenic_blast:l6-n10-d3:x0:acidic_turtle TGTGTG
scenic_blast:l6-n10-d3:x1:rowdy_grace ACCATC
scenic_blast:l6-n10-d3:x2:rich_export CGTTAG
scenic_blast:l6-n10-d3:x3:unique_break GGAATC
scenic_blast:l6-n10-d3:x4:careful_fuji GCAAGT
scenic_blast:l6-n10-d3:x5:whimsical_derby CGGAAT
scenic_blast:l6-n10-d3:x6:pricy_aloha TTCTCC
scenic_blast:l6-n10-d3:x7:zestful_ricardo AGAGCT
scenic_blast:l6-n10-d3:x8:terse_cobra AAGTCC
scenic_blast:l6-n10-d3:x9:zany_chamber TTACGG
Wrote barcode set called scenic_blast, with minimum Hamming distance 3 and maximum Hamming distance 6.
Or insist on ideal color balance for Illumina chemistry.
$ monte barcode --length 6 -n 10 -d 3 --color
Generating barcodes with the following parameters:
...
Requested barcodes with length 6, and 4096 possible combinations.
> Tried 151 barcodes, rejected 141, accepted 10; rejection rate is 0.93
Rejection reasons:
gc_content: 0.65
homopolymer: 0.21
color_balance: 0.72
distance: 0.17
palindrome: 0.02
bright_cliff:l6-n10-d3:x0:ultimate_spray AGCGAT
bright_cliff:l6-n10-d3:x1:bulky_drama AGTTGC
bright_cliff:l6-n10-d3:x2:tropical_pinball TTCACG
bright_cliff:l6-n10-d3:x3:unique_info GTACGT
bright_cliff:l6-n10-d3:x4:chilly_sahara CCTCTT
bright_cliff:l6-n10-d3:x5:novel_wisdom GACCTA
bright_cliff:l6-n10-d3:x6:oceanic_plume AGACTG
bright_cliff:l6-n10-d3:x7:wanted_jessica TCTCGA
bright_cliff:l6-n10-d3:x8:incise_radical TCTGTC
bright_cliff:l6-n10-d3:x9:rebel_option TAGGAC
Wrote barcode set called bright_cliff, with minimum Hamming distance 3 and maximum Hamming distance 6.
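To get a feel for what color balance means, the sketch below computes, for each barcode position, the fraction of barcodes that would light up one of the two channels. It assumes four-channel Illumina chemistry, where A and C are read in one channel and G and T in the other. The 0.25 to 0.75 bounds are placeholders chosen for illustration, and the rule monte-barcode actually applies with --color may well differ, so treat this as a rough diagnostic rather than a reimplementation.

```python
# Assumption: four-channel chemistry, where A and C are read in one
# channel ("red") and G and T in the other ("green").
RED = set("AC")


def red_fraction_by_position(barcodes):
    """Fraction of barcodes showing a red-channel base at each position."""
    return [sum(base in RED for base in column) / len(column)
            for column in zip(*barcodes)]


def is_balanced(barcodes, low=0.25, high=0.75):
    """True if every position has both channels represented within the bounds.

    The default bounds are illustrative placeholders, not a published cutoff.
    """
    return all(low <= f <= high for f in red_fraction_by_position(barcodes))


# The bright_cliff set from the output above.
bright_cliff = ["AGCGAT", "AGTTGC", "TTCACG", "GTACGT", "CCTCTT",
                "GACCTA", "AGACTG", "TCTCGA", "TCTGTC", "TAGGAC"]
print(red_fraction_by_position(bright_cliff))  # [0.4, 0.5, 0.5, 0.6, 0.3, 0.5]
print(is_balanced(bright_cliff))               # True
print(is_balanced(["AAAA", "CCCC"]))           # False: no green signal anywhere
```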
You can bias sampling based on a set of other sequences. This sampling conditions the choice of each base on the previous base.
$ monte barcode -n 5 --amino-acid HELP | monte sample --field 2 --distance 2 -n 5
Generating barcodes with the following parameters:
...
Requested barcodes with length 12, and 16777216 possible combinations.
> Tried 12 barcodes, rejected 7, accepted 5; rejection rate is 0.58
Rejection reasons:
distance: 0.58
gc_content: 0.08
ritzy_parker:l12-n5-d2:x0:good_race CACGAATTGCCA
ritzy_parker:l12-n5-d2:x1:wiry_cairo CATGAACTACCA
ritzy_parker:l12-n5-d2:x2:pricy_scuba CACGAACTGCCT
ritzy_parker:l12-n5-d2:x3:brisk_neptune CATGAATTGCCG
ritzy_parker:l12-n5-d2:x4:dextrous_frame CACGAATTACCG
Wrote barcode set called ritzy_parker, with minimum Hamming distance 2 and maximum Hamming distance 3.
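Conditioning each base on the previous one amounts to a first-order Markov chain fitted to the input sequences. The sketch below shows one way this could work; it is not the code behind monte sample. The helper names, the uniform fallback for bases with no observed successor, and the fixed seed are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

UNIFORM = Counter("ACGT")  # fallback when a base was never followed by anything


def fit_transitions(sequences):
    """Count starting bases and base-to-next-base transitions."""
    starts = Counter(seq[0] for seq in sequences)
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return starts, transitions


def draw(rng, counts):
    """Draw one base with probability proportional to its count."""
    bases = sorted(counts)
    return rng.choices(bases, weights=[counts[b] for b in bases])[0]


def sample_sequence(rng, starts, transitions, length):
    """Sample a sequence base by base, conditioning on the previous base."""
    seq = [draw(rng, starts)]
    while len(seq) < length:
        seq.append(draw(rng, transitions.get(seq[-1]) or UNIFORM))
    return "".join(seq)


# Training set: the HELP-encoding barcodes from the earlier example.
training = ["CATGAGCTGCCT", "CACGAACTGCCT", "CACGAATTGCCA",
            "CATGAATTACCG", "CATGAGTTACCT"]
starts, transitions = fit_transitions(training)
rng = random.Random(0)
samples = [sample_sequence(rng, starts, transitions, 12) for _ in range(5)]
print(samples)
```

Because every training sequence starts with C, every sample does too, and each adjacent pair of bases in a sample is a pair that occurs somewhere in the training set.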
You can also check and filter previously generated sets.
$ monte barcode --length 6 -n 10 -d 3 2> /dev/null | monte check --color --field 2
Checking barcodes with the following parameters:
...
> Tried 10 barcodes, rejected 6, accepted 4; rejection rate is 0.60
Rejection reasons:
color_balance: 0.60
Could only generate 4 barcodes, but 10 were requested. You might need to try different settings.
thorough_adam:l6-n4-d4:x0:savvy_ruby TCCTGA
thorough_adam:l6-n4-d4:x1:elfin_rufus AGCTTC
thorough_adam:l6-n4-d4:x2:damaged_atlas AAGGCA
thorough_adam:l6-n4-d4:x3:faded_elite GCACTA
Wrote barcode set called thorough_adam, with minimum Hamming distance 4 and maximum Hamming distance 5.
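If you want to consume the output in your own scripts, each line can be split into a name and a sequence, which is why the piped examples pass --field 2. The sketch below parses one line. It assumes whitespace-delimited columns and reads the l, n, and d parts of the name as length, number of barcodes, and minimum distance, which is inferred from the examples rather than from a documented format. `BarcodeRecord` and `parse_line` are hypothetical names.

```python
from typing import NamedTuple


class BarcodeRecord(NamedTuple):
    set_name: str
    length: int
    number: int
    min_distance: int
    index: int
    mnemonic: str
    sequence: str


def parse_line(line):
    """Parse one output line of the form 'set:lL-nN-dD:xI:mnemonic SEQUENCE'."""
    name, sequence = line.split()
    set_name, params, index, mnemonic = name.split(":")
    # Each parameter is a one-letter tag followed by an integer, e.g. 'l6'.
    length, number, distance = (int(part[1:]) for part in params.split("-"))
    return BarcodeRecord(set_name, length, number, distance,
                         int(index[1:]), mnemonic, sequence)


record = parse_line("thorough_adam:l6-n4-d4:x0:savvy_ruby TCCTGA")
print(record.set_name, record.mnemonic, record.sequence)
```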
Or use a previous set as a starting point for generating more, possibly with different parameters.
$ monte barcode -n10 --distance 4 --length 10 --append <(monte barcode -n 5 -a HELP) --append_field 2
Generating barcodes with the following parameters:
...
> Tried 32 barcodes, rejected 22, accepted 10; rejection rate is 0.69
Rejection reasons:
gc_content: 0.44
homopolymer: 0.47
distance: 0.03
palindrome: 0.03
elegant_triton:l12-n15-d1:x0:vocal_stand CACGAACTTCCT
elegant_triton:l12-n15-d1:x1:real_clinic CATGAATTGCCT
elegant_triton:l12-n15-d1:x2:dextrous_frame CACGAATTACCG
elegant_triton:l12-n15-d1:x3:dizzy_record CACGAATTACCT
elegant_triton:l12-n15-d1:x4:prudent_jester CACGAGCTACCA
elegant_triton:l10-n15-d1:x5:useful_cabinet ACGCGACACT
elegant_triton:l10-n15-d1:x6:deafening_sphere TAATACGCGC
elegant_triton:l10-n15-d1:x7:old_program ATCCTAAGCC
elegant_triton:l10-n15-d1:x8:eager_doctor TTGGCCACTG
elegant_triton:l10-n15-d1:x9:dopey_limbo ATCCGTCGTA
elegant_triton:l10-n15-d1:x10:plain_lunar ACGAGAATTC
elegant_triton:l10-n15-d1:x11:discreet_ford CTAACGTAGC
elegant_triton:l10-n15-d1:x12:proud_jet CTTCAGTGTC
elegant_triton:l10-n15-d1:x13:wry_insect CAGACTGGAG
elegant_triton:l10-n15-d1:x14:lofty_shave TTCGTAACTC
Wrote barcode set called elegant_triton, with minimum Hamming distance 1 and maximum Hamming distance 10.
You can also sort a set for ideal color balance under Illumina chemistry, which is useful if you plan to use only a subset.
$ monte barcode --length 6 -n 15 -d 1 2> /dev/null | monte sort --field 2
Sorting barcodes with the following parameters:
...
round_mono:l6-n15-d2:x0:shady_soda AGTCCT
round_mono:l6-n15-d2:x1:vogue_cosmos TGAGTC
round_mono:l6-n15-d2:x2:upbeat_baboon AACGGA
round_mono:l6-n15-d2:x3:sweet_octavia CATCCT
round_mono:l6-n15-d2:x4:clean_copper CCTTAG
round_mono:l6-n15-d2:x5:fabulous_partner TCCTAG
round_mono:l6-n15-d2:x6:defiant_charlie GAACGA
round_mono:l6-n15-d2:x7:misty_miguel GCATGA
round_mono:l6-n15-d2:x8:urgent_rodeo ACTGTG
round_mono:l6-n15-d2:x9:injured_news GAAGGT
round_mono:l6-n15-d2:x10:clear_public TGAGAG
round_mono:l6-n15-d2:x11:seemly_satire GATTGG
round_mono:l6-n15-d2:x12:exemplary_robert TTCAGC
round_mono:l6-n15-d2:x13:nuclear_choice CATCAC
round_mono:l6-n15-d2:x14:discreet_shake GCATTG
Wrote barcode set called round_mono, with minimum Hamming distance 2 and maximum Hamming distance 6.
usage: monte [-h] {barcode,check,sort,sample} ...
Generate random DNA barcodes conforming to constraints, or check sets of barcodes for their conformance.
optional arguments:
-h, --help show this help message and exit
Sub-commands:
{barcode,check,sort,sample}
Use these commands to specify the action you want.
barcode Generate random barcodes.
check Check barcode list.
sort Sort barcode list for optimal color balance.
sample Generate barcode list by sampling nucleotides from an existing list of sequences.
usage: monte barcode [-h] [--length LENGTH] [--amino-acid AMINO_ACID] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE]
[--homopolymer HOMOPOLYMER] [--levenshtein] [--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--output OUTPUT]
optional arguments:
-h, --help show this help message and exit
--length LENGTH, -l LENGTH
Barcode length. Default: 12
--amino-acid AMINO_ACID, -a AMINO_ACID
Generate barcodes encoding this amino acid sequence. Default: do not use.
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--append APPEND File to take a list of barcodes to extend. Default: do not use
--append_field APPEND_FIELD
Column name or number to take barcodes from for appending. Default: 1
--distance DISTANCE, -d DISTANCE
Minimum distance between barcodes. Default: 1
--homopolymer HOMOPOLYMER, -p HOMOPOLYMER
Maximum homopolymer length. Default: 3
--levenshtein, -e Use Levenshtein distance. Otherwise using Hamming distance. Default: False
--color, -c Check optimal Illumina color balance. Default: False
--gc_min GC_MIN, -g GC_MIN
Minimum GC content. Default: 0.4
--gc_max GC_MAX, -j GC_MAX
Maximum GC content. Default: 0.6
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
usage: monte sample [-h] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein] [--color]
[--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]
[input]
positional arguments:
input Input file. Default: STDIN.
optional arguments:
-h, --help show this help message and exit
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--append APPEND File to take a list of barcodes to extend. Default: do not use
--append_field APPEND_FIELD
Column name or number to take barcodes from for appending. Default: 1
--distance DISTANCE, -d DISTANCE
Minimum distance between barcodes. Default: 1
--homopolymer HOMOPOLYMER, -p HOMOPOLYMER
Maximum homopolymer length. Default: 3
--levenshtein, -e Use Levenshtein distance. Otherwise using Hamming distance. Default: False
--color, -c Check optimal Illumina color balance. Default: False
--gc_min GC_MIN, -g GC_MIN
Minimum GC content. Default: 0.4
--gc_max GC_MAX, -j GC_MAX
Maximum GC content. Default: 0.6
--field FIELD, -f FIELD
Column name or number for barcode sequences. Default: 1
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
usage: monte check [-h] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein]
[--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]
[input]
positional arguments:
input Input file. Default: STDIN.
options:
-h, --help show this help message and exit
--distance DISTANCE, -d DISTANCE
Minimum distance between barcodes. Default: 1
--homopolymer HOMOPOLYMER, -p HOMOPOLYMER
Maximum homopolymer length. Default: 3
--levenshtein, -e Use Levenshtein distance. Otherwise using Hamming distance. Default: False
--color, -c Check optimal Illumina color balance. Default: False
--gc_min GC_MIN, -g GC_MIN
Minimum GC content. Default: 0.4
--gc_max GC_MAX, -j GC_MAX
Maximum GC content. Default: 0.6
--field FIELD, -f FIELD
Column number for barcode sequences. Default: 1
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
usage: monte sort [-h] [--field FIELD] [--output OUTPUT] [input]
positional arguments:
input Input file. Default: STDIN.
options:
-h, --help show this help message and exit
--field FIELD, -f FIELD
Column number for barcode sequences. Default: 1
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
monte-barcode can be imported into Python to generate and check barcodes in your own programs.
import montebarcode as mb
Generate random DNA sequences.
>>> for bc in mb.infinite_barcodes(length=20, check_used=False):
... print(bc)
... break
...
ATCAGTCGTCACACTAGTTA
Or peptide-encoding sequences.
>>> list(mb.codon_barcodes("L", ordered=True))
['CTT', 'CTC', 'CTA', 'CTG', 'TTA', 'TTG']
You can check the minimum and maximum distances among a set.
>>> mb.minmax_distance(['AAA', 'AAA'])
(0, 0)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAT'])
(1, 3)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAAT'], use_levenshtein=False)
(0, 3)
>>> mb.minmax_distance(['AAA', 'TCG', 'AAAT'])
(1, 4)
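The last two calls differ because Hamming distance only counts mismatches position by position, while Levenshtein distance also allows insertions and deletions. The sketch below reimplements both in plain Python to show where the numbers come from. It is an independent illustration, and the truncating behavior of `hamming` for unequal lengths is an assumption chosen because it reproduces the (0, 3) result above.

```python
from itertools import combinations


def hamming(a, b):
    """Mismatches over the shared prefix length (extra bases are ignored)."""
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,            # delete x
                               current[j - 1] + 1,         # insert y
                               previous[j - 1] + (x != y)))  # substitute
        previous = current
    return previous[-1]


def minmax(seqs, distance):
    """Smallest and largest pairwise distance within a set."""
    distances = [distance(a, b) for a, b in combinations(seqs, 2)]
    return min(distances), max(distances)


seqs = ["AAA", "TCG", "AAAT"]
print(minmax(seqs, hamming))      # (0, 3)
print(minmax(seqs, levenshtein))  # (1, 4)
```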
And get usage of each base at each position.
>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[0]['A']
0.25
>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[1]['G']
0
>>> mb.base_usage(['AAA', 'TTT', 'GCT', 'CCA'])[2]['A']
0.5
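Per-position base usage is simply the frequency of each nucleotide in each column of the set. A minimal plain-Python equivalent is sketched below; `position_usage` is a made-up name to keep it distinct from the library function, and the library's return type may differ.

```python
from collections import Counter


def position_usage(barcodes):
    """For each position, map every base to its frequency across the set."""
    usage = []
    for column in zip(*barcodes):
        counts = Counter(column)
        usage.append({base: counts[base] / len(column) for base in "ACGT"})
    return usage


usage = position_usage(["AAA", "TTT", "GCT", "CCA"])
print(usage[0]["A"])  # 0.25
print(usage[2]["A"])  # 0.5
```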
You can see whether adding a barcode to a set would throw off the Illumina color balance.
>>> mb.IlluminaColorBalance()('AAAT', ['TCGC', 'ACAG', 'TGGC', 'ATCG'])
True
>>> mb.IlluminaColorBalance()('AAAT', ['TCGC', 'CCAG', 'TGGC', 'ATCG'])
False
And run a suite of checks against a set of barcodes (or an infinite stream), retrieving the failure reasons, the number of tries, and the conforming barcode set.
>>> checks = [mb.Homopolymer(), mb.Palindrome()]
>>> mb.make_checks(['AAAAT', 'CCCGGG', 'ATCGCG', 'GCCGAT'], n=4, checks=checks, quiet=True)
(Counter({'homopolymer': 1, 'palindrome': 1}), 4, ['ATCGCG', 'GCCGAT'])
>>> mb.make_checks(['AAAAT', 'CCCGGG', 'ATCGCG', 'GCCGAT'], n=1, checks=checks, quiet=True)
(Counter({'homopolymer': 1, 'palindrome': 1}), 3, ['ATCGCG'])
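To make the two checks used above concrete, here is a plain-Python sketch of what they might test. It assumes that the homopolymer check rejects runs longer than three identical bases and that the palindrome check rejects sequences equal to their own reverse complement. Both are guesses that happen to reproduce the counts from the first call above; the library's real definitions may be broader. All names here are hypothetical.

```python
from collections import Counter
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def has_long_homopolymer(seq, max_run=3):
    """True if any run of identical bases is longer than max_run."""
    return any(len(list(group)) > max_run for _, group in groupby(seq))


def is_palindrome(seq):
    """True if the sequence equals its own reverse complement."""
    return seq == seq.translate(COMPLEMENT)[::-1]


def run_checks(barcodes, checks):
    """Return failure counts per check and the barcodes that pass them all."""
    failures, passed = Counter(), []
    for seq in barcodes:
        failed = [name for name, check in checks.items() if check(seq)]
        failures.update(failed)
        if not failed:
            passed.append(seq)
    return failures, passed


checks = {"homopolymer": has_long_homopolymer, "palindrome": is_palindrome}
failures, passed = run_checks(["AAAAT", "CCCGGG", "ATCGCG", "GCCGAT"], checks)
print(failures)  # Counter({'homopolymer': 1, 'palindrome': 1})
print(passed)    # ['ATCGCG', 'GCCGAT']
```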
Full API documentation is at ReadTheDocs.