Merge pull request #1 from scbirlab/sample-from-existing
Sample from existing list and extend existing list
eachanjohnson authored Jun 2, 2023
2 parents bd54bb9 + 284bbb4 commit bf71471
Showing 9 changed files with 480 additions and 143 deletions.
16 changes: 10 additions & 6 deletions .github/workflows/python-package.yml
@@ -3,11 +3,7 @@

name: Python package

on:
push:
branches: [ $default-branch ]
pull_request:
branches: [ $default-branch ]
on: [push]

jobs:
build:
@@ -29,6 +25,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip install flake8 pytest pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install -e .
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
@@ -37,4 +34,11 @@ jobs:
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest montebarcode --doctest-modules --junitxml=tests/test-results.xml --cov=com --cov-report=xml --cov-report=html
pytest montebarcode --doctest-modules --junitxml=tests/test-results.xml --cov=com --cov-report=xml --cov-report=html
- name: Upload pytest test results
uses: actions/upload-artifact@v3
with:
name: pytest-results-${{ matrix.python-version }}
path: junit/test-results-${{ matrix.python-version }}.xml
# Use always() to always run this step to publish test results when there are test failures
if: ${{ always() }}
17 changes: 4 additions & 13 deletions .gitignore
@@ -2,25 +2,16 @@
*.so
*.egg-info
*.whl
/build/lib
/build/bazel*
/dist/
.ipynb_checkpoints
/bazel-*
.jax_configure.bazelrc
/tensorflow
.DS_Store
.mypy_cache/
.pytype/
/docs/build
*_pb2.py
/docs/notebooks/.ipynb_checkpoints/
/docs/_autosummary
docs/build
docs/_autosummary
.idea
.vscode
.envrc
jax.iml
.bazelrc.user
__pycache__
.pytest_cache

# virtualenv/venv directories
/venv/
127 changes: 115 additions & 12 deletions README.md
@@ -1,5 +1,6 @@
# 🔴🟢🔵⚫️ monte barcode

![GitHub Workflow Status (with branch)](https://img.shields.io/github/actions/workflow/status/scbirlab/monte-barcode/python-publish.yml)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/monte-barcode)
![PyPI](https://img.shields.io/pypi/v/monte-barcode)

@@ -32,7 +33,7 @@ distance among the set, GC content, and color balance for Illumina chemistry.
Barcode sets and individual barcodes are deterministically given an adjective-noun mnemonic
(generated by [nemony](https://github.com/scbirlab/nemony)) for easy reference.

Each utility gives a lot of commentary to `stderr`, but the barcodes go to
Each utility writes a lot of commentary to `stderr`, but the barcodes go to
`stdout` by default so they can be piped.

### Command line
@@ -133,6 +134,27 @@ bright_cliff:l6-n10-d3:x9:rebel_option TAGGAC
Wrote barcode set called bright_cliff, with minimum Hamming distance 3 and maximum Hamming distance 6.
```

You can bias sampling based on a set of other sequences. This sampling conditions the choice of each base
on the previous base.

```bash
$ monte barcode -n 5 --amino-acid HELP | monte sample --field 2 --distance 2 -n 5
Generating barcodes with the following parameters:
...
Requested barcodes with length 12, and 16777216 possible combinations.
> Tried 12 barcodes, rejected 7, accepted 5; rejection rate is 0.58

Rejection reasons:
distance: 0.58
gc_content: 0.08
ritzy_parker:l12-n5-d2:x0:good_race CACGAATTGCCA
ritzy_parker:l12-n5-d2:x1:wiry_cairo CATGAACTACCA
ritzy_parker:l12-n5-d2:x2:pricy_scuba CACGAACTGCCT
ritzy_parker:l12-n5-d2:x3:brisk_neptune CATGAATTGCCG
ritzy_parker:l12-n5-d2:x4:dextrous_frame CACGAATTACCG
Wrote barcode set called ritzy_parker, with minimum Hamming distance 2 and maximum Hamming distance 3.
```
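
The idea of conditioning each base on the previous one can be illustrated with a first-order Markov chain. This is a minimal sketch and not the `monte sample` implementation: the function names `fit_transitions` and `sample_sequence` are hypothetical, the seed list is copied from the output above, and the sketch ignores position within the barcode, which the real tool may or may not do.

```python
import random
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Count first-base frequencies and base-to-next-base transitions."""
    starts = Counter(seq[0] for seq in sequences if seq)
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return starts, transitions


def sample_sequence(starts, transitions, length, rng):
    """Draw one sequence, choosing each base conditioned on the previous one."""
    bases, weights = zip(*sorted(starts.items()))
    seq = [rng.choices(bases, weights=weights)[0]]
    while len(seq) < length:
        options = transitions.get(seq[-1])
        if not options:
            # Assumption: if a base was never followed by anything in the
            # input, fall back to a uniform choice.
            seq.append(rng.choice("ACGT"))
            continue
        bases, weights = zip(*sorted(options.items()))
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq)


# Seed sequences taken from the example output above.
seeds = ["CACGAATTGCCA", "CATGAACTACCA", "CACGAACTGCCT"]
starts, transitions = fit_transitions(seeds)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(sample_sequence(starts, transitions, 12, rng))
```

Every adjacent pair of bases in a sampled sequence was observed somewhere in the seeds, which is why the samples resemble the input without copying it.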

You can also check and filter previously generated sets.

```bash
@@ -151,8 +173,38 @@ Wrote barcode set called thorough_adam, with minimum Hamming distance 4 and maxi

```

And try to sort by ideal color balance for Illumina chemistries (if you want to use subsets).
Or use a previous set as a starting point for generating more, possibly with different parameters.

```bash
$ monte barcode -n10 --distance 4 --length 10 --append <(monte barcode -n 5 -a HELP) --append_field 2
Generating barcodes with the following parameters:
...
> Tried 32 barcodes, rejected 22, accepted 10; rejection rate is 0.69

Rejection reasons:
gc_content: 0.44
homopolymer: 0.47
distance: 0.03
palindrome: 0.03
elegant_triton:l12-n15-d1:x0:vocal_stand CACGAACTTCCT
elegant_triton:l12-n15-d1:x1:real_clinic CATGAATTGCCT
elegant_triton:l12-n15-d1:x2:dextrous_frame CACGAATTACCG
elegant_triton:l12-n15-d1:x3:dizzy_record CACGAATTACCT
elegant_triton:l12-n15-d1:x4:prudent_jester CACGAGCTACCA
elegant_triton:l10-n15-d1:x5:useful_cabinet ACGCGACACT
elegant_triton:l10-n15-d1:x6:deafening_sphere TAATACGCGC
elegant_triton:l10-n15-d1:x7:old_program ATCCTAAGCC
elegant_triton:l10-n15-d1:x8:eager_doctor TTGGCCACTG
elegant_triton:l10-n15-d1:x9:dopey_limbo ATCCGTCGTA
elegant_triton:l10-n15-d1:x10:plain_lunar ACGAGAATTC
elegant_triton:l10-n15-d1:x11:discreet_ford CTAACGTAGC
elegant_triton:l10-n15-d1:x12:proud_jet CTTCAGTGTC
elegant_triton:l10-n15-d1:x13:wry_insect CAGACTGGAG
elegant_triton:l10-n15-d1:x14:lofty_shave TTCGTAACTC
Wrote barcode set called elegant_triton, with minimum Hamming distance 1 and maximum Hamming distance 10.
```
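
Extending an existing set can be pictured as rejection sampling in which every candidate is compared against both the seed barcodes and the ones accepted so far. The sketch below is an illustration under stated assumptions, not the code behind `--append`: `extend_set` and its `max_tries` cap are hypothetical, only the distance filter is applied, and seeds and new barcodes are assumed to share one length so that Hamming distance is defined.

```python
import random


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def extend_set(existing, n_new, length, min_distance, rng, max_tries=10_000):
    """Add n_new random barcodes that keep min_distance from all others."""
    accepted = list(existing)
    tries = rejected = 0
    while len(accepted) < len(existing) + n_new:
        if tries >= max_tries:
            raise RuntimeError("gave up before finding enough barcodes")
        tries += 1
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        # Compare against the seeds and every barcode accepted so far.
        if all(hamming(candidate, other) >= min_distance for other in accepted):
            accepted.append(candidate)
        else:
            rejected += 1
    return accepted, tries, rejected


# Two seed barcodes taken from the example output above.
seeds = ["CACGAACTTCCT", "CATGAATTGCCT"]
extended, tries, rejected = extend_set(seeds, 3, 12, 4, random.Random(0))
print(extended)
print(f"rejection rate: {rejected / tries:.2f}")
```

The seeds are kept untouched at the front of the list, mirroring the example output, where the appended barcodes come first and the new ones follow.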

And try to sort by ideal color balance for Illumina chemistries (if you want to use subsets).

```bash
$ monte barcode --length 6 -n 15 -d 1 2> /dev/null | monte sort --field 2
@@ -179,21 +231,70 @@ Wrote barcode set called round_mono, with minimum Hamming distance 2 and maximum
#### Details

```bash
usage: monte barcode [-h] --number NUMBER [--length LENGTH] [--rejection-rate REJECTION_RATE]
[--amino-acid AMINO_ACID] [--distance DISTANCE] [--homopolymer HOMOPOLYMER]
[--levenshtein] [--color] [--gc_min GC_MIN] [--gc_max GC_MAX]
[--output OUTPUT]
usage: monte [-h] {barcode,check,sort,sample} ...

options:
Generate random DNA barcodes conforming to contraints, or check sets of barcodes for their conformance.

optional arguments:
-h, --help show this help message and exit

Sub-commands:
{barcode,check,sort,sample}
Use these commands to specify the action you want.
barcode Generate random barcodes.
check Check barcode list.
sort Sort barcode list for optimal color balance.
sample Generate barcode list by sampling nucleotides from an existing list of sequences.
```

```bash
usage: monte barcode [-h] [--length LENGTH] [--amino-acid AMINO_ACID] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE]
[--homopolymer HOMOPOLYMER] [--levenshtein] [--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--output OUTPUT]

optional arguments:
-h, --help show this help message and exit
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--length LENGTH, -l LENGTH
Barcode length. Default: 12
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--amino-acid AMINO_ACID, -a AMINO_ACID
Generate barcodes encoding this amino acid sequence. Default: do not use.
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--append APPEND File to take a list of barcodes to extend. Default: do not use
--append_field APPEND_FIELD
Column name or number to take barcodes from for appending. Default: 1
--distance DISTANCE, -d DISTANCE
Minimum distance between barcodes. Default: 1
--homopolymer HOMOPOLYMER, -p HOMOPOLYMER
Maximum homopolymer length. Default: 3
--levenshtein, -e Use Levenshtein distance. Otherwise using Hamming diatnce. Default: False
--color, -c Check optimal Illumina color balance. Default: False
--gc_min GC_MIN, -g GC_MIN
Minimum GC content. Default: 0.4
--gc_max GC_MAX, -j GC_MAX
Maximum GC content. Default: 0.6
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
```
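
To make the filter options concrete, here is a minimal sketch of GC content, homopolymer length and Hamming distance using the defaults listed above. The helper names are hypothetical, and treating the GC bounds as inclusive is an assumption; this is not the package's own code.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def passes_filters(seq, gc_min=0.4, gc_max=0.6, homopolymer=3):
    """Apply the GC and homopolymer limits (bounds assumed inclusive)."""
    return (gc_min <= gc_content(seq) <= gc_max
            and longest_homopolymer(seq) <= homopolymer)


print(gc_content("TAGGAC"))                     # 3 of 6 bases are G or C
print(longest_homopolymer("AAAATTGC"))          # the AAAA run
print(passes_filters("TAGGAC"))
print(hamming("CACGAATTGCCA", "CATGAACTACCA"))  # two barcodes from ritzy_parker
```

The count of possible combinations reported in the commentary follows from the length alone: a 12-base barcode has `4 ** 12 == 16777216` possible sequences.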

```bash
usage: monte sample [-h] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein] [--color]
[--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]
[input]

positional arguments:
input Input file. Default: STDIN.

optional arguments:
-h, --help show this help message and exit
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--append APPEND File to take a list of barcodes to extend. Default: do not use
--append_field APPEND_FIELD
Column name or number to take barcodes from for appending. Default: 1
--distance DISTANCE, -d DISTANCE
Minimum distance between barcodes. Default: 1
--homopolymer HOMOPOLYMER, -p HOMOPOLYMER
@@ -204,6 +305,8 @@ options:
Minimum GC content. Default: 0.4
--gc_max GC_MAX, -j GC_MAX
Maximum GC content. Default: 0.6
--field FIELD, -f FIELD
Column name or number for barcode sequences. Default: 1
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
```
@@ -320,4 +423,4 @@ retrieving failure reasons, number of tries, and conforming barcode set.

### Documentation

Full API documentation is [here](https://monte-barcode.readthedocs.org).
Full API documentation is at [ReadTheDocs](https://monte-barcode.readthedocs.org).
101 changes: 91 additions & 10 deletions docs/source/usage.md
@@ -126,8 +126,38 @@ Wrote barcode set called thorough_adam, with minimum Hamming distance 4 and maxi

```

And try to sort by ideal color balance for Illumina chemistries (if you want to use subsets).
Or use a previous set as a starting point for generating more, possibly with different parameters.

```bash
$ monte barcode -n10 --distance 4 --length 10 --append <(monte barcode -n 5 -a HELP) --append_field 2
Generating barcodes with the following parameters:
...
> Tried 32 barcodes, rejected 22, accepted 10; rejection rate is 0.69

Rejection reasons:
gc_content: 0.44
homopolymer: 0.47
distance: 0.03
palindrome: 0.03
elegant_triton:l12-n15-d1:x0:vocal_stand CACGAACTTCCT
elegant_triton:l12-n15-d1:x1:real_clinic CATGAATTGCCT
elegant_triton:l12-n15-d1:x2:dextrous_frame CACGAATTACCG
elegant_triton:l12-n15-d1:x3:dizzy_record CACGAATTACCT
elegant_triton:l12-n15-d1:x4:prudent_jester CACGAGCTACCA
elegant_triton:l10-n15-d1:x5:useful_cabinet ACGCGACACT
elegant_triton:l10-n15-d1:x6:deafening_sphere TAATACGCGC
elegant_triton:l10-n15-d1:x7:old_program ATCCTAAGCC
elegant_triton:l10-n15-d1:x8:eager_doctor TTGGCCACTG
elegant_triton:l10-n15-d1:x9:dopey_limbo ATCCGTCGTA
elegant_triton:l10-n15-d1:x10:plain_lunar ACGAGAATTC
elegant_triton:l10-n15-d1:x11:discreet_ford CTAACGTAGC
elegant_triton:l10-n15-d1:x12:proud_jet CTTCAGTGTC
elegant_triton:l10-n15-d1:x13:wry_insect CAGACTGGAG
elegant_triton:l10-n15-d1:x14:lofty_shave TTCGTAACTC
Wrote barcode set called elegant_triton, with minimum Hamming distance 1 and maximum Hamming distance 10.
```
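
The five 12-nucleotide barcodes at the top of the output above appear to be the appended set generated with `-a HELP`, so each should translate to the peptide HELP. The check below is a small sketch using the standard genetic code; `CODON_TABLE` and `translate` are hypothetical names and are independent of how the package handles amino acids.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA sequence codon by codon, ignoring trailing bases."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


# The 12-nucleotide barcodes from the example output above.
appended = [
    "CACGAACTTCCT",
    "CATGAATTGCCT",
    "CACGAATTACCG",
    "CACGAATTACCT",
    "CACGAGCTACCA",
]
for barcode in appended:
    print(barcode, translate(barcode))
```

All five differ at the nucleotide level only through synonymous codons, which is what lets a set encode one peptide while still keeping some distance between barcodes.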

And try to sort by ideal color balance for Illumina chemistries (if you want to use subsets).

```bash
$ monte barcode --length 6 -n 15 -d 1 2> /dev/null | monte sort --field 2
@@ -154,21 +184,39 @@ Wrote barcode set called round_mono, with minimum Hamming distance 2 and maximum
### Details

```bash
usage: monte barcode [-h] --number NUMBER [--length LENGTH] [--rejection-rate REJECTION_RATE]
[--amino-acid AMINO_ACID] [--distance DISTANCE] [--homopolymer HOMOPOLYMER]
[--levenshtein] [--color] [--gc_min GC_MIN] [--gc_max GC_MAX]
[--output OUTPUT]
usage: monte [-h] {barcode,check,sort,sample} ...

options:
Generate random DNA barcodes conforming to contraints, or check sets of barcodes for their conformance.

optional arguments:
-h, --help show this help message and exit

Sub-commands:
{barcode,check,sort,sample}
Use these commands to specify the action you want.
barcode Generate random barcodes.
check Check barcode list.
sort Sort barcode list for optimal color balance.
sample Generate barcode list by sampling nucleotides from an existing list of sequences.
```

```bash
usage: monte barcode [-h] [--length LENGTH] [--amino-acid AMINO_ACID] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE]
[--homopolymer HOMOPOLYMER] [--levenshtein] [--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--output OUTPUT]

optional arguments:
-h, --help show this help message and exit
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--length LENGTH, -l LENGTH
Barcode length. Default: 12
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--amino-acid AMINO_ACID, -a AMINO_ACID
Generate barcodes encoding this amino acid sequence. Default: do not use.
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--append APPEND File to take a list of barcodes to extend. Default: do not use
--append_field APPEND_FIELD
Column name or number to take barcodes from for appending. Default: 1
--distance DISTANCE, -d DISTANCE
Minimum distance between barcodes. Default: 1
--homopolymer HOMOPOLYMER, -p HOMOPOLYMER
@@ -183,6 +231,39 @@ options:
Output file. Default: STDOUT
```

```bash
usage: monte sample [-h] --number NUMBER [--rejection-rate REJECTION_RATE] [--append APPEND] [--append_field APPEND_FIELD] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein] [--color]
[--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]
[input]

positional arguments:
input Input file. Default: STDIN.

optional arguments:
-h, --help show this help message and exit
--number NUMBER, -n NUMBER
Number of barcodes to generate. Required.
--rejection-rate REJECTION_RATE, -r REJECTION_RATE
Rate of rejection before aborting. Default: 0.85
--append APPEND File to take a list of barcodes to extend. Default: do not use
--append_field APPEND_FIELD
Column name or number to take barcodes from for appending. Default: 1
--distance DISTANCE, -d DISTANCE
Minimum distance between barcodes. Default: 1
--homopolymer HOMOPOLYMER, -p HOMOPOLYMER
Maximum homopolymer length. Default: 3
--levenshtein, -e Use Levenshtein distance. Otherwise using Hamming diatnce. Default: False
--color, -c Check optimal Illumina color balance. Default: False
--gc_min GC_MIN, -g GC_MIN
Minimum GC content. Default: 0.4
--gc_max GC_MAX, -j GC_MAX
Maximum GC content. Default: 0.6
--field FIELD, -f FIELD
Column name or number for barcode sequences. Default: 1
--output OUTPUT, -o OUTPUT
Output file. Default: STDOUT
```

```bash
usage: monte check [-h] [--distance DISTANCE] [--homopolymer HOMOPOLYMER] [--levenshtein]
[--color] [--gc_min GC_MIN] [--gc_max GC_MAX] [--field FIELD] [--output OUTPUT]