Skip to content

Latest commit

Β 

History

History
274 lines (195 loc) Β· 17.2 KB

README.md

File metadata and controls

274 lines (195 loc) Β· 17.2 KB

🌱 crispio

GitHub Workflow Status (with branch) PyPI - Python Version PyPI

Command-line utilities and Python API for designing CRISPRi experiments in bacteria.

crispio makes it easy to design annotated and systematically named libraries of guide RNAs. Alternatively, crispio can map a FASTA of existing guides to a genome.

Hint: If you have a table of guide RNAs from the literature that you want to annotate with genomic features, crispio is your tool. Use bioino table2fasta to convert the table to a FASTA file, then use crispio map.

Installation

The easy way

Install the pre-compiled version from GitHub:

$ pip install crispio

From source

Clone the repository, then cd into it. Then run:

$ pip install -e .

Command-line interface

The main way to use crispio is with its several subcommands. You can get help by entering crispio <subcommand> --help.

$ crispio --help
usage: crispio [-h] {generate,map,featurize,offtarget} ...

Design and analysis of bacterial CRISPRi experiments.

optional arguments:
  -h, --help            show this help message and exit

Sub-commands:
  {generate,map,featurize,offtarget}
                        Use these commands to specify the tool you want to use.
    generate            Generate and annotate all guide RNAs for a given genome.
    map                 Map and annotate provided guide RNAs to a given genome.
    featurize           Annotate guide RNAs with additional calculated features.
    offtarget           Compare two sets of guide RNAs for potential cross-target activity.

Generating new guide RNAs

Given a genome in FASTA format and a matching GFF, both available for your favourite bacterium from NCBI, along with a PAM sequence or name of a common Cas9 ortholog, you can generate all the possible guide RNAs and annotate them from the GFF in one go.

The command crisio generate finds the position on the genome, annotates with genomic features, replichore, and sequence context, detects restriction sites, and gives each guide RNA a unique ID and a human-readable adjective-noun mnemonic.

$ crispio generate -l 20 --pam Sth1 -g EcoMG1655-NC_000913.3.fasta -a EcoMG1655-NC_000913.3.gff3 | head

πŸš€ Generating sgRNAs with the following parameters:
        ...

##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
##genome-sequence       NC_000913.3     1       6930
##genome-description    Escherichia coli str. K-12 substr. MG1655, complete genome
##genome-filename       63
#sgRNA-map      crispio 63      62
NC_000913.3     crispio protospacer     2       21      .       +       .       ID=sgr-e5373243;Name=thrL-21-modest_saddle;ann_Name=thrL;ann_end=255;ann_feature=gene;ann_gene=thrL;ann_gene_biotype=protein_coding;ann_locus_tag=_up-thrL;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=190;ann_strand=+;guide_context_down=TATGTCTCTGTGTGGATTAA;guide_context_up=CAGCACCCCAGGAACCCATA;guide_length=20;guide_re_sites=;guide_sequence=GCTTTTCATTCTGACTGCAA;guide_sequence_hash=19d8fdaa;mnemonic=modest_saddle;pam_end=28;pam_offset=-166;pam_replichore=R;pam_search=NNRGVAN;pam_sequence=CGGGCAA;pam_start=21;source_name=thrL-21-modest_saddle

...

NC_000913.3     crispio protospacer     180     199     .       +       .       ID=sgr-b71e7fa7;Name=thrL-199-bouncy_sabine;ann_Name=thrL;ann_end=255;ann_feature=gene;ann_gene=thrL;ann_gene_biotype=protein_coding;ann_locus_tag=b0001;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=190;ann_strand=+;guide_context_down=CACCATTACCACCACCATCA;guide_context_up=CAGATAAAAATTACAGAGTA;guide_length=20;guide_re_sites=;guide_sequence=CACAACATCCATGAAACGCA;guide_sequence_hash=3e6eb3a0;mnemonic=bouncy_sabine;pam_end=206;pam_offset=65;pam_replichore=R;pam_search=NNRGVAN;pam_sequence=TTAGCAC;pam_start=199;source_name=thrL-199-bouncy_sabine

The guides are output in GFF format, so it can be used directly as an annotation track in your favourite genome browser.

It can be dense for human beings to parse, so you can convert to a TSV using bioino ggf2table:

$ head guides.gff | bioino gff2table
...

seqid   source  feature start   end     score   strand  phase   ID      Name    ann_Name        ann_end ann_feature     ann_gene     ann_gene_biotype ann_locus_tag   ann_phase       ann_score       ann_seqid       ann_source      ann_start       ann_strand      guide_context_down    guide_context_up        guide_length    guide_re_sites  guide_sequence  guide_sequence_hash     mnemonic        pam_end       pam_offset      pam_replichore  pam_search      pam_sequence    pam_start       source_name
NC_000913.3     crispio protospacer     2       21      .       +       .       sgr-e5373243    thrL-21-modest_saddle   thrL    255  gene     thrL    protein_coding  _up-thrL        .       .       NC_000913.3     RefSeq  190     +       TATGTCTCTGTGTGGATTAA    CAGCACCCCAGGAACCCATA  20              GCTTTTCATTCTGACTGCAA    19d8fdaa        modest_saddle   28      -166    R       NNRGVAN CGGGCAA 21   thrL-21-modest_saddle

...

NC_000913.3     crispio protospacer     180     199     .       +       .       sgr-b71e7fa7    thrL-199-bouncy_sabine  thrL    255  gene     thrL    protein_coding  b0001   .       .       NC_000913.3     RefSeq  190     +       CACCATTACCACCACCATCA    CAGATAAAAATTACAGAGTA  20              CACAACATCCATGAAACGCA    3e6eb3a0        bouncy_sabine   206     65      R       NNRGVAN TTAGCAC 199     thrL-199-bouncy_sabine

All the commands are pipeable. All the chatter goes to stderr, so you can pipe your actual data through stdout.

Mapping known guide RNAs to a genome

The command crispio map is similar, but takes known guides in FASTA format as input.

$ crispio map cv-nar-2020_TableS1.fasta -g EcoMG1655-NC_000913.3.fasta -a EcoMG1655-NC_000913.3.gff3 --pam Spy

πŸš€ Mapping sgRNAs with the following parameters:
       ...
Finding sgRNA sites matching 21417 sequences from cv-nar-2020_TableS1.fasta and matching PAM Spy (NGGN) in EcoMG1655-NC_000913.3.fasta...

##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
##genome-sequence       NC_000913.3     1       4641652
##genome-description    Escherichia coli str. K-12 substr. MG1655, complete genome
##genome-filename       EcoMG1655-NC_000913.3.fasta
#sgRNA-map      crispio EcoMG1655-NC_000913.3.fasta     EcoMG1655-NC_000913.3.gff3
NC_000913.3     crispio     2400946 2400965 .       +       .       ID=sgr-7a0a4f43;Name=nuoF-2400965-level_herman;ann_Name=nuoF;ann_end=2401555;ann_feature=gene;ann_gene=nuoF;ann_gene_biotype=protein_coding;ann_locus_tag=b2284;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=2400218;ann_strand=-;guide_context_down=TAGCACGACGTCCTTCCAGG;guide_context_up=CCATGCGCCGGAGGTTGCCG;guide_length=20;guide_re_sites=;guide_sequence=GGAAGGGTGGCTTCGAGCGT;guide_sequence_hash=6a359e86;mnemonic=level_herman;pam_end=2400969;pam_offset=0;pam_replichore=L;pam_search=NGGN;pam_sequence=GGGT;pam_start=2400965;source_name=GGAAGGGTGGCTTCGAGCGT

...

NC_000913.3     crispio protospacer     2400933 2400952 .       +       .       ID=sgr-32446b00;Name=nuoF-2400952-telling_austria;ann_Name=nuoF;ann_end=2401555;ann_feature=gene;ann_gene=nuoF;ann_gene_biotype=protein_coding;ann_locus_tag=b2284;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=2400218;ann_strand=-;guide_context_down=TTCGAGCGTGGGTTAGCACG;guide_context_up=AGGTCGGTTTACCCCATGCG;guide_length=20;guide_re_sites=;guide_sequence=CCGGAGGTTGCCGGGAAGGG;guide_sequence_hash=1908728f;mnemonic=telling_austria;pam_end=2400956;pam_offset=0;pam_replichore=L;pam_search=NGGN;pam_sequence=TGGC;pam_start=2400952;source_name=CCGGAGGTTGCCGGGAAGGG

If you don't have a FASTA to hand, but a table instead, you can pipe it through bioino table2fasta:

$ cat guide-table.tsv | bioino table2fasta -s sequence -n guide_name | crispio map -g EcoMG1655-NC_000913.3.fasta -a EcoMG1655-NC_000913.3.gff3 --pam Sth1 

##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
##genome-sequence       NC_000913.3     1       4641652
##genome-description    Escherichia coli str. K-12 substr. MG1655, complete genome
##genome-filename       EcoMG1655-NC_000913.3.fasta
#sgRNA-map      crispio EcoMG1655-NC_000913.3.fasta     EcoMG1655-NC_000913.3.gff3
NC_000913.3     crispio protospacer     1236    1255    .       +       .       ID=sgr-831073da;Name=thrA-1255-honest_brother;ann_Name=thrA;ann_end=2799;ann_feature=gene;ann_gene=thrA;ann_gene_biotype=protein_coding;ann_locus_tag=b0002;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=337;ann_strand=+;guide_context_down=TTCCAATCTGAATAACATGG;guide_context_up=CTCATTGGTGCCAGCCGTGA;guide_length=20;guide_re_sites=BbsI;guide_sequence=TGAAGACGAATTACCGGTCA;guide_sequence_hash=55934652;mnemonic=honest_brother;pam_end=1262;pam_offset=2462;pam_replichore=R;pam_search=NNRGVAN;pam_sequence=AGGGCAT;pam_start=1255;source_name=thrA-1255-honest_brother

...

NC_000913.3     crispio protospacer     3999    4018    .       +       .       ID=sgr-83d65199;Name=thrC-4018-jolly_lunar;ann_Name=thrC;ann_end=5020;ann_feature=gene;ann_gene=thrC;ann_gene_biotype=protein_coding;ann_locus_tag=b0004;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=3734;ann_strand=+;guide_context_down=TGTTCCACGGGCCAACGCTG;guide_context_up=CCCGGCTCCGGTCGCCAATG;guide_length=20;guide_re_sites=BtgZI;guide_sequence=TTGAAAGCGATGTCGGTTGT;guide_sequence_hash=e539f903;mnemonic=jolly_lunar;pam_end=4025;pam_offset=1286;pam_replichore=R;pam_search=NNRGVAN;pam_sequence=CTGGAAT;pam_start=4018;source_name=thrC-4018-jolly_lunar

Annotating with extra features

It may be useful to calcuate additional guide RNA features for downstream applications like machine learning. These are the available extra features:

>>> from crispio import *
>>> get_features()
['on_nontemplate_strand', 'context_up2', 'context_down2', 'context_up_autocorr', 'pam_n', 'pam_def', 'pam_gc', 'pam_autocorr', 'pam_scaff_corr', 'guide_purine', 'guide_gc', 'seed_seq', 'guide_start3', 'guide_end3', 'guide_autocorr', 'guide_scaff_corr']

Using the command-line these can be easily added to a GFF or piped output from crispio map or crispio generate.

$ cat mapped-guides.gff | head | crispio featurize --scaffold Spy

πŸš€ Featurizing sgRNAs with the following parameters:
 ...
> Generating features for guides...
##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
##genome-sequence       NC_000913.3     1       4641652
##genome-description    Escherichia coli str. K-12 substr. MG1655, complete genome
##genome-filename       EcoMG1655-NC_000913.3.fasta
#sgRNA-map      crispin EcoMG1655-NC_000913.3.fasta     EcoMG1655-NC_000913.3.gff3
NC_000913.3     crispin protospacer     2400946 2400965 .       +       .       ID=sgr-7a0a4f43;Name=nuoF-2400965-level_herman;ann_Name=nuoF;ann_end=2401555;ann_feature=gene;ann_gene=nuoF;ann_gene_biotype=protein_coding;ann_locus_tag=b2284;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=2400218;ann_strand=-;guide_context_down=TAGCACGACGTCCTTCCAGG;guide_context_up=CCATGCGCCGGAGGTTGCCG;guide_length=20;guide_re_sites=;guide_sequence=GGAAGGGTGGCTTCGAGCGT;guide_sequence_hash=6a359e86;mnemonic=level_herman;pam_end=2400969;pam_offset=0;pam_replichore=L;pam_search=NGGN;pam_sequence=GGGT;pam_start=2400965;source_name=GGAAGGGTGGCTTCGAGCGT;feat_on_nontemplate_strand=True;feat_context_up2=CG;feat_context_down2=TA;feat_context_up_autocorr=8.928;feat_pam_n=G;feat_pam_def=GGT;feat_pam_gc=0.750;feat_pam_autocorr=2.167;feat_pam_scaff_corr=1.917;feat_guide_purine=0.650;feat_guide_gc=0.650;feat_seed_seq=AGCGT;feat_guide_start3=GGA;feat_guide_end3=CGT;feat_guide_autocorr=8.704;feat_guide_scaff_corr=9.772

...

NC_000913.3     crispin protospacer     1764112 1764131 .       +       .       ID=sgr-f3815635;Name=sufA-1764131-scarce_game;ann_Name=sufA;ann_end=1764386;ann_feature=gene;ann_gene=sufA;ann_gene_biotype=protein_coding;ann_locus_tag=b1684;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=1764018;ann_strand=-;guide_context_down=ATCGCTTGCAGCGGGACAAA;guide_context_up=GTCCTTCACGAACGAAATCG;guide_length=20;guide_re_sites=;guide_sequence=ACTTCCGTGCCATCAATAAA;guide_sequence_hash=dc4688e0;mnemonic=scarce_game;pam_end=1764135;pam_offset=0;pam_replichore=R;pam_search=NGGN;pam_sequence=CGGC;pam_start=1764131;source_name=ACTTCCGTGCCATCAATAAA;feat_on_nontemplate_strand=True;feat_context_up2=CG;feat_context_down2=AT;feat_context_up_autocorr=7.159;feat_pam_n=C;feat_pam_def=GGC;feat_pam_gc=1.000;feat_pam_autocorr=2.333;feat_pam_scaff_corr=1.667;feat_guide_purine=0.450;feat_guide_gc=0.400;feat_seed_seq=ATAAA;feat_guide_start3=ACT;feat_guide_end3=AAA;feat_guide_autocorr=7.767;feat_guide_scaff_corr=10.528

The attributes starting with feat_ have been added.

Checking for off-targets

One downside of CRISPR-based tools is the possibility of off-target effects. crispio offtarget compares two GFF files, or one GFF file against itself, for guide RNA sites that share a seed sequence (4 PAM-proximal bases) and have a Hamming distance of 4 or less.

$ cat mapped-guides.gff | head | crispio offtarget -2 <(cat mapped-guides.gff)

##sequence-region NC_000913.3 1 4641652
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=511145
##genome-sequence       NC_000913.3     1       4641652
##genome-description    Escherichia coli str. K-12 substr. MG1655, complete genome
##genome-filename       EcoMG1655-NC_000913.3.fasta
#sgRNA-map      crispin EcoMG1655-NC_000913.3.fasta     EcoMG1655-NC_000913.3.gff3
#crosstalk-comparator   63
NC_000913.3     crispin protospacer     2400946 2400965 .       +       .       ID=sgr-7a0a4f43;Name=nuoF-2400965-level_herman;ann_Name=nuoF;ann_end=2401555;ann_feature=gene;ann_gene=nuoF;ann_gene_biotype=protein_coding;ann_locus_tag=b2284;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=2400218;ann_strand=-;guide_context_down=TAGCACGACGTCCTTCCAGG;guide_context_up=CCATGCGCCGGAGGTTGCCG;guide_length=20;guide_re_sites=;guide_sequence=GGAAGGGTGGCTTCGAGCGT;guide_sequence_hash=6a359e86;mnemonic=level_herman;pam_end=2400969;pam_offset=0;pam_replichore=L;pam_search=NGGN;pam_sequence=GGGT;pam_start=2400965;source_name=GGAAGGGTGGCTTCGAGCGT

...

NC_000913.3     crispin protospacer     1764112 1764131 .       +       .       ID=sgr-f3815635;Name=sufA-1764131-scarce_game;ann_Name=sufA;ann_end=1764386;ann_feature=gene;ann_gene=sufA;ann_gene_biotype=protein_coding;ann_locus_tag=b1684;ann_phase=.;ann_score=.;ann_seqid=NC_000913.3;ann_source=RefSeq;ann_start=1764018;ann_strand=-;guide_context_down=ATCGCTTGCAGCGGGACAAA;guide_context_up=GTCCTTCACGAACGAAATCG;guide_length=20;guide_re_sites=;guide_sequence=ACTTCCGTGCCATCAATAAA;guide_sequence_hash=dc4688e0;mnemonic=scarce_game;pam_end=1764135;pam_offset=0;pam_replichore=R;pam_search=NGGN;pam_sequence=CGGC;pam_start=1764131;source_name=ACTTCCGTGCCATCAATAAA

Python API

Some classes and functions are exposed in an API for generating guide RNAs in Python scripts.

Guides can be generated de novo.

from crispio import GuideLibrary

genome = "ATATATATATATATATATATATATACCGTTTTTTTAAAAAAACGGATATATATATATAATATATATATATAATATATATATATA"
gl = GuideLibrary.from_generating(genome=genome) 
for match_collection in gl:
    for guide in match_collection:
            print(guide)

The above code would return:

ATACCGTTTTTTTAAAAAAA
ATACCGTTTTTTTAAAAAAA

Or known guide sequences can be mapped to a genome.

from crispio import GuideLibrary

genome = "CCCCCCCCCCCTTTTTTTTTTAAAAAAAAAATGATCGATCGATCGAGGAAAAAAAAAACCCCCCCCCCC"
guide_seq = "ATGATCGATCGATCG"
gl = GuideLibrary.from_mapping(guide_seq=guide_seq, genome=genome) 

for collection in gl:
    for match in collection:
            print(match.as_dict())

This code would return:

{'pam_search': 'NGG', 'guide_seq': 'ATGATCGATCGATCG', 'pam_seq': 'AGG', 'pam_start': 45, 'reverse': False, 'guide_context_up': 'CTTTTTTTTTTAAAAAAAAA', 'guide_context_down': 'AAAAAAAAAACCCCCCCCCC', 'pam_end': 48, 'length': 15, 'guide_start': 30, 'guide_end': 45}

Check the full API in the documentation.

Issues, bugs, suggestions

Do not hesitate to add to our issue tracker.

Documentation

Check the documentation and full API here.