Skip to content

R script for finding and scoring motifs in small genomes using position-frequency tables

Notifications You must be signed in to change notification settings

acvill/findMotifs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

findMotifs

findMotifs is an easy-to-implement R script for finding and scoring motifs in small genomes or metagenomic contigs using position-frequency tables, such as those generated by HT-SELEX experiments. For large genomes, you're better off using motifmatchr or similar software that uses the MOODS C++ library. For the discovery of novel motifs from sequencing data, consider HOMER or rmotifx.

Inputs

  1. fasta file (-f,--fasta)
  2. position-frequency table (-t,--freq_table)
  3. minimum score for matches (-c,--cutoff; default is 90)
  4. output directory (-o,--out_dir; default is getwd())

The position-frequency table must be a concatenated list of comma-delimited position-frequency matrices, with the name of each motif preceding its corresponding matrix. For an example, see the structure of PFMs.txt:

motif_A
A,89,0,41,0,0,28,34,0,23,34,0,90
C,9,94,9,11,95,0,0,92,28,51,100,0
G,2,0,24,0,5,72,47,8,17,15,0,0
T,0,6,26,89,0,0,19,0,32,0,0,10
motif_B
A,0,48,8,0,0,100,37,28,23,0,29,1,0
C,100,0,35,24,0,0,0,9,34,100,71,0,0
G,0,52,41,16,100,0,63,36,35,0,0,0,100
T,0,0,16,60,0,0,0,27,8,0,0,99,0
motif_C
A,0,0,93,0,31,25,0,0,0,29
C,0,71,0,0,0,52,0,0,100,59
G,7,0,0,47,69,23,90,0,0,0
T,93,29,7,53,0,0,10,100,0,12

Trailing commas are OK, as all empty fields get converted to NA and trimmed during import.

Outputs

This script returns a single tab-delimited file named with the following fields:

Field Description
motif_name the name of the motif as it appears in the position-frequency table
motif_width the number of nucleotides in the motif
fasta the name of the input fasta
fasta_length the number of nucleotides in the input fasta
strand the strand the match is found on (+ or -)
match the nucleotide sequence of the match
score the score of the match as a proportion of the max score (0 to 1)
concat_start the position of the first nucleotide of the match in the + concatenated fasta file
concat_end the position of the last nucleotide of the match in the + concatenated fasta file. For matches on the + strand, concat_start < concat_end. For matches on the - strand, concat_start > concat_end.
contig_index the contig number on which the match is located
contig_length the number of nucleotides in the contig containing the match
contig_start the position of the first nucleotide in the match relative to the + start of the contig
contig_end the position of the last nucleotide in the match relative to the + start of the contig. For matches on the + strand, contig_start < contig_end. For matches on the - strand, contig_start > contig_end.
contig_header the defline of the contig

Dependencies

R (version >= 4.0.0) with the following packages:

Example

To run findMotifs.R from the command line, ensure R is in your PATH and call the script with Rscript.

export PATH=/programs/R-4.0.0/bin:$PATH  
Rscript findMotifs.R -f EcoliC.fa -t PFMs.txt -c 90

Progress will print to the command line, and results will be written to <fasta>_<cutoff>_findmotifs.txt in the directory specified by --out_dir. To check for correct behavior, run the script using the example files and compare the output to EcoliC.fa_90_findmotifs.txt using md5sum or an equivalent checksum method. Because progress is monitored with txtProgressBar(), jobs submitted to a cluster will produce a jumbled outfile log.

Use -h or --help to print the help message.

Rscript findmotifs.R -h
Options:
    -f CHARACTER, --fasta=CHARACTER
            Required: input directory and filename of fasta sequence

    -t CHARACTER, --freq_table=CHARACTER
            Required: input directory and filename of comma-delimited position frequency table

    -c CHARACTER, --cutoff=CHARACTER
            Input cutoff as integer <= 100 (default: 90)

    -o CHARACTER, --out_dir=CHARACTER
            Output directory without trailing "/" (default: getwd())

    -h, --help
            Show this help message and exit

Details

This script is a wrapper for the PWM() and matchPWM() functions from the Biostrings package. (see RDocumentation)

PWM() is used to convert the input position-frequency matrix into a position-weight matrix using type = "log2probratio". The parameters of the Dirichlet conjugate prior are adjusted to account for the GC content of the input fasta. To save time, the number of matchPWM() calls are reduced to one per motif by first concatenating the input fasta into a single sequence with contigs demarcated by non-IUPAC characters.

There are three parameters that affect the runtime of the script.

  1. The length of the input fasta
  2. The number of motifs represented in the position-frequency table
  3. The information content of each motif. Motifs with lower information content will have lower maximum scores, and therefore more matches above the cutoff threshold.

Future implementations should further increase speed by parallelizing matchPWM() calls.

About

R script for finding and scoring motifs in small genomes using position-frequency tables

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages