findMotifs is an easy-to-implement R script for finding and scoring motifs in small genomes or metagenomic contigs using position-frequency tables, such as those generated by HT-SELEX experiments. For large genomes, you're better off using motifmatchr or similar software that uses the MOODS C++ library. For the discovery of novel motifs from sequencing data, consider HOMER or rmotifx.
- fasta file (
-f
,--fasta
) - position-frequency table (
-t
,--freq_table
) - minimum score for matches (
-c
,--cutoff
; default is 90) - output directory (
-o
,--out_dir
; default isgetwd()
)
The position-frequency table must be a concatenated list of comma-delimited position-frequency matrices, with the name of each motif preceding its corresponding matrix. For an example, see the structure of PFMs.txt
:
motif_A
A,89,0,41,0,0,28,34,0,23,34,0,90
C,9,94,9,11,95,0,0,92,28,51,100,0
G,2,0,24,0,5,72,47,8,17,15,0,0
T,0,6,26,89,0,0,19,0,32,0,0,10
motif_B
A,0,48,8,0,0,100,37,28,23,0,29,1,0
C,100,0,35,24,0,0,0,9,34,100,71,0,0
G,0,52,41,16,100,0,63,36,35,0,0,0,100
T,0,0,16,60,0,0,0,27,8,0,0,99,0
motif_C
A,0,0,93,0,31,25,0,0,0,29
C,0,71,0,0,0,52,0,0,100,59
G,7,0,0,47,69,23,90,0,0,0
T,93,29,7,53,0,0,10,100,0,12
Trailing commas are OK, as all empty fields get converted to NA
and trimmed during import.
This script returns a single tab-delimited file named with the following fields:
Field | Description |
---|---|
motif_name |
the name of the motif as it appears in the position-frequency table |
motif_width |
the number of nucleotides in the motif |
fasta |
the name of the input fasta |
fasta_length |
the number of nucleotides in the input fasta |
strand |
the strand the match is found on (+ or - ) |
match |
the nucleotide sequence of the match |
score |
the score of the match as a proportion of the max score (0 to 1) |
concat_start |
the position of the first nucleotide of the match in the + concatenated fasta file |
concat_end |
the position of the last nucleotide of the match in the + concatenated fasta file. For matches on the + strand, concat_start < concat_end . For matches on the - strand, concat_start > concat_end . |
contig_index |
the contig number on which the match is located |
contig_length |
the number of nucleotides in the contig containing the match |
contig_start |
the position of the first nucleotide in the match relative to the + start of the contig |
contig_end |
the position of the last nucleotide in the match relative to the + start of the contig. For matches on the + strand, contig_start < contig_end . For matches on the - strand, contig_start > contig_end . |
contig_header |
the defline of the contig |
R (version >= 4.0.0) with the following packages:
- Biostrings (tested version: 2.56.0)
- optparse (tested version: 1.6.6)
- stringr (tested version: 1.4.0)
To run findMotifs.R
from the command line, ensure R is in your PATH
and call the script with Rscript
.
export PATH=/programs/R-4.0.0/bin:$PATH
Rscript findMotifs.R -f EcoliC.fa -t PFMs.txt -c 90
Progress will print to the command line, and results will be written to <fasta>_<cutoff>_findmotifs.txt
in the directory specified by --out_dir
. To check for correct behavior, run the script using the example files and compare the output to EcoliC.fa_90_findmotifs.txt
using md5sum
or an equivalent checksum method. Because progress is monitored with txtProgressBar()
, jobs submitted to a cluster will produce a jumbled outfile log.
Use -h
or --help
to print the help message.
Rscript findmotifs.R -h
Options:
-f CHARACTER, --fasta=CHARACTER
Required: input directory and filename of fasta sequence
-t CHARACTER, --freq_table=CHARACTER
Required: input directory and filename of comma-delimited position frequency table
-c CHARACTER, --cutoff=CHARACTER
Input cutoff as integer <= 100 (default: 90)
-o CHARACTER, --out_dir=CHARACTER
Output directory without trailing "/" (default: getwd())
-h, --help
Show this help message and exit
This script is a wrapper for the PWM()
and matchPWM()
functions from the Biostrings
package. (see RDocumentation)
PWM()
is used to convert the input position-frequency matrix into a position-weight matrix using type = "log2probratio"
. The parameters of the Dirichlet conjugate prior are adjusted to account for the GC content of the input fasta. To save time, the number of matchPWM()
calls are reduced to one per motif by first concatenating the input fasta into a single sequence with contigs demarcated by non-IUPAC characters.
There are three parameters that affect the runtime of the script.
- The length of the input fasta
- The number of motifs represented in the position-frequency table
- The information content of each motif. Motifs with lower information content will have lower maximum scores, and therefore more matches above the cutoff threshold.
Future implementations should further increase speed by parallelizing matchPWM()
calls.