Find whole sentences matching a regex in Project Gutenberg plain text files.
```
gutengrep.py "^[^\w]*And then" "*.txt" --cache --sort --correct -o output/and-then.txt
gutengrep.py "^[^\w]*But why" "*.txt" --cache --sort --correct -o output/but-why.txt
gutengrep.py -i "whale" moby11.txt --sort --correct -o output/mobydick-whale.txt
```
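The core idea — split each book into sentences, then keep the sentences whose text matches the regex — can be sketched in a few lines. This is a minimal sketch, not gutengrep's actual implementation: the real tool uses a proper sentence tokeniser, while this uses a naive split on terminal punctuation; the function name `grep_sentences` is hypothetical.

```python
import re

def grep_sentences(text, pattern, ignore_case=False):
    """Return whole sentences in `text` that match `pattern`.

    Naive sketch: splits on ., ! or ? followed by whitespace, then
    applies the regex to each sentence (-i maps to ignore_case=True).
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    sentences = re.split(r"(?<=[.!?])\s+", text.replace("\n", " "))
    return [s for s in sentences if regex.search(s)]

text = "Call me Ishmael. And then the whale appeared! But why?"
# grep_sentences(text, r"^[^\w]*And then") → ['And then the whale appeared!']
```

Because each sentence is matched on its own, anchors like `^` in the patterns above apply to the start of the sentence, not the start of the file.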
Name | Sorted | Regex | Input | Word count |
---|---|---|---|---|
But why? | But why? | ^[^\w]*But why | *.txt | 7,572 |
And then! | And then! | [^\w]*And then | *.txt | 85,014 |
The whale | The whale | whale | moby11.txt | 50,913 |
Why | Why | [^\w]*Why | *.txt | 184,832 |
Once upon a time | Once upon a time | -i once upon a time | *.txt | 6,195 |
The End | The End | -i the end\. | *.txt | 142,94 |
Happily ever after | Happily ever after | -i happily ever after | *.txt | 271 |
Moonlit | Moonlit | -i moonlit | *.txt | 52,345 |
Moonlight | Moonlight | -i moonlight | *.txt | 3,186 |
See also nanogenmo.md.
Download the Project Gutenberg August 2003 CD, mount the ISO file, and copy all the text files from its `etext` directories into a single directory on your hard drive.
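Gathering the files into one flat directory could be scripted like this. A sketch only, with assumed names: the mount point (`/Volumes/PGCD`), the destination directory (`corpus`), and the `etext*` directory layout are all guesses you should adjust for your system.

```python
from pathlib import Path
import shutil

def collect_texts(mount, dest):
    """Copy every .txt file found under the CD's 'etext' directories
    into one flat destination directory, returning the copied names."""
    dest = Path(dest)
    dest.mkdir(exist_ok=True)
    copied = []
    # '**' also matches zero directories, so files directly inside
    # an etext* directory are picked up too.
    for txt in Path(mount).glob("etext*/**/*.txt"):
        shutil.copy(txt, dest / txt.name)
        copied.append(txt.name)
    return copied

# e.g. collect_texts("/Volumes/PGCD", "corpus")  # paths are assumptions
```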
When working on the whole corpus, use `--cache` to cut down on file operations. The first run builds a cache file of all the tokenised sentences; this pass takes about 5 minutes on my MBP to go through the 597 books of the Project Gutenberg CD and extract their 3,583,390 sentences. Subsequent runs using the cache take about 40 seconds.
If searching just a single file, or a subset of files, make sure not to use `--cache`, because it will reuse the cache file generated from the initial file spec rather than reading the files you asked for.
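The cache behaviour described above can be sketched as follows. This is an illustration, not gutengrep's actual code: the cache filename and the `load_sentences`/`tokenise` names are assumptions.

```python
import pickle
from pathlib import Path

CACHE = Path("gutengrep.cache")  # assumed cache filename

def load_sentences(filenames, tokenise, use_cache=True):
    """Return the tokenised sentences for a file spec.

    First call with use_cache=True pickles the sentence list; later
    calls load the pickle instead of re-reading the corpus. Note the
    cache records nothing about which files it was built from — which
    is why a new file spec must not reuse an old cache.
    """
    if use_cache and CACHE.exists():
        return pickle.loads(CACHE.read_bytes())
    sentences = []
    for name in filenames:
        sentences.extend(tokenise(Path(name).read_text()))
    if use_cache:
        CACHE.write_bytes(pickle.dumps(sentences))
    return sentences
```

Deleting the cache file (or running once without `--cache`) is the safe way to switch between file specs.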