A Clojure library designed to tell how often words appear in a text.
For the purposes of this library, a "word" is a sequence of letters, digits, and apostrophes. All other punctuation and white space is ignored. The apostrophe is treated as part of a word so that "don't" is not split into the nonsensical "don" and "t".
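As an illustration of this definition, here is a minimal sketch of what such a tokenizer could look like. It is not necessarily how the library does it: the tokenize name and the regular expression are assumptions made for this example, and lowercasing is included only because the library's output is lowercase.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of the library's API.
;; A "word" here is a run of Unicode letters, digits, or apostrophes;
;; everything else acts as a separator.
(defn tokenize
  [text]
  (re-seq #"[\p{L}\p{N}']+" (str/lower-case text)))

(tokenize "Don't stop, don't go!")
;; => ("don't" "stop" "don't" "go")
```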
A JDK is required to run and develop this application. To build it yourself, you will also need to install Leiningen (lein).
An "uberjar" is a single JAR file that contains all required libraries and can be run directly from the command line.
$ lein clean
$ lein uberjar
This will generate the file target/concordance.jar.
Once you have concordance.jar, you can run it from the command line as follows.
$ java -jar concordance.jar path/to/file.txt > results.txt
Each line of output contains a word, a space, and the number of times that word appears in the text.
Output is written to standard output, so redirect it to a file if you want to keep it.
If you pass the --help flag, you can see the command-line options.
$ java -jar concordance.jar --help
Utility for counting the frequency of words in a text.
Usage: concordance [-s ORDER] text.txt
Options:
-s, --sort ORDER alpha Sorting order. Must be one of "alpha" or "freq".
-h, --help
If no sorting option is passed, output will be alphabetical by default.
$ java -jar concordance.jar common-sense.txt
'tis 9
a 451
ability 2
able 11
ablest 1
abound 1
about 5
above 5
abroad 2
abrupt 1
...
If -s freq is passed, the output will be sorted by the most frequent words
first. When multiple words have the same frequency, they will be sorted
alphabetically.
$ java -jar concordance.jar -s freq les-misérable.txt
the 40569
of 19655
and 14788
a 14396
to 13777
in 11058
he 9588
was 8609
that 7778
it 6506
his 6444
...
When called from Clojure code, the library exposes a word-count function, as
well as the sorting functions alphabetical-order and frequency-order.
word-count accepts a single string and returns a map of words to frequency values.
(require '[concordance.core :as concordance])
(def meditation "No man is an island entire of itself; every man
is a piece of the continent, a part of the main;
if a clod be washed away by the sea, Europe
is the less, as well as if a promontory were, as
well as any manner of thy friends or of thine
own were; any man's death diminishes me,
because I am involved in mankind.
And therefore never send to know for whom
the bell tolls; it tolls for thee.")
(def counts (concordance/word-count meditation))
{"itself" 1 "thine" 1 "of" 5 "involved" 1 "continent" 1 "part" 1
"promontory" 1 "every" 1 "it" 1 "send" 1 "by" 1 "is" 3 "europe" 1 "away" 1
"sea" 1 "friends" 1 "for" 2 "thy" 1 "whom" 1 "therefore" 1 "because" 1
"any" 2 "were" 2 "main" 1 "if" 2 "man" 2 "diminishes" 1 "an" 1 "or" 1
"am" 1 "a" 4 "tolls" 2 "never" 1 "own" 1 "manner" 1 "bell" 1 "death" 1
"thee" 1 "entire" 1 "be" 1 "and" 1 "piece" 1 "i" 1 "less" 1 "island" 1
"no" 1 "well" 2 "clod" 1 "washed" 1 "to" 1 "mankind" 1 "know" 1 "as" 4
"me" 1 "the" 5 "in" 1 "man's" 1}
The exposed Comparator functions are designed to work with the core sort-by
function.
(sort-by concordance/frequency-order counts)
(["of" 5] ["the" 5] ["a" 4] ["as" 4] ["is" 3] ...)
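The frequency ordering described earlier (most frequent first, ties broken alphabetically) can be written as an ordinary comparator over map entries. The sketch below is independent of the library: by-frequency-then-word is a hypothetical name, not the library's frequency-order, and it is passed to the core sort function.

```clojure
;; Hypothetical comparator, not the library's implementation.
;; Each argument is a [word count] map entry. Swapping the counts in the
;; compared vectors puts higher counts first; the words break ties
;; in ascending alphabetical order.
(defn by-frequency-then-word
  [[word-a n-a] [word-b n-b]]
  (compare [n-b word-a] [n-a word-b]))

(sort by-frequency-then-word {"is" 3 "the" 5 "as" 4 "of" 5 "a" 4})
;; => (["of" 5] ["the" 5] ["a" 4] ["as" 4] ["is" 3])
```

Comparing two-element vectors keeps the tie-breaking rule in one expression: the second element is only consulted when the counts are equal.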
Concordance is designed to run against a single string or file at a time. As
such, it loads the entire text into memory in order to generate the
concordance map. The text is normalized (converted to lowercase) and then
broken into words. The resulting map has an entry for each unique word.
(So the worst case is a text in which every word is unique, such as
(word-count (slurp "/usr/share/dict/words")).) Sorting is then performed
against the resulting map.
On my (slow) computer, this ends up being pretty reasonable. Generating and
sorting the concordance for "Les Misérables" (one of the longest English books
in the public domain) takes about 1.6 seconds (plus JVM start-up overhead). A
concordance for the words file (235,886 words on my laptop) takes about 4
seconds.
A more memory-efficient approach would be to accept a stream of strings (or lines). The downside is added complexity: some core Clojure functions could no longer be used as-is and would have to be rewritten to provide the same functionality.
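For comparison, a line-at-a-time version might look roughly like the sketch below. This is not part of the library: streaming-word-count and count-line are hypothetical names, and the word pattern is an assumption based on the definition of a "word" given at the top of this document. Only the current line and the counts map are held in memory, though the map still grows with the number of unique words.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical helper: folds the words of one line into the running counts.
(defn count-line
  [counts line]
  (reduce (fn [acc word] (update acc word (fnil inc 0)))
          counts
          (re-seq #"[\p{L}\p{N}']+" (str/lower-case line))))

;; Hypothetical entry point. `source` is anything clojure.java.io/reader
;; accepts, e.g. a file path or a java.io.Reader. The reduce is eager, so
;; every line is consumed before with-open closes the reader.
(defn streaming-word-count
  [source]
  (with-open [rdr (io/reader source)]
    (reduce count-line {} (line-seq rdr))))

(streaming-word-count (java.io.StringReader. "Don't stop\ndon't go"))
;; => {"don't" 2, "stop" 1, "go" 1}
```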
Copyright © 2017 Michael S. Daines
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.