JonathanReeve/macro-etym

The Macro-Etymological Analyzer

Have you ever wanted to know what proportion of your text is words of Latinate or Germanic origin?

This is a command-line tool for macro-etymological text analysis. It looks up all the words in your text in an etymological dictionary (the Etymological Wordnet), and compiles statistics about them.

New Features in this version

  • The web interface has been replaced with a command-line interface, which makes the MEA scriptable and lets other programs read and write its data. A web front-end to the command-line interface will be possible in a future version.
  • It is now possible to analyze and compare multiple texts at a time.
  • Users can filter for only those language families they care about.

Installation

You can install this program with git and pip:

git clone https://github.com/JonathanReeve/macro-etym
cd macro-etym
pip install .

If you experience errors, you could try installing with pip3 instead:

pip3 install .

And you'll probably need some NLTK data, if you don't have it already:

python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')"

Usage

To compute the macro-etymology of a text, just give the filename of a text in your current working directory:

macroetym moby-dick.txt

               moby-dick.txt
Austronesian   0.050381
Balto-Slavic   0.028789
Celtic         0.115158
Germanic      35.710858
Hellenic       0.964445
Indo-Iranian   0.127153
Japonic        0.019193
Latinate      62.415431
Other          0.237513
Semitic        0.230315
Turkic         0.071974
Uralic         0.028789
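
The figures in this table sum to roughly 100, so they appear to be percentages of the words that were looked up. As a minimal sketch of working with the output in a script, the Python below parses a copy of the table above and computes the total and the Latinate-to-Germanic ratio. The helper name parse_table is hypothetical and not part of macroetym, and it only handles the single-text table shown here.

```python
# A copy of the sample output shown above, pasted in as a string.
SAMPLE_OUTPUT = """\
               moby-dick.txt
Austronesian   0.050381
Balto-Slavic   0.028789
Celtic         0.115158
Germanic      35.710858
Hellenic       0.964445
Indo-Iranian   0.127153
Japonic        0.019193
Latinate      62.415431
Other          0.237513
Semitic        0.230315
Turkic         0.071974
Uralic         0.028789
"""


def parse_table(text):
    """Parse a single-text table into a {family: value} dict.

    Hypothetical helper, not part of macroetym.
    """
    rows = {}
    lines = text.strip().splitlines()
    for line in lines[1:]:  # the first line holds only the filename
        family, value = line.rsplit(None, 1)  # split off the last column
        rows[family.strip()] = float(value)
    return rows


shares = parse_table(SAMPLE_OUTPUT)
total = sum(shares.values())
ratio = shares["Latinate"] / shares["Germanic"]
print(f"total: {total:.2f}")
print(f"Latinate/Germanic ratio: {ratio:.2f}")
```

For anything beyond a quick check, the --csv option described below is likely a sturdier input than the pretty table.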

To compare the macro-etymologies of two or more texts, supply them as arguments:

macroetym moby-dick.txt pride-and-prejudice.txt

To see that data represented in a chart (experimental), try appending --chart. You might be better off, though, outputting the data as a CSV (with --csv) and then making your own chart in spreadsheet software.
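
If you would rather stay in a script than open a spreadsheet, here is a rough sketch that turns CSV text into plain-text bars using only the Python standard library. The CSV layout is an assumption, since this README does not show what --csv prints: the sketch expects a header row of filenames and one row per language family, mirroring the table above. The sample values are copied from the moby-dick.txt table, and text_bars is a hypothetical helper.

```python
import csv
import io

# ASSUMED layout of the --csv output (not confirmed by this README):
# an empty first header cell, then one column per text.
SAMPLE_CSV = """\
,moby-dick.txt
Germanic,35.710858
Hellenic,0.964445
Latinate,62.415431
"""


def text_bars(csv_text, width=40):
    """Return one text bar per (family, text) pair.

    Hypothetical helper; values are treated as percentages of `width`.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # ['', 'moby-dick.txt', ...]
    lines = []
    for row in reader:
        family = row[0]
        values = [float(v) for v in row[1:]]
        for name, value in zip(header[1:], values):
            bar = "#" * round(value / 100 * width)
            lines.append(f"{family:<12} {name:<16} {bar} {value:.1f}")
    return lines


for line in text_bars(SAMPLE_CSV):
    print(line)
```

Check the real --csv output before relying on this, and adjust the parsing if the rows and columns turn out to be arranged differently.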

To see a full list of options, run:

macroetym --help

That should show you this screen:

Usage: macroetym [OPTIONS] FILENAMES...

  Analyzes a text(s) for the etymologies of its words, and tallies the
  words by origin language, and origin language family.

Options:
  --allstats           Get all etymological statistics about the file(s).
  --lang TEXT          Specify the language of the texts. Use ISO639-3 three-
                       letter language code. Default is English.
  --showfamilies TEXT  A comma-separated list of language families to show,
                       e.g. Latinate,Germanic
  --affixes            Don't ignore affixes. Default is to ignore them.
  --current            Don't ignore current language and its middle variants.
                       Default is to ignore them.
  -c, --csv            Print a machine-readable CSV instead of a pretty
                       table.
  --chart              Make a pretty graph of the results. For one text, a
                       pie; for multiple, a bar.
  --verbose            Show debugging messages.
  --help               Show this message and exit.
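
Because the tool is a command-line program, it can also be driven from another script. The sketch below assembles a command line from options listed in the help screen above (--showfamilies and --csv) and runs it with Python's subprocess module. The function names build_command and run_macroetym are hypothetical wrappers, not part of macroetym, and running the second one assumes macroetym is installed and on your PATH.

```python
import subprocess


def build_command(filenames, families=None, csv=False):
    """Build a macroetym argument list from options in the help screen.

    Hypothetical helper, not part of macroetym.
    """
    cmd = ["macroetym"]
    if families:
        # --showfamilies takes a comma-separated list, e.g. Latinate,Germanic
        cmd += ["--showfamilies", ",".join(families)]
    if csv:
        cmd.append("--csv")
    cmd += list(filenames)
    return cmd


def run_macroetym(filenames, **options):
    """Run macroetym and return whatever it prints to standard output."""
    result = subprocess.run(
        build_command(filenames, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


command = build_command(
    ["moby-dick.txt", "pride-and-prejudice.txt"],
    families=["Latinate", "Germanic"],
    csv=True,
)
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with filenames that contain spaces.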

