Munro alpha

Pre-release

@bmschmidt released this 14 Mar 22:37

This is the first version of a release that should be much faster in a number of ways. It's been tested on a couple of sets, but is not yet thoroughly debugged. I'm tagging it as a pre-release for that reason, but would love for anyone to try it.

I'm calling it Munro because we should name these releases after authors, obviously, and the distinguishing feature of this release is that it handles large numbers of short texts much better than the old ones.

Speed Improvements

The big change is that where previous versions required an individual file for every text, the new version handles texts in chunks, which greatly reduces the number of disk reads required. I haven't fully benchmarked it, but for the History dissertation set (30,000 extremely small texts) it reduced the overall build time (including database formatting, which hasn't changed) from about 30 seconds to 8 seconds.

Relatedly, it enables a new input format: instead of a directory of files, you can now upload a single text file containing your entire archive. The first word of each line is the filename; it should be followed by a tab and then the full text of the document that filename identifies. One outstanding question is whether lines past a certain size might somehow cause Python to break. It's certainly not possible to use a file larger than the computer's available RAM, but it's hard to imagine that happening nowadays.
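
For concreteness, here is a minimal sketch of reading that format; the function and file names are illustrative rather than part of the Bookworm code, and it assumes the first tab on each line separates the identifier from the text.

```python
# A minimal sketch of reading the single-file input format, assuming each line
# is "<filename><TAB><full text>"; names here are illustrative, not Bookworm's.

def read_archive(path):
    """Yield (filename, text) pairs from a one-line-per-document archive."""
    with open(path) as archive:
        for line in archive:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split only on the first tab: the text itself may contain tabs.
            filename, _, text = line.partition("\t")
            yield filename, text

for filename, text in read_archive("input.txt"):
    print(filename, len(text.split()))
```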

The intermediary stages which stored tables of wordcounts for every text now use the cPickle module to store batches: specifically, a new python object of class tokenBatches that holds the counts for about 10MB worth of texts at a time. That 10MB number is arbitrary, but seems to work; we might want to choose it dynamically down the road. For one Bookworm with about 1.6m documents of about a paragraph each, that reduces the number of tokenization files down to 88. The wordcounting and tokenization scripts use native methods of the tokenBatches object to do their counting and encoding; avoiding the need to parse to and from CSVs seems to help speed things up a bit more.
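
The tokenBatches class itself isn't reproduced here; the following is only a rough sketch, in the spirit of that approach, of accumulating per-document word counts and pickling them once roughly 10MB of raw text has been seen. The function, class, and file names are placeholders, not the actual Bookworm API.

```python
# A rough sketch of batched pickling in the spirit of tokenBatches, not the
# actual Bookworm API; file and function names are placeholders.
import pickle  # cPickle in Python 2
from collections import Counter

BATCH_BYTES = 10 * 1024 * 1024  # the (arbitrary) ~10MB threshold from the notes

def dump_batches(documents, prefix="batch"):
    """documents is an iterable of (filename, text) pairs."""
    counts, size, batch_id = {}, 0, 0
    for filename, text in documents:
        counts[filename] = Counter(text.split())  # crude whitespace word counts
        size += len(text)
        if size >= BATCH_BYTES:
            with open("%s_%03d.pickle" % (prefix, batch_id), "wb") as f:
                pickle.dump(counts, f, protocol=2)
            counts, size, batch_id = {}, 0, batch_id + 1
    if counts:  # flush whatever is left over
        with open("%s_%03d.pickle" % (prefix, batch_id), "wb") as f:
            pickle.dump(counts, f, protocol=2)
```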

Code Improvements

Tokenization is now handled by a new python submodule, bookworm.tokenizer. The old version was a tangle of perl code with all sorts of Internet Archive-specific gobbledegook, based around substitutions that tried to surround words with spaces and capture English sentence-ending punctuation; the new one takes a MALLET-inspired approach and simply tries to capture all the elements of a word with one big regex. (In fact, it exports an object called bigregex that constitutes the Bookworm definition of a token: that has previously been a really opaque definition even for those who know the code, so this is a big step towards transparency.)
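
The real bigregex lives in bookworm.tokenizer; the snippet below is only an illustrative sketch of the general single-regex, MALLET-style idea using the third-party regex module, not the Bookworm token definition itself.

```python
# An illustrative sketch of the single-regex approach, not Bookworm's actual
# bigregex: one unicode-aware pattern captures word tokens in a single pass.
import regex  # the third-party 'regex' module, which supports \p{L} classes

# Letters from any script, optionally joined by internal apostrophes or hyphens.
word_pattern = regex.compile(r"\p{L}+(?:['\-]\p{L}+)*")

def tokenize(text):
    return word_pattern.findall(text)

tokens = tokenize(u"Munro's stories aren't long, but they're dense.")
# -> ["Munro's", 'stories', "aren't", 'long', 'but', "they're", 'dense']
```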

The old python code seemed to have been dropping unicode characters at a certain point: they are hopefully once again supported.

A number of extraneous old modules and scripts, including tokenizeAndEncodeFiles and the various perl scripts in /scripts, have been cleared out.

Architecture changes

One last change falls more in the neutral category.

This version continues a transition that most of my edits over the last year or so have been moving toward: away from the old model where the python script OneClick.py spawns everything, and towards a layout where a Makefile defines all the targets. This is primarily due to the issues we've had getting sensible parallelization to work in python, but it has some nice side effects as well; Make handles the progressive stages of builds quite nicely, and GNU parallel distributes jobs across processors extremely simply.

For the end user, the difference is simply that instead of calling python OneClick.py bookworm username password, you call make all bookwormName=bookworm. Some form of automatic web site creation should be included shortly.

New Dependencies

This requires a couple of new dependencies on any system.

Python packages:

  1. regex (To handle unicode regular expressions)
  2. cPickle

System utilities

  1. GNU parallel (Oversees job creation very easily)