A Bookworm query is a JSON object with keys describing the objects to be fetched. Each key is essentially a function argument. The syntax for queries is largely taken from MongoDB.
This definition is open for review, and will probably change before the 1.0 release.
I'm soliciting any comments. What statistics should be added? How should keys be arranged? What should they be named? Is the capitalization format driving you crazy?
You can think of each of the query keys as doing one of four things:
- Filtering the full library down to a set of interest.
- Grouping that set by one or more metadata fields.
- Calculating one or more summary statistics for each of the groups defined by the previous step, either a count or a ratio.
- Setting miscellaneous preferences for how filtering, grouping, and calculating should be done.
Some of these take one key, some take a few.
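As a rough illustration of the four roles, a query might be assembled as below. The specific values (the ChronAm database, the search word, the year grouping) are only examples, and placing a list of words under a words key inside search_limits is an assumption based on the description of that key, not its full syntax.

```python
import json

query = {
    # Filtering: restrict the library to the words or texts of interest.
    # The {"words": [...]} shape is assumed for illustration.
    "search_limits": {"words": ["giraffe"]},
    # Grouping: one result per year.
    "groups": ["year"],
    # Calculating: the summary statistics to return for each group.
    "counttype": ["WordsPerMillion"],
    # Preferences: which bookworm to query and how to format the result.
    "database": "ChronAm",
    "method": "return_json",
}

# A query is sent as a JSON object, so serialize the dict.
payload = json.dumps(query)
print(payload)
```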
search_limits is the workhorse function that lets you set the words or other fields to be searched. It has its own syntax: see 4.2.1 for details.
Unlike search_limits, compare_limits is rarely specified manually, but when used it allows particularly complicated queries. Many queries contain an implicit comparison: you wish to return the number of times a word in a set is used as a percentage of all the words used in that set. compare_limits allows you to specify the comparison explicitly.
By default, compare_limits will be the same as search_limits, but with the words key removed: this makes it trivial to search for the percentage of all words.
But this is not always the most immediately useful comparison. If you want to compare how often two words are used, you can put one in search_limits and one in compare_limits.
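The default described above can be sketched in a few lines. This is only an illustration of the rule, not the server's code; the function name default_compare_limits and the "city" filter are hypothetical.

```python
import copy

def default_compare_limits(search_limits):
    """Return a copy of search_limits with the "words" key removed."""
    compare_limits = copy.deepcopy(search_limits)
    compare_limits.pop("words", None)
    return compare_limits

# Hypothetical limits: one word, restricted to one city.
search_limits = {"words": ["giraffe"], "city": ["Boston"]}
print(default_compare_limits(search_limits))  # {'city': ['Boston']}

# Comparing two words directly: one in each set of limits.
explicit_query = {
    "search_limits": {"words": ["giraffe"]},
    "compare_limits": {"words": ["zebra"]},
}
```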
groups is an array of metadata fields describing what metadata should be returned. Each entry represents an additional layer of complexity: for example, specifying "groups":["year"] will group only by year, while "groups":["year","city"] will group by both year and city.
Be very careful with the choices, because too many groups can quickly make a query unmanageable. If year and city each have 100 distinct values, grouping by both could return as many as 10,000 rows. (Interactions which do not exist in the source data will not be returned, so the actual number will probably be somewhat lower.)
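The upper bound is just the product of the number of distinct values in each grouping field. A minimal sketch, with a hypothetical helper name and made-up cardinalities:

```python
from math import prod

def max_rows(distinct_values_per_group):
    """Upper bound on returned rows: the product of each field's cardinality."""
    return prod(distinct_values_per_group.values())

print(max_rows({"year": 100, "city": 100}))  # 10000
```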
Possible fields include any of the user-defined metadata, as well as "unigram" or "bigram" to return wordcount data.
Grouping by "unigram" or "bigram" can be quite slow, and for the time being should only be attempted on subcorpora of, say, 1 million words or less at a time. (On larger corpora, you'll just end up timing out.)
Ordinarily, each ratio summary statistic ("Percentage of Books," say) refers directly to the interaction of group A and group B. Sometimes, this is less than useful.
Ordinarily a query like
{"groups":["year","library"],"counttype":["TextPercent"]}
will give, for each interaction of year and library, the percentage of texts from that particular library in that year that come from that library in that year. That's not interesting. (By definition, it will always be 100%.)
On the other hand,
{"groups":["year","*library"],"counttype":["TextPercent"]}
will drop the library grouping on the superset and give the percentage of all texts for that year that come from the library, so each column will sum to 100%;
{"groups":["*year","library"],"counttype":["TextPercent"]}
will drop the year superset and give the percentage of all texts for that library that come from that year and library.
Finally,
{"groups":["*year","*library"], "counttype":["TextPercent"]}
will drop both and give the percentage of all texts for the library defined by search_limits or constrain_limits contained in each cell: the sum of all the TextPercent cells in the entire return set should be 100. (Though it may not be if year or library is undefined for some items).
Combining this syntax with that for defining a separate compare_limits will produce some pretty nonsensical queries, so it's generally better to do just one or the other.
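The arithmetic behind the starred groups can be worked through locally. The sketch below is not how the server computes anything; it uses made-up text counts per (year, library) cell, and the drop_year and drop_library flags stand in for the "*year" and "*library" markers.

```python
from collections import defaultdict

# Hypothetical text counts for each (year, library) cell.
counts = {
    (1900, "A"): 30,
    (1900, "B"): 10,
    (1901, "A"): 20,
    (1901, "B"): 40,
}

def text_percent(counts, drop_year=False, drop_library=False):
    """Percentage of texts in each cell relative to its superset.

    A dropped field is ignored when building the denominator,
    mirroring a "*" prefix on that field in groups.
    """
    def superset_key(year, library):
        return (None if drop_year else year,
                None if drop_library else library)

    totals = defaultdict(int)
    for (year, library), n in counts.items():
        totals[superset_key(year, library)] += n
    return {cell: 100.0 * n / totals[superset_key(*cell)]
            for cell, n in counts.items()}

print(text_percent(counts))                     # every cell is 100.0
print(text_percent(counts, drop_library=True))  # each year sums to 100
print(text_percent(counts, drop_year=True, drop_library=True))
```

With both fields dropped, the four cells come out as 30, 10, 20, and 40 percent, which sum to 100.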
Example: "counttype":["WordsPerMillion"]
Counttype is an array of commands that specify what summary statistics will be returned.
The most commonly used values are:
- WordCount: The number of words matching the terms in search_limits for each group. (If no words key is specified, the sum of all the words in the book.)
- TextCount: The number of texts matching the constraints on search_limits for each group.
- WordsPerMillion: The number of words in the search_limits per million words in the broader set. (Words per million, rather than percent, gives a more legible number).
- TextPercent: The percentage of texts in the broader group matching the search terms.
Also permanently available are:
- TotalTexts: The number of texts matching the constraints on compare_limits. (By selecting TextCount and TotalTexts, you can derive TextPercent locally, if you prefer).
- TotalWords: The number of words in the larger set.
Currently available, and useful in some specialized cases involving comparisons, are:
- WordsRatio: equal to WordCount/TotalWords
- SumWords: equal to TotalWords + WordCount
- TextRatio: equal to TextCount/TotalTexts
- SumTexts: equal to TextCount + TotalTexts
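Since the ratio statistics are defined in terms of the four basic counts, they can be derived locally. The sketch below follows the definitions listed above; the function name and the sample numbers are made up, and the scaling for WordsPerMillion and TextPercent is read off their descriptions.

```python
def derived_counts(word_count, total_words, text_count, total_texts):
    """Derive the ratio and sum statistics from the four basic counts."""
    return {
        "WordsPerMillion": 1_000_000 * word_count / total_words,
        "TextPercent": 100 * text_count / total_texts,
        "WordsRatio": word_count / total_words,
        "SumWords": total_words + word_count,
        "TextRatio": text_count / total_texts,
        "SumTexts": text_count + total_texts,
    }

# Hypothetical counts for a single group.
stats = derived_counts(word_count=250, total_words=5_000_000,
                       text_count=40, total_texts=200)
print(stats["WordsPerMillion"])  # 50.0
print(stats["TextPercent"])      # 20.0
```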
Example: {"database":"ChronAm"}
A single server can contain several bookworms: this is a string describing which one to run queries on.
The method key sets the type of results to be returned. For standard queries, this should be one of:
- return_json: a JSON-formatted result, consisting of nested dicts for each grouping in groups pointing to an array consisting of the results for each count in counttype.
- return_tsv: a tsv, with columns corresponding to each grouping in groups and each counttype in counttype.
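For local analysis, a return_json result can be flattened into rows that resemble the return_tsv layout. The nested shape used here (one dict level per entry in groups, ending in an array with one value per counttype) is an assumption read off the description above, and the sample data is invented.

```python
def flatten(result, depth, prefix=()):
    """Yield (group values..., count values...) rows from a nested result."""
    if depth == 0:
        # At the innermost level the value is the array of counts.
        yield prefix + tuple(result)
        return
    for key, value in result.items():
        yield from flatten(value, depth - 1, prefix + (key,))

# Hypothetical result for groups ["year", "city"] and two counttypes.
nested = {
    "1900": {"Boston": [12, 3.4], "Chicago": [7, 1.9]},
    "1901": {"Boston": [15, 4.0]},
}
rows = list(flatten(nested, depth=2))
for row in rows:
    print("\t".join(str(cell) for cell in row))
```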
There are also some special methods that override other settings:
- returnPossibleFields: gives a list of fields that can be used as groups or in search_limits, with some data about their type. All fields but "database" are ignored.
- search_results: returns an array of HTML strings, one for each text that matches the current search_limits, each of which can be displayed to the user. "Groupings" is ignored, and "counttype" is used in a special way (see ordertype). By default only the first 100 results are returned; there is currently no way to page past them.
In progress: comments welcome
When method is "search_results", the books are sorted before being returned. This sort ordering can be controlled.
By default, results are sorted by the percentage of hits in the text. That biases towards either texts that use the words a lot, or texts that use it rarely.
Often you want not the top texts, but some representative texts. For this purpose, results can be sorted randomly.
Currently, random sorting is handled in an interesting way. If the counttype relies on the number or ratio of texts, it sorts the texts in random order.
If the counttype relies on the number or ratio of words, however, it tries to sort the texts randomly, weighted by the number of times the words appear in each text. This means that a random word from the first text should represent a random usage from the overall sample.
The current MySQL-python implementation uses an approximation for this: LOG(1-RAND())/sum(main.count) that should mimic a weighted random ordering for most distributions, but in some cases it may not behave as intended.
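To see how a key of this form behaves, here is a small Python sketch. It is not the Bookworm implementation: the function name is hypothetical, the per-text counts are invented, and sorting in descending order of the key is an assumption (the expression above does not state the sort direction).

```python
import math
import random

def weighted_random_order(counts, rng=random):
    """Order texts by the key log(1 - U) / count, largest key first.

    counts maps a text id to a positive hit count; U is uniform on [0, 1).
    The key is never positive, and a larger count pulls it toward zero,
    so texts with more hits tend to sort earlier.
    """
    keyed = [(math.log(1 - rng.random()) / count, text)
             for text, count in counts.items()]
    keyed.sort(reverse=True)  # assumed: descending order of the key
    return [text for _, text in keyed]

# Hypothetical hit counts per text.
hits = {"text_a": 9, "text_b": 1}
print(weighted_random_order(hits, random.Random(0)))
```

Run repeatedly, text_a comes first in roughly nine out of ten orderings in this sketch, which matches its share of the hits.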
In progress. True weighted random ordering will be more expensive in time but potentially useful.
Depending on the usefulness of search ordering, this could be extended to support:
- TF-IDF ordering: weight results by the distinctiveness of words, not just raw frequency.
Example: "words_collation":"case_sensitive"
A string representing how to handle case matching on the "words" term in groupings.
Possible values:
- case_sensitive: match the string exactly as entered.
- case_insensitive: match the string regardless of case.
- stem: use the Porter stemming algorithm to find all words with the same stem: so "giraffes" will produce a hit when you search for "giraffe". Not supported on most new bookworms, but there are plans to restore it.
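The difference between the first two values can be sketched as a simple comparison. This is an illustration only, with a hypothetical function name; the real matching happens in the database, and stemming is left out because it would need a Porter stemmer.

```python
def matches(term, token, words_collation="case_sensitive"):
    """Return True if token counts as a hit for term under the collation."""
    if words_collation == "case_sensitive":
        return term == token
    if words_collation == "case_insensitive":
        return term.casefold() == token.casefold()
    if words_collation == "stem":
        raise NotImplementedError("stemming is not covered by this sketch")
    raise ValueError(f"unknown words_collation: {words_collation!r}")

print(matches("Giraffe", "giraffe"))                      # False
print(matches("Giraffe", "giraffe", "case_insensitive"))  # True
```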
In progress: I'm inclined to think this should be eliminated and instead users could specify 'casesens','case_insens' or 'stem' directly, and the API would translate the results appropriately. It's slightly uglier, but would allow more complicated queries (such as mixing case sensitive and insensitive in the same limits, or using separate values for groupings and search limits)
If you build a web or analysis app using Bookworm, you're encouraged to use the dict to add other keys storing other elements of the state. For example, the layout preferences for the D3 bookworm are stored in an aesthetic field which maps to a dictionary; and both GUIs use a field called smoothingSpan to represent smoothing.
The advantage of doing this is state persistence for RESTful apps, portability, and helpfulness for the logs.
We may need to reserve a few keys for our own use down the road. So if you do define something, avoid using the following unless you're contributing to a core project:
D3-bookworm reserved
- aesthetic
Future authentication needs
- key
- token
- user
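An app that stores its own state in the query dict could guard against the reserved names with a small check. The helper below is a hypothetical sketch; only the four key names come from the list above.

```python
RESERVED_KEYS = {"aesthetic", "key", "token", "user"}

def with_app_state(query, **state):
    """Return a copy of query with app-specific keys added.

    Raises ValueError if any of the new keys is on the reserved list.
    """
    clashes = sorted(RESERVED_KEYS.intersection(state))
    if clashes:
        raise ValueError(f"reserved keys: {clashes}")
    merged = dict(query)
    merged.update(state)
    return merged

query = with_app_state({"database": "ChronAm"}, smoothingSpan=5)
print(query)  # {'database': 'ChronAm', 'smoothingSpan': 5}
```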