Initial import of files for _Anne of Green Garbles_.

Showing 28 changed files with 3,524 additions and 1 deletion.

.gitignore
@@ -0,0 +1,4 @@
*.json
*.txt
__pycache__/
aogg/

README.md
@@ -0,0 +1,184 @@

Anne of Green Garbles
=====================

Herein can be found the above-named generated novel, along with the
code for generating it, both of which were produced for [NaNoGenMo 2019][].

The generator is a collection of command-line programs
written in Python 3.6. The only Python dependency is
[Beautiful Soup][], used to extract the text from a source HTML file.

After downloading [the HTML version of _Anne of Green Gables_](http://www.gutenberg.org/files/45/45-h/45-h.htm)
from Project Gutenberg and saving it as `45-h.htm`, the novel
was generated by running the following command:

    ./build.py aogg 45-h.htm --random-seed=6010 --markov-order=1 --title "Anne of Green Garbles"

According to `wc -w`, the resulting novel,
_[Anne of Green Garbles](generated/Anne%20of%20Green%20Garbles.md)_,
comprises 54,638 words.

The software versions used were CPython 3.6.2, with
[beautifulsoup4](https://pypi.org/project/beautifulsoup4/) 4.6.0,
on Ubuntu 16.04.

Theory of Operation
-------------------

The main goal of this project was to see how additional grammatical structure
could be imposed on a [Markov chain][].

The main result is that, if the desired additional structure can be captured by
a [regular grammar][], then, observing that both the regular grammar and the
Markov chain can be thought of as [probabilistic finite automata][] (with the
probabilities in the regular grammar always being either 0% or 100%), we can
apply the standard [product construction][] to these two automata to obtain an
automaton that represents the language which is the intersection of the two
original languages.
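
To make the construction concrete, here is a minimal sketch of it in
Python (illustrative only -- this is not the code this repository uses,
and the data shapes and names are invented for the example):

    from collections import defaultdict

    # `chain` maps each token to its successor-frequency map (an order-1
    # Markov chain).  `dfa` maps (state, token) pairs to the state the
    # automaton enters after emitting that token; pairs absent from
    # `dfa` are forbidden (probability 0%).
    def product(chain, dfa):
        """Intersect the two automata: the result is a chain over
        (state, token) pairs, i.e. tokens tagged with DFA states."""
        tagged = defaultdict(dict)
        for (state, token), next_state in dfa.items():
            for next_token, freq in chain.get(token, {}).items():
                if (next_state, next_token) in dfa:
                    tagged[(state, token)][(next_state, next_token)] = freq
        return tagged

    chain = {'"': {'Anne': 1}, 'Anne': {'"': 2, 'Anne': 1}}
    dfa = {
        ('narration', '"'): 'dialogue',   # an opening quote enters dialogue
        ('dialogue', 'Anne'): 'dialogue',
        ('dialogue', '"'): 'narration',   # a closing quote leaves it
    }
    print(product(chain, dfa)[('narration', '"')])
    # {('dialogue', 'Anne'): 1}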

As a concrete example, if the Markov chain is obtained from analyzing
_Anne of Green Gables_, and the regular grammar is one which describes
some basic rules of punctuation in English writing, as shown below
(in the form of the corresponding [finite automaton][]):

![Abbreviated diagram of punctuation automaton](images/narration-dialogue-parenthetical.png?raw=true)

...then the resulting automaton is a Markov chain based on the word
frequencies in _Anne of Green Gables_, but restricted to producing
only those strings which correctly follow the given rules of punctuation.

We might call the automaton resulting from this construction a
"tagged Markov chain", because each token is tagged with the state of
the regular grammar in which it was encountered. The construction is
compatible with higher-order Markov chains; the tokens which precede a
token can be thought of as additional tags on that token. Conversely,
the state of the regular grammar could be thought of as a kind of
abstract, "long-distance" higher-order Markov chain.

Just as when a conventional higher-order Markov chain is constructed,
the result of this construction is itself a Markov chain; it retains
the [Markov property][]. This would not be the case if the grammar we
wished to combine with the Markov chain were not a regular grammar
(e.g. if it were a context-free grammar), as we would then need to
"remember" how many levels of nesting we had entered previously.

Implementation
--------------

The programs that comprise this generator are written in a style amenable
to use in shell pipelines. For normal usage, however, running them via
the supplied `build.py` script is recommended.

Scripts whose names start with `extract` gather usable information from some
corpus or human-readable source, such as a web page downloaded from
Project Gutenberg, and output it in an intermediate format.

Scripts whose names start with `xform` are transformer filters
which take a file in (usually on `stdin`) and produce another
file (usually on `stdout`); these files are both in intermediate
formats, usually the same intermediate format.

Scripts whose names start with `render` are filters which
take a file in intermediate format and produce a file in
some human-readable output format, such as Markdown.
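
For example, a drastically abbreviated pipeline can be run by hand
(this skips most of the `xform` cleanup passes that `build.py` runs,
so its output would be correspondingly rougher):

    ./extract-tokenstream.py 45-h.htm | ./xform-dedup.py | ./render-markdown.py > sample.md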

Some formats used by these tools are:

### HTML

Intended as an input format only, this is any sort of HTML that
BeautifulSoup can make sense of.

### Markdown

Intended as an output format only; as the final step of the pipeline,
the text is rendered as Markdown. (It can then be converted to
HTML, etc., by various other tools.)

### tokenstream

An intermediate format.

A text file in UTF-8 encoding, with LF line endings, and one
token per line. A token is a lexeme: a sequence of characters
that we treat as a "unit", such as a word or a bit of punctuation
which is not part of some other word.
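
For example (an invented illustration, not taken from the actual
output), the text `"Oh," said Anne.` might appear in a tokenstream as:

    "
    Oh
    ,
    "
    said
    Anne
    .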

A tokenstream is merely a special case of a tagged tokenstream
(see below) in which there are no tags on any tokens.

### tagged tokenstream

An intermediate format.

Like a tokenstream, but each token is preceded by zero or more
space-separated tags; each tag is a string of the form `A=B`,
where A and B can be any tokens which do not contain spaces
or `=`s.

Some meaningful tags are: `state`, the state of the automaton
representing the regular grammar; and `prev1`, the previous
token, used for constructing an order-2 Markov chain.

The set of tags in a tagged tokenstream is generally
homogeneous: every token in the stream has the same set of tags,
though potentially with different values for each tag.
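
For illustration (the tags shown are the real ones, the values are
invented), a fragment of a tagged tokenstream from an order-2 run
might look like:

    prev1=said state=narration Anne
    prev1=Anne state=narration .
    prev1=. state=narration "
    prev1=" state=dialogue Oh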

### Model JSON

An intermediate format.

A JSON file containing a map from tagged tokens to maps of
tagged tokens to frequencies.
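
For illustration (invented counts), a single entry in such a map
might look like:

    {
        "state=dialogue Oh": {
            "state=dialogue ,": 3,
            "state=dialogue Marilla": 1
        }
    }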

More Notes
----------

The generator here supports some capabilities that in the end
were not used in the generation of _Anne of Green Garbles_.

The `build.py` script supports being given multiple input
files (HTML documents), and builds the Markov chain model
as if they were a single long input text.

The `xform-tag-with-prev-tokens.py` filter implements an
order-2 Markov chain by tagging each token with the
token immediately preceding it. This was to confirm that
the intersection construction works with higher-order
Markov chains.
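
For instance, a hypothetical invocation exercising both of these
otherwise-unused capabilities (the file names here are invented):

    ./build.py rohmer insidious.htm golden-scorpion.htm --markov-order=2 --title "The Incoherent Dr. Fu Manchu"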

The reason _Anne of Green Garbles_ was produced using only
_Anne of Green Gables_ as its input text, with a Markov chain
of order 1, was aesthetic -- I found the contrast
of orderly dialogue/narration structure against the decidedly
gibberishy output of an order-1 chain on a single work to be
more striking. (More [Jess][]-esque, perhaps.)

In addition, the regular grammar used in this generator also
handles parenthetical remarks (as you can see from the diagram
above). This literary device is, however, only rarely used
in Lucy Maud Montgomery's works. Luckily for us, though, Sax
Rohmer wasn't shy about putting in a few parentheses here and
there. So, to demonstrate all three of these features, here is
an excerpt from _The Incoherent Dr. Fu Manchu_, which uses an
order-2 Markov chain on several of Sax Rohmer's works, some
of which do employ parenthetical phrases.

> It wanted only three minutes or more above the nauseating odor
> of burning Indian hemp; and despite the wide, so characteristic
> of the one to arrest with the marked changes (corresponding with
> phases of the one who watched him proved too potent for his
> elusive courage. He wrote on). This was still unreal to me, and
> began very slowly, “This marks a new toy it does make you
> understand?” he declared. “It was not until I realized that he
> hoped to lure us.”

[NaNoGenMo 2019]: https://github.com/NaNoGenMo/2019/
[Beautiful Soup]: https://www.crummy.com/software/BeautifulSoup/
[Markov chain]: https://en.wikipedia.org/wiki/Markov_chain
[Markov property]: https://en.wikipedia.org/wiki/Markov_property
[regular grammar]: https://en.wikipedia.org/wiki/Regular_grammar
[finite automaton]: https://en.wikipedia.org/wiki/Finite-state_machine#Mathematical_model
[probabilistic finite automata]: https://en.wikipedia.org/wiki/Probabilistic_automaton
[product construction]: https://en.wikipedia.org/wiki/File:Intersection1.png
[Jess]: https://whitney.org/collection/works/9517

build.py
@@ -0,0 +1,113 @@

#!/usr/bin/env python3.6

"""
Script to orchestrate building.
"""

from argparse import ArgumentParser
import sys
import os
from subprocess import run


OPTIONS = None


def zrun(targ, command, **kwargs):
    # Run `command` through the shell, skipping it entirely if its target
    # file `targ` already exists (a crude form of memoization).  All
    # commands have their stderr appended to {bucket}/stderr.log.
    if targ is not None:
        targ = targ.format(**kwargs)
        if os.path.exists(targ):
            return
    command += ' 2>>{bucket}/stderr.log'
    c = command.format(**kwargs)
    sys.stderr.write("*** {}\n".format(c))
    run(c, shell=True, check=True)


def process_pipeline(bucket, random_seed):
    try:
        os.mkdir(bucket)
    except FileExistsError:
        pass

    context = dict(
        bucket=bucket,
    )
    zrun(None, 'rm -f {bucket}/stderr.log', **context)
    if OPTIONS.clean_gen:
        zrun(None, 'rm -f {bucket}/out1.txt {bucket}/out2.txt {bucket}/novel.md {bucket}/novel.html', **context)

    # Extract, clean, and tag a tokenstream from each input file.
    instances = []
    for num, filename in enumerate(OPTIONS.filenames):
        print(filename)
        context = dict(
            filename=filename,
            bucket=bucket,
            num=str(num),
            fsm=OPTIONS.state_machine,
        )
        zrun('{bucket}/tokens_{num}_raw.txt', './extract-tokenstream.py "{filename}" > {bucket}/tokens_{num}_raw.txt', **context)
        zrun('{bucket}/tokens_{num}_elim.txt', './xform-eliminate.py "{filename}" < {bucket}/tokens_{num}_raw.txt > {bucket}/tokens_{num}_elim.txt', **context)
        zrun('{bucket}/tokens_{num}_deduped.txt', './xform-dedup.py < {bucket}/tokens_{num}_elim.txt > {bucket}/tokens_{num}_deduped.txt', **context)
        zrun('{bucket}/tokens_{num}_sentences.txt', './xform-fix-fullstops.py < {bucket}/tokens_{num}_deduped.txt > {bucket}/tokens_{num}_sentences.txt', **context)
        zrun('{bucket}/tokens_{num}_capitalized.txt', './xform-fix-capitalization.py < {bucket}/tokens_{num}_sentences.txt > {bucket}/tokens_{num}_capitalized.txt', **context)
        zrun('{bucket}/tokens_{num}_end-punctuation.txt', './xform-fix-end-punctuation.py < {bucket}/tokens_{num}_capitalized.txt > {bucket}/tokens_{num}_end-punctuation.txt', **context)
        zrun('{bucket}/tokens_{num}_contractions.txt', './xform-fix-apostrophes.py < {bucket}/tokens_{num}_end-punctuation.txt > {bucket}/tokens_{num}_contractions.txt', **context)
        zrun('{bucket}/tokens_{num}_singlequotes.txt', './xform-eliminate-singlequotes.py < {bucket}/tokens_{num}_contractions.txt > {bucket}/tokens_{num}_singlequotes.txt', **context)
        zrun('{bucket}/tokens_{num}_straightquotes.txt', './xform-fix-straightquotes.py < {bucket}/tokens_{num}_singlequotes.txt > {bucket}/tokens_{num}_straightquotes.txt', **context)
        zrun('{bucket}/tokens_{num}_clean.txt', './xform-fix-doublequotes.py < {bucket}/tokens_{num}_straightquotes.txt > {bucket}/tokens_{num}_clean.txt', **context)
        if OPTIONS.markov_order == 2:
            zrun('{bucket}/tokens_{num}_prev1_tagged.txt', './xform-tag-with-prev-tokens.py < {bucket}/tokens_{num}_clean.txt > {bucket}/tokens_{num}_prev1_tagged.txt', **context)
        elif OPTIONS.markov_order == 1:
            zrun('{bucket}/tokens_{num}_prev1_tagged.txt', 'cat < {bucket}/tokens_{num}_clean.txt > {bucket}/tokens_{num}_prev1_tagged.txt', **context)
        else:
            raise NotImplementedError(OPTIONS.markov_order)
        zrun('{bucket}/tokens_{num}_state_tagged.txt', './xform-tag-with-state.py --state-machine={fsm} < {bucket}/tokens_{num}_prev1_tagged.txt > {bucket}/tokens_{num}_state_tagged.txt', **context)
        instances.append(num)

    # Build one model from all the tagged tokenstreams, then generate.
    context = dict(
        bucket=bucket,
        instreams=' '.join(['{bucket}/tokens_{i}_state_tagged.txt'.format(bucket=bucket, i=i) for i in instances]),
        random_seed=str(random_seed),
        fsm=OPTIONS.state_machine,
        title=OPTIONS.title,
    )
    zrun('{bucket}/model.json', 'cat {instreams} | ./create-model.py > {bucket}/model.json', **context)
    zrun('{bucket}/out1.txt', './gen-from-model.py --chapter-count=30 --paragraph-count=35 --title="{title}" --random-seed={random_seed} --state-machine={fsm} {bucket}/model.json > {bucket}/out1.txt', **context)
    zrun('{bucket}/out2.txt', './xform-untag.py < {bucket}/out1.txt > {bucket}/out2.txt', **context)
    # (out3.txt is produced but not used below; novel.md is rendered
    # from out2.txt.)
    zrun('{bucket}/out3.txt', './xform-remove-short-sentences.py < {bucket}/out2.txt > {bucket}/out3.txt', **context)
    zrun('{bucket}/novel.md', './render-markdown.py < {bucket}/out2.txt > {bucket}/novel.md', **context)
    zrun(None, 'wc -w {bucket}/novel.md', **context)
    zrun('{bucket}/novel.html', 'markdown_py < {bucket}/novel.md > {bucket}/novel.html', **context)
    zrun(None, 'firefox {bucket}/novel.html &', **context)


def main(args):
    global OPTIONS

    argparser = ArgumentParser()
    argparser.add_argument('dirname', metavar='DIRNAME', type=str)
    argparser.add_argument('filenames', metavar='FILENAME', type=str, nargs='+')

    argparser.add_argument("--title", type=str, default="Generated Novel")
    argparser.add_argument("--state-machine", type=str, default='dialogue')
    argparser.add_argument("--random-seed", type=str, default='9009')
    argparser.add_argument("--clean-gen", action='store_true')
    argparser.add_argument("--markov-order", type=int, default=1)

    OPTIONS = argparser.parse_args(args)

    if OPTIONS.random_seed == 'random':
        import random
        random_seed = random.randint(0, 1000000)
        sys.stderr.write(">>> Random seed is: {}\n".format(random_seed))
    else:
        random_seed = int(OPTIONS.random_seed)

    process_pipeline(OPTIONS.dirname, random_seed)


if __name__ == '__main__':
    main(sys.argv[1:])

create-model.py
@@ -0,0 +1,35 @@

#!/usr/bin/env python3.6

"""
* input format: tagged tokenstream
* output format: Model JSON

Builds a Markov chain model given a (tagged) tokenstream.
"""

import json
import re
import sys


def main(argv):
    entries = {}
    last = None

    # Count, for each (tagged) token, how often each other (tagged)
    # token immediately follows it.  Tags are sorted so that equivalent
    # tag sets always produce the same key string.
    for line_no, line in enumerate(sys.stdin):
        components = line.strip().split(' ')
        word = components[-1]
        tags = ' '.join(sorted(components[:-1]))
        entry = '{} {}'.format(tags, word)
        if last is not None:
            m = entries.setdefault(last, {})
            m[entry] = m.get(entry, 0) + 1
        last = entry

    print(json.dumps(entries, sort_keys=True, indent=4))


if __name__ == '__main__':
    main(sys.argv)
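
A quick way to see the model format is to feed the script a short
untagged tokenstream; note the leading space in each key, which is
where the (empty) tag string lands for untagged tokens:

    $ printf 'the\ncat\nsat\nthe\ncat\n' | ./create-model.py
    {
        " cat": {
            " sat": 1
        },
        " sat": {
            " the": 1
        },
        " the": {
            " cat": 2
        }
    }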