Initial import of files for _Anne of Green Garbles_.
cpressey committed Nov 25, 2019
1 parent 1d1316d commit 1d4cdf2
Showing 28 changed files with 3,524 additions and 1 deletion.
4 changes: 4 additions & 0 deletions Anne of Green Garbles/.gitignore
@@ -0,0 +1,4 @@
*.json
*.txt
__pycache__/
aogg/
184 changes: 184 additions & 0 deletions Anne of Green Garbles/README.md
@@ -0,0 +1,184 @@
Anne of Green Garbles
=====================

Herein can be found the above-named generated novel, along with the
code for generating it, both of which were produced for [NaNoGenMo 2019][].

The generator is a collection of command-line programs
written in Python 3.6. The only Python dependency is
[Beautiful Soup][], used to extract the text from a source HTML file.

After downloading [the HTML version of _Anne of Green Gables_](http://www.gutenberg.org/files/45/45-h/45-h.htm)
from Project Gutenberg and saving it as `45-h.htm`, the novel
was generated by running the following command:

    ./build.py aogg 45-h.htm --random-seed=6010 --markov-order=1 --title "Anne of Green Garbles"

According to `wc -w`, the resulting novel,
_[Anne of Green Garbles](generated/Anne%20of%20Green%20Garbles.md)_,
comprises 54,638 words.

The software versions used were CPython 3.6.2, with
[beautifulsoup4](https://pypi.org/project/beautifulsoup4/) 4.6.0,
on Ubuntu 16.04.

Theory of Operation
-------------------

The main goal of this project was to see how additional grammatical structure
could be imposed on a [Markov chain][].

The main result is that, if the desired additional structure can be captured by
a [regular grammar][], then both the regular grammar and the
Markov chain can be thought of as [probabilistic finite automata][] (with the
probabilities in the regular grammar always being either 0% or 100%). We can
then apply the standard [product construction][] to these two automata,
obtaining an automaton whose language is the intersection of the
two original languages.

As a concrete example, if the Markov chain is obtained from analyzing
_Anne of Green Gables_, and the regular grammar is one which describes
some basic rules of punctuation in English writing, as shown below
(in the form of the corresponding [finite automaton][]):

![Abbreviated diagram of punctuation automaton](images/narration-dialogue-parenthetical.png?raw=true)

...then the resulting automaton is a Markov chain based on the word
frequencies in _Anne of Green Gables_ but restricted to producing
only those strings which correctly follow the given rules of punctuation.

We might call the automaton resulting from this construction a
"tagged Markov chain", because each token is tagged with the state of
the regular grammar in which it was encountered. The construction is
compatible with higher-order Markov chains; the tokens which precede a
token can be thought of as additional tags on that token. Conversely,
the state of the regular grammar could be thought of as a kind of
abstract, "long-distance" higher-order Markov chain.

Just like when a conventional higher-order Markov chain is constructed,
the result of this construction is itself a Markov chain; it retains
the [Markov property][]. This would not be the case if the grammar we
wished to combine with the Markov chain was not a regular grammar
(e.g. if it was a context-free grammar), as we would need to "remember"
how many levels of nesting we had entered previously.
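
For concreteness, here is a minimal sketch of the product construction
in Python. The chain and the state machine here are invented toys for
illustration; the actual generator keeps its data in the intermediate
formats described below, not in these structures.

    import random

    # Toy Markov chain: token -> {next token: frequency}.
    chain = {
        "Anne": {"said": 3, "(": 1},
        "said": {"Anne": 2, "(": 1},
        "(": {"sighing": 1},
        "sighing": {")": 1},
        ")": {"Anne": 1},
    }

    # Toy regular grammar as a DFA over tokens: parentheses must balance
    # and may not nest.  Returning None means the transition is forbidden
    # (probability 0%); any allowed transition has probability 100%.
    def dfa_step(state, token):
        if token == "(":
            return "paren" if state == "plain" else None
        if token == ")":
            return "plain" if state == "paren" else None
        return state

    # Product construction: from the pair (token, state), keep a Markov
    # transition only if the DFA also permits it, preserving the
    # chain's frequency as the transition's weight.
    def product_step(token, state):
        out = {}
        for nxt, freq in chain.get(token, {}).items():
            nxt_state = dfa_step(state, nxt)
            if nxt_state is not None:
                out[(nxt, nxt_state)] = freq
        return out

    # A weighted random walk over the product automaton.
    random.seed(6010)
    token, state = "Anne", "plain"
    for _ in range(8):
        moves = product_step(token, state)
        if not moves:
            break
        (token, state), = random.choices(list(moves), weights=list(moves.values()))
        print(token, end=" ")
    print()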

Implementation
--------------

The programs that comprise this generator are written in a style that is amenable
to usage in shell pipelines. However, for normal usage, running them via
the supplied `build.py` script is recommended.

Scripts whose names start with `extract` gather usable information from some
corpus or human-readable source, such as a web page downloaded from
Project Gutenberg, and output it in an intermediate format.

Scripts whose names start with `xform` are transformer filters
which take a file in (usually on `stdin`) and produce another
file (usually on `stdout`); these files are both in intermediate
formats, usually the same intermediate format.

Scripts whose names start with `render` are filters which
take a file in intermediate format and produce a file in
some human-readable output format, such as Markdown.
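
For example, here is a sketch of an abridged pipeline (omitting the
many cleanup `xform` passes that `build.py` actually runs, and assuming
default values for the generation options it normally passes explicitly):

    ./extract-tokenstream.py 45-h.htm \
        | ./xform-tag-with-state.py --state-machine=dialogue \
        | ./create-model.py > model.json
    ./gen-from-model.py --state-machine=dialogue model.json \
        | ./xform-untag.py \
        | ./render-markdown.py > novel.md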

Some formats used by these tools are:

### HTML

Intended as an input format only, this is any sort of HTML that
BeautifulSoup can make sense of.

### Markdown

Intended as an output format only; as the final step of the pipeline
the text is rendered as Markdown. (It can then be converted to
HTML, etc., by various other tools.)

### tokenstream

An intermediate format.

A text file in UTF-8 encoding, with LF line endings, with one
token per line. A token is a lexeme, a sequence of characters
that we treat as a "unit": a word or a bit of punctuation which
is not part of some other word.

A tokenstream is merely a special case of tagged tokenstream
(see below) where there are no tags on any tokens.
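
A hypothetical fragment of a tokenstream (the tokens are invented
for illustration):

    Anne
    said
    ,
    “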

### tagged tokenstream

An intermediate format.

Like a tokenstream, but each token is preceded by zero or
more space-separated tags; each tag is a string of the form `A=B`
where A and B can be any tokens which do not contain spaces
or `=`s.

Some meaningful tags are: `state`, the state of the automaton
representing the regular grammar; and `prev1`, the previous
token, for constructing an order-2 Markov chain.

The set of tags in a tagged tokenstream is generally
homogeneous: every token in the stream has the same set of tags,
only potentially with different values for each tag.
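
A hypothetical fragment of a tagged tokenstream, with the token
appearing last on each line (the tag values here are invented for
illustration):

    state=narration prev1=Anne said
    state=narration prev1=said ,
    state=dialogue prev1=, “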

### Model JSON

An intermediate format.

A JSON file containing a map from tagged tokens to maps of
tagged tokens to frequencies.
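
A hypothetical excerpt (the tags and counts are invented for
illustration), showing that "said" in the narration state was followed
twice by "Anne" in the narration state and once by an open-quote that
enters the dialogue state:

    {
        "state=narration said": {
            "state=narration Anne": 2,
            "state=dialogue \u201c": 1
        }
    }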

More Notes
----------

The generator here supports some capabilities that in the end
were not used in the generation of _Anne of Green Garbles_.

The `build.py` script supports being given multiple input
files (HTML documents), and builds the Markov chain model
as if they formed a single long input text.

The `xform-tag-with-prev-tokens.py` filter implements an
order-2 Markov chain, by tagging each token with the
token immediately preceding it. This was to confirm that
the intersection construction works with higher-order
Markov chains.

The reason _Anne of Green Garbles_ was produced using only
_Anne of Green Gables_ as its input text, with a Markov chain
of order 1, was aesthetic -- I found the contrast
of orderly dialogue/narration structure against the decidedly
gibberishy output of the order-1 chain on a single work to be
more striking. (More [Jess][]-esque, perhaps.)

In addition, the regular grammar used in this generator also
handles parenthetical remarks (as you can see from the diagram
above). This literary device is, however, used only rarely
in Lucy Maud Montgomery's works. Lucky for us though, Sax
Rohmer wasn't shy about putting in a few parentheses here and
there. So, to demonstrate all three of these features, here is
an excerpt from _The Incoherent Dr. Fu Manchu_, which uses an
order-2 Markov chain on several of Sax Rohmer's works, some
of which do employ parenthetical phrases.

> It wanted only three minutes or more above the nauseating odor
> of burning Indian hemp; and despite the wide, so characteristic
> of the one to arrest with the marked changes (corresponding with
> phases of the one who watched him proved too potent for his
> elusive courage. He wrote on). This was still unreal to me, and
> began very slowly, “This marks a new toy it does make you
> understand?” he declared. “It was not until I realized that he
> hoped to lure us.”

[NaNoGenMo 2019]: https://github.com/NaNoGenMo/2019/
[Beautiful Soup]: https://www.crummy.com/software/BeautifulSoup/
[Markov chain]: https://en.wikipedia.org/wiki/Markov_chain
[Markov property]: https://en.wikipedia.org/wiki/Markov_property
[regular grammar]: https://en.wikipedia.org/wiki/Regular_grammar
[finite automaton]: https://en.wikipedia.org/wiki/Finite-state_machine#Mathematical_model
[probabilistic finite automata]: https://en.wikipedia.org/wiki/Probabilistic_automaton
[product construction]: https://en.wikipedia.org/wiki/File:Intersection1.png
[Jess]: https://whitney.org/collection/works/9517
113 changes: 113 additions & 0 deletions Anne of Green Garbles/build.py
@@ -0,0 +1,113 @@
#!/usr/bin/env python3.6

"""
Script to orchestrate building.
"""

from argparse import ArgumentParser
import sys
import os
from subprocess import run


OPTIONS = None


def zrun(targ, command, **kwargs):
    # Skip this step if its target file already exists (crude make-style
    # memoization); pass targ=None to always run the command.
    if targ is not None:
        targ = targ.format(**kwargs)
        if os.path.exists(targ):
            return
    # Collect stderr from every step in a shared log inside the bucket.
    command += ' 2>>{bucket}/stderr.log'
    c = command.format(**kwargs)
    sys.stderr.write("*** {}\n".format(c))
    run(c, shell=True, check=True)


def process_pipeline(bucket, random_seed):
    try:
        os.mkdir(bucket)
    except FileExistsError:
        pass

    context = dict(
        bucket=bucket,
    )
    zrun(None, 'rm -f {bucket}/stderr.log', **context)
    if OPTIONS.clean_gen:
        zrun(None, 'rm -f {bucket}/out1.txt {bucket}/out2.txt {bucket}/novel.md {bucket}/novel.html', **context)

    instances = []
    for num, filename in enumerate(OPTIONS.filenames):
        print(filename)
        context = dict(
            filename=filename,
            bucket=bucket,
            num=str(num),
            fsm=OPTIONS.state_machine,
        )
        zrun('{bucket}/tokens_{num}_raw.txt', './extract-tokenstream.py "{filename}" > {bucket}/tokens_{num}_raw.txt', **context)
        zrun('{bucket}/tokens_{num}_elim.txt', './xform-eliminate.py "{filename}" < {bucket}/tokens_{num}_raw.txt > {bucket}/tokens_{num}_elim.txt', **context)
        zrun('{bucket}/tokens_{num}_deduped.txt', './xform-dedup.py < {bucket}/tokens_{num}_elim.txt > {bucket}/tokens_{num}_deduped.txt', **context)
        zrun('{bucket}/tokens_{num}_sentences.txt', './xform-fix-fullstops.py < {bucket}/tokens_{num}_deduped.txt > {bucket}/tokens_{num}_sentences.txt', **context)
        zrun('{bucket}/tokens_{num}_capitalized.txt', './xform-fix-capitalization.py < {bucket}/tokens_{num}_sentences.txt > {bucket}/tokens_{num}_capitalized.txt', **context)
        zrun('{bucket}/tokens_{num}_end-punctuation.txt', './xform-fix-end-punctuation.py < {bucket}/tokens_{num}_capitalized.txt > {bucket}/tokens_{num}_end-punctuation.txt', **context)
        zrun('{bucket}/tokens_{num}_contractions.txt', './xform-fix-apostrophes.py < {bucket}/tokens_{num}_end-punctuation.txt > {bucket}/tokens_{num}_contractions.txt', **context)
        zrun('{bucket}/tokens_{num}_singlequotes.txt', './xform-eliminate-singlequotes.py < {bucket}/tokens_{num}_contractions.txt > {bucket}/tokens_{num}_singlequotes.txt', **context)
        zrun('{bucket}/tokens_{num}_straightquotes.txt', './xform-fix-straightquotes.py < {bucket}/tokens_{num}_singlequotes.txt > {bucket}/tokens_{num}_straightquotes.txt', **context)
        zrun('{bucket}/tokens_{num}_clean.txt', './xform-fix-doublequotes.py < {bucket}/tokens_{num}_straightquotes.txt > {bucket}/tokens_{num}_clean.txt', **context)
        if OPTIONS.markov_order == 2:
            zrun('{bucket}/tokens_{num}_prev1_tagged.txt', './xform-tag-with-prev-tokens.py < {bucket}/tokens_{num}_clean.txt > {bucket}/tokens_{num}_prev1_tagged.txt', **context)
        elif OPTIONS.markov_order == 1:
            zrun('{bucket}/tokens_{num}_prev1_tagged.txt', 'cat < {bucket}/tokens_{num}_clean.txt > {bucket}/tokens_{num}_prev1_tagged.txt', **context)
        else:
            raise NotImplementedError(OPTIONS.markov_order)
        zrun('{bucket}/tokens_{num}_state_tagged.txt', './xform-tag-with-state.py --state-machine={fsm} < {bucket}/tokens_{num}_prev1_tagged.txt > {bucket}/tokens_{num}_state_tagged.txt', **context)
        instances.append(num)

    context = dict(
        bucket=bucket,
        instreams=' '.join(['{bucket}/tokens_{i}_state_tagged.txt'.format(bucket=bucket, i=i) for i in instances]),
        random_seed=str(random_seed),
        fsm=OPTIONS.state_machine,
        title=OPTIONS.title,
    )
    zrun('{bucket}/model.json', 'cat {instreams} | ./create-model.py > {bucket}/model.json', **context)
    zrun('{bucket}/out1.txt', './gen-from-model.py --chapter-count=30 --paragraph-count=35 --title="{title}" --random-seed={random_seed} --state-machine={fsm} {bucket}/model.json > {bucket}/out1.txt', **context)
    zrun('{bucket}/out2.txt', './xform-untag.py < {bucket}/out1.txt > {bucket}/out2.txt', **context)
    zrun('{bucket}/out3.txt', './xform-remove-short-sentences.py < {bucket}/out2.txt > {bucket}/out3.txt', **context)
    zrun('{bucket}/novel.md', './render-markdown.py < {bucket}/out2.txt > {bucket}/novel.md', **context)
    zrun(None, 'wc -w {bucket}/novel.md', **context)
    zrun('{bucket}/novel.html', 'markdown_py < {bucket}/novel.md > {bucket}/novel.html', **context)
    zrun(None, 'firefox {bucket}/novel.html &', **context)


def main(args):
    global OPTIONS

    argparser = ArgumentParser()
    argparser.add_argument('dirname', metavar='DIRNAME', type=str)
    argparser.add_argument('filenames', metavar='FILENAME', type=str, nargs='+')

    argparser.add_argument("--title", type=str, default="Generated Novel")
    argparser.add_argument("--state-machine", type=str, default='dialogue')
    argparser.add_argument("--random-seed", type=str, default='9009')
    argparser.add_argument("--clean-gen", action='store_true')
    argparser.add_argument("--markov-order", type=int, default=1)

    OPTIONS = argparser.parse_args(args)

    if OPTIONS.random_seed == 'random':
        import random
        random_seed = random.randint(0, 1000000)
        sys.stderr.write(">>> Random seed is: {}\n".format(random_seed))
    else:
        random_seed = int(OPTIONS.random_seed)

    process_pipeline(OPTIONS.dirname, random_seed)


if __name__ == '__main__':
    main(sys.argv[1:])
35 changes: 35 additions & 0 deletions Anne of Green Garbles/create-model.py
@@ -0,0 +1,35 @@
#!/usr/bin/env python3.6

"""
* input format: tagged tokenstream
* output format: Model JSON
Builds a Markov chain model given a (tagged) tokenstream.
"""

import json
import sys


def main(argv):
    # Map from each tagged token to a frequency map of its successors.
    entries = {}
    last = None

    for line in sys.stdin:
        # Each line is zero or more tags followed by the token itself.
        components = line.strip().split(' ')
        word = components[-1]
        tags = ' '.join(sorted(components[:-1]))
        entry = '{} {}'.format(tags, word)
        if last is not None:
            m = entries.setdefault(last, {})
            m[entry] = m.get(entry, 0) + 1
        last = entry

    print(json.dumps(entries, sort_keys=True, indent=4))


if __name__ == '__main__':
    main(sys.argv)