Initial import of files for _Anne of Green Garbles_.

Showing 28 changed files with 3,524 additions and 1 deletion.

.gitignore
@@ -0,0 +1,4 @@
*.json
*.txt
__pycache__/
aogg/

README.md
@@ -0,0 +1,184 @@

Anne of Green Garbles
=====================

Herein can be found the above-named generated novel, along with the
code for generating it, both of which were produced for [NaNoGenMo 2019][].

The generator is a collection of command-line programs
written in Python 3.6. The only Python dependency is
[Beautiful Soup][], used to extract the text from a source HTML file.

After downloading [the HTML version of _Anne of Green Gables_](http://www.gutenberg.org/files/45/45-h/45-h.htm)
from Project Gutenberg and saving it as `45-h.htm`, the novel
was generated by running the following command:

    ./build.py aogg 45-h.htm --random-seed=6010 --markov-order=1 --title "Anne of Green Garbles"

According to `wc -w`, the resulting novel,
_[Anne of Green Garbles](generated/Anne%20of%20Green%20Garbles.md)_,
comprises 54,638 words.

The software versions used were CPython 3.6.2, with
[beautifulsoup4](https://pypi.org/project/beautifulsoup4/) 4.6.0,
on Ubuntu 16.04.

Theory of Operation
-------------------

The main goal of this project was to see how additional grammatical structure
could be imposed on a [Markov chain][].

The main result is that, if the desired additional structure can be captured by
a [regular grammar][], then, observing that both the regular grammar and the
Markov chain can be thought of as [probabilistic finite automata][] (with the
probabilities in the regular grammar always being either 0% or 100%), we can
apply the standard [product construction][] to these two automata to obtain an
automaton that represents the language which is the intersection of the two
original languages.
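
To make the construction concrete, here is a minimal sketch of it in
Python (illustrative only -- this is not the code this repository uses,
and the data shapes and names are invented for the example):

    from collections import defaultdict

    # `chain` maps each token to its successor-frequency map (an order-1
    # Markov chain).  `dfa` maps (state, token) pairs to the state the
    # automaton enters after emitting that token; pairs absent from
    # `dfa` are forbidden (probability 0%).
    def product(chain, dfa):
        """Intersect the two automata: the result is a chain over
        (state, token) pairs, i.e. tokens tagged with DFA states."""
        tagged = defaultdict(dict)
        for (state, token), next_state in dfa.items():
            for next_token, freq in chain.get(token, {}).items():
                if (next_state, next_token) in dfa:
                    tagged[(state, token)][(next_state, next_token)] = freq
        return tagged

    chain = {'"': {'Anne': 1}, 'Anne': {'"': 2, 'Anne': 1}}
    dfa = {
        ('narration', '"'): 'dialogue',   # an opening quote enters dialogue
        ('dialogue', 'Anne'): 'dialogue',
        ('dialogue', '"'): 'narration',   # a closing quote leaves it
    }
    print(product(chain, dfa)[('narration', '"')])
    # {('dialogue', 'Anne'): 1}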

As a concrete example, if the Markov chain is obtained from analyzing
_Anne of Green Gables_, and the regular grammar is one which describes
some basic rules of punctuation in English writing, as shown below
(in the form of the corresponding [finite automaton][]):

![Abbreviated diagram of punctuation automaton](images/narration-dialogue-parenthetical.png?raw=true)

...then the resulting automaton is a Markov chain based on the word
frequencies in _Anne of Green Gables_, but restricted to producing
only those strings which correctly follow the given rules of punctuation.

We might call the automaton resulting from this construction a
"tagged Markov chain", because each token is tagged with the state of
the regular grammar in which it was encountered. The construction is
compatible with higher-order Markov chains; the tokens which precede a
token can be thought of as additional tags on that token. Conversely,
the state of the regular grammar could be thought of as a kind of
abstract, "long-distance" higher-order Markov chain.

Just as when a conventional higher-order Markov chain is constructed,
the result of this construction is itself a Markov chain; it retains
the [Markov property][]. This would not be the case if the grammar we
wished to combine with the Markov chain were not a regular grammar
(e.g. if it were a context-free grammar), as we would then need to
"remember" how many levels of nesting we had entered previously.

Implementation
--------------

The programs that comprise this generator are written in a style amenable
to use in shell pipelines. For normal usage, however, running them via
the supplied `build.py` script is recommended.

Scripts whose names start with `extract` gather usable information from some
corpus or human-readable source, such as a web page downloaded from
Project Gutenberg, and output it in an intermediate format.

Scripts whose names start with `xform` are transformer filters
which take a file in (usually on `stdin`) and produce another
file (usually on `stdout`); these files are both in intermediate
formats, usually the same intermediate format.

Scripts whose names start with `render` are filters which
take a file in intermediate format and produce a file in
some human-readable output format, such as Markdown.
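
For example, a drastically abbreviated pipeline can be run by hand
(this skips most of the `xform` cleanup passes that `build.py` runs,
so its output would be correspondingly rougher):

    ./extract-tokenstream.py 45-h.htm | ./xform-dedup.py | ./render-markdown.py > sample.md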

Some formats used by these tools are:

### HTML

Intended as an input format only, this is any sort of HTML that
BeautifulSoup can make sense of.

### Markdown

Intended as an output format only; as the final step of the pipeline,
the text is rendered as Markdown. (It can then be converted to
HTML, etc., by various other tools.)

### tokenstream

An intermediate format.

A text file in UTF-8 encoding, with LF line endings, and one
token per line. A token is a lexeme: a sequence of characters
that we treat as a "unit", such as a word or a bit of punctuation
which is not part of some other word.
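
For example (an invented illustration, not taken from the actual
output), the text `"Oh," said Anne.` might appear in a tokenstream as:

    "
    Oh
    ,
    "
    said
    Anne
    .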

A tokenstream is merely a special case of a tagged tokenstream
(see below) in which there are no tags on any tokens.

### tagged tokenstream

An intermediate format.

Like a tokenstream, but each token is preceded by zero or more
space-separated tags; each tag is a string of the form `A=B`,
where A and B can be any tokens which do not contain spaces
or `=`s.

Some meaningful tags are: `state`, the state of the automaton
representing the regular grammar; and `prev1`, the previous
token, used for constructing an order-2 Markov chain.

The set of tags in a tagged tokenstream is generally
homogeneous: every token in the stream has the same set of tags,
though potentially with different values for each tag.
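
For illustration (the tags shown are the real ones, the values are
invented), a fragment of a tagged tokenstream from an order-2 run
might look like:

    prev1=said state=narration Anne
    prev1=Anne state=narration .
    prev1=. state=narration "
    prev1=" state=dialogue Oh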

### Model JSON

An intermediate format.

A JSON file containing a map from tagged tokens to maps of
tagged tokens to frequencies.
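
For illustration (invented counts), a single entry in such a map
might look like:

    {
        "state=dialogue Oh": {
            "state=dialogue ,": 3,
            "state=dialogue Marilla": 1
        }
    }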

More Notes
----------

The generator here supports some capabilities that in the end
were not used in the generation of _Anne of Green Garbles_.

The `build.py` script supports being given multiple input
files (HTML documents), and builds the Markov chain model
as if they were a single long input text.

The `xform-tag-with-prev-tokens.py` filter implements an
order-2 Markov chain by tagging each token with the
token immediately preceding it. This was to confirm that
the intersection construction works with higher-order
Markov chains.
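
For instance, a hypothetical invocation exercising both of these
otherwise-unused capabilities (the file names here are invented):

    ./build.py rohmer insidious.htm golden-scorpion.htm --markov-order=2 --title "The Incoherent Dr. Fu Manchu"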

The reason _Anne of Green Garbles_ was produced using only
_Anne of Green Gables_ as its input text, with a Markov chain
of order 1, was aesthetic -- I found the contrast
of orderly dialogue/narration structure against the decidedly
gibberishy output of an order-1 chain on a single work to be
more striking. (More [Jess][]-esque, perhaps.)

In addition, the regular grammar used in this generator also
handles parenthetical remarks (as you can see from the diagram
above). This literary device is, however, only rarely used
in Lucy Maud Montgomery's works. Luckily for us, though, Sax
Rohmer wasn't shy about putting in a few parentheses here and
there. So, to demonstrate all three of these features, here is
an excerpt from _The Incoherent Dr. Fu Manchu_, which uses an
order-2 Markov chain on several of Sax Rohmer's works, some
of which do employ parenthetical phrases.

> It wanted only three minutes or more above the nauseating odor
> of burning Indian hemp; and despite the wide, so characteristic
> of the one to arrest with the marked changes (corresponding with
> phases of the one who watched him proved too potent for his
> elusive courage. He wrote on). This was still unreal to me, and
> began very slowly, “This marks a new toy it does make you
> understand?” he declared. “It was not until I realized that he
> hoped to lure us.”

[NaNoGenMo 2019]: https://github.com/NaNoGenMo/2019/
[Beautiful Soup]: https://www.crummy.com/software/BeautifulSoup/
[Markov chain]: https://en.wikipedia.org/wiki/Markov_chain
[Markov property]: https://en.wikipedia.org/wiki/Markov_property
[regular grammar]: https://en.wikipedia.org/wiki/Regular_grammar
[finite automaton]: https://en.wikipedia.org/wiki/Finite-state_machine#Mathematical_model
[probabilistic finite automata]: https://en.wikipedia.org/wiki/Probabilistic_automaton
[product construction]: https://en.wikipedia.org/wiki/File:Intersection1.png
[Jess]: https://whitney.org/collection/works/9517

build.py
@@ -0,0 +1,113 @@

#!/usr/bin/env python3.6

"""
Script to orchestrate building.
"""

from argparse import ArgumentParser
import sys
import os
from subprocess import run


OPTIONS = None


def zrun(targ, command, **kwargs):
    # Run `command` through the shell, skipping it entirely if its target
    # file `targ` already exists (a crude form of memoization).  All
    # commands have their stderr appended to {bucket}/stderr.log.
    if targ is not None:
        targ = targ.format(**kwargs)
        if os.path.exists(targ):
            return
    command += ' 2>>{bucket}/stderr.log'
    c = command.format(**kwargs)
    sys.stderr.write("*** {}\n".format(c))
    run(c, shell=True, check=True)


def process_pipeline(bucket, random_seed):
    try:
        os.mkdir(bucket)
    except FileExistsError:
        pass

    context = dict(
        bucket=bucket,
    )
    zrun(None, 'rm -f {bucket}/stderr.log', **context)
    if OPTIONS.clean_gen:
        zrun(None, 'rm -f {bucket}/out1.txt {bucket}/out2.txt {bucket}/novel.md {bucket}/novel.html', **context)

    # Extract, clean, and tag a tokenstream from each input file.
    instances = []
    for num, filename in enumerate(OPTIONS.filenames):
        print(filename)
        context = dict(
            filename=filename,
            bucket=bucket,
            num=str(num),
            fsm=OPTIONS.state_machine,
        )
        zrun('{bucket}/tokens_{num}_raw.txt', './extract-tokenstream.py "{filename}" > {bucket}/tokens_{num}_raw.txt', **context)
        zrun('{bucket}/tokens_{num}_elim.txt', './xform-eliminate.py "{filename}" < {bucket}/tokens_{num}_raw.txt > {bucket}/tokens_{num}_elim.txt', **context)
        zrun('{bucket}/tokens_{num}_deduped.txt', './xform-dedup.py < {bucket}/tokens_{num}_elim.txt > {bucket}/tokens_{num}_deduped.txt', **context)
        zrun('{bucket}/tokens_{num}_sentences.txt', './xform-fix-fullstops.py < {bucket}/tokens_{num}_deduped.txt > {bucket}/tokens_{num}_sentences.txt', **context)
        zrun('{bucket}/tokens_{num}_capitalized.txt', './xform-fix-capitalization.py < {bucket}/tokens_{num}_sentences.txt > {bucket}/tokens_{num}_capitalized.txt', **context)
        zrun('{bucket}/tokens_{num}_end-punctuation.txt', './xform-fix-end-punctuation.py < {bucket}/tokens_{num}_capitalized.txt > {bucket}/tokens_{num}_end-punctuation.txt', **context)
        zrun('{bucket}/tokens_{num}_contractions.txt', './xform-fix-apostrophes.py < {bucket}/tokens_{num}_end-punctuation.txt > {bucket}/tokens_{num}_contractions.txt', **context)
        zrun('{bucket}/tokens_{num}_singlequotes.txt', './xform-eliminate-singlequotes.py < {bucket}/tokens_{num}_contractions.txt > {bucket}/tokens_{num}_singlequotes.txt', **context)
        zrun('{bucket}/tokens_{num}_straightquotes.txt', './xform-fix-straightquotes.py < {bucket}/tokens_{num}_singlequotes.txt > {bucket}/tokens_{num}_straightquotes.txt', **context)
        zrun('{bucket}/tokens_{num}_clean.txt', './xform-fix-doublequotes.py < {bucket}/tokens_{num}_straightquotes.txt > {bucket}/tokens_{num}_clean.txt', **context)
        if OPTIONS.markov_order == 2:
            zrun('{bucket}/tokens_{num}_prev1_tagged.txt', './xform-tag-with-prev-tokens.py < {bucket}/tokens_{num}_clean.txt > {bucket}/tokens_{num}_prev1_tagged.txt', **context)
        elif OPTIONS.markov_order == 1:
            zrun('{bucket}/tokens_{num}_prev1_tagged.txt', 'cat < {bucket}/tokens_{num}_clean.txt > {bucket}/tokens_{num}_prev1_tagged.txt', **context)
        else:
            raise NotImplementedError(OPTIONS.markov_order)
        zrun('{bucket}/tokens_{num}_state_tagged.txt', './xform-tag-with-state.py --state-machine={fsm} < {bucket}/tokens_{num}_prev1_tagged.txt > {bucket}/tokens_{num}_state_tagged.txt', **context)
        instances.append(num)

    # Build one model from all the tagged tokenstreams, then generate.
    context = dict(
        bucket=bucket,
        instreams=' '.join(['{bucket}/tokens_{i}_state_tagged.txt'.format(bucket=bucket, i=i) for i in instances]),
        random_seed=str(random_seed),
        fsm=OPTIONS.state_machine,
        title=OPTIONS.title,
    )
    zrun('{bucket}/model.json', 'cat {instreams} | ./create-model.py > {bucket}/model.json', **context)
    zrun('{bucket}/out1.txt', './gen-from-model.py --chapter-count=30 --paragraph-count=35 --title="{title}" --random-seed={random_seed} --state-machine={fsm} {bucket}/model.json > {bucket}/out1.txt', **context)
    zrun('{bucket}/out2.txt', './xform-untag.py < {bucket}/out1.txt > {bucket}/out2.txt', **context)
    # (out3.txt is produced but not used below; novel.md is rendered
    # from out2.txt.)
    zrun('{bucket}/out3.txt', './xform-remove-short-sentences.py < {bucket}/out2.txt > {bucket}/out3.txt', **context)
    zrun('{bucket}/novel.md', './render-markdown.py < {bucket}/out2.txt > {bucket}/novel.md', **context)
    zrun(None, 'wc -w {bucket}/novel.md', **context)
    zrun('{bucket}/novel.html', 'markdown_py < {bucket}/novel.md > {bucket}/novel.html', **context)
    zrun(None, 'firefox {bucket}/novel.html &', **context)


def main(args):
    global OPTIONS

    argparser = ArgumentParser()
    argparser.add_argument('dirname', metavar='DIRNAME', type=str)
    argparser.add_argument('filenames', metavar='FILENAME', type=str, nargs='+')

    argparser.add_argument("--title", type=str, default="Generated Novel")
    argparser.add_argument("--state-machine", type=str, default='dialogue')
    argparser.add_argument("--random-seed", type=str, default='9009')
    argparser.add_argument("--clean-gen", action='store_true')
    argparser.add_argument("--markov-order", type=int, default=1)

    OPTIONS = argparser.parse_args(args)

    if OPTIONS.random_seed == 'random':
        import random
        random_seed = random.randint(0, 1000000)
        sys.stderr.write(">>> Random seed is: {}\n".format(random_seed))
    else:
        random_seed = int(OPTIONS.random_seed)

    process_pipeline(OPTIONS.dirname, random_seed)


if __name__ == '__main__':
    main(sys.argv[1:])

create-model.py
@@ -0,0 +1,35 @@

#!/usr/bin/env python3.6

"""
* input format: tagged tokenstream
* output format: Model JSON

Builds a Markov chain model given a (tagged) tokenstream.
"""

import json
import re
import sys


def main(argv):
    entries = {}
    last = None

    # Count, for each (tagged) token, how often each other (tagged)
    # token immediately follows it.  Tags are sorted so that equivalent
    # tag sets always produce the same key string.
    for line_no, line in enumerate(sys.stdin):
        components = line.strip().split(' ')
        word = components[-1]
        tags = ' '.join(sorted(components[:-1]))
        entry = '{} {}'.format(tags, word)
        if last is not None:
            m = entries.setdefault(last, {})
            m[entry] = m.get(entry, 0) + 1
        last = entry

    print(json.dumps(entries, sort_keys=True, indent=4))


if __name__ == '__main__':
    main(sys.argv)
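
A quick way to see the model format is to feed the script a short
untagged tokenstream; note the leading space in each key, which is
where the (empty) tag string lands for untagged tokens:

    $ printf 'the\ncat\nsat\nthe\ncat\n' | ./create-model.py
    {
        " cat": {
            " sat": 1
        },
        " sat": {
            " the": 1
        },
        " the": {
            " cat": 2
        }
    }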