include-doc.lua is a Lua filter for Pandoc that includes the contents of external documents in a main document.
Consider the file master.html, which you can find in the test directory:
<html>
<head>
<title>Assembled document</title>
</head>
<body>
<h1>Title</h1>
<p>
This is a master document that includes some parts from other documents.
</p>
<div class="include-doc" data-include-format="html" data-include-src="chap1.html">
<p>This text will be replaced by the contents of "chap1.html".</p>
</div>
<div class="include-doc" data-include-format="json" data-include-src="chap2.json">
<p>This text will be replaced by the contents of "chap2.json".</p>
</div>
<div class="include-doc" data-include-format="markdown" data-include-src="chap3.md">
<p>This text will be replaced by the contents of "chap3.md".</p>
</div>
<div class="include-doc" data-include-format="markdown" data-include-src="chap4.md">
<p>This text will be replaced by the contents of "chap4.md".</p>
</div>
<div class="include-doc" data-include-format="markdown" data-include-src="chap5.md">
<p>This text will be replaced by the contents of "chap5.md".</p>
</div>
</body>
</html>
Convert it to markdown, filtering it with include-doc.lua, this way:
pandoc -f html -t markdown -s -L include-doc.lua master.html
(if you run it from the test directory, you may need to write -L ../src/include-doc.lua instead of -L include-doc.lua)
The resulting document will embed the contents of chap1.html, chap2.json, chap3.md, chap4.md (with its chap4s1.html and chap4s2.json sub-documents) and chap5.md.
include-doc.lua looks for Pandoc Div elements with an include-doc class.
When it finds such an element, it fetches the contents of the source specified in the data-include-src attribute and calls pandoc.read to read them in the format specified by the data-include-format attribute.
Then it replaces the contents of the Div element with the contents of the external source (in Pandoc terms, the Div blocks are replaced by the blocks of the imported document).
For the replacement to succeed, these elements are mandatory:
- the include-doc class (optional since version 0.3, see below)
- the include-format attribute (it's data-include-format in HTML; optional since version 0.3, see below)
- the include-src attribute (it's data-include-src in HTML)
If something goes wrong, e.g. the source can't be found or the format is invalid, the replacement does not happen and the Div element retains its contents.
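To make the three requirements concrete, here is a minimal Python sketch (not part of the filter, which is written in Lua) that scans a document in Pandoc's JSON AST form, as produced by pandoc -t json, and lists the Divs carrying all three mandatory elements. The function name find_inclusion_divs and the sample document are hypothetical.

```python
def find_inclusion_divs(node):
    """Yield (src, format) for every Div that has the include-doc class
    and both the include-format and include-src attributes."""
    if isinstance(node, list):
        for item in node:
            yield from find_inclusion_divs(item)
    elif isinstance(node, dict):
        if node.get("t") == "Div":
            # A Div is encoded as [[identifier, classes, key-value pairs], blocks].
            (_ident, classes, keyvals), _blocks = node["c"]
            attrs = dict(keyvals)
            if ("include-doc" in classes
                    and "include-format" in attrs
                    and "include-src" in attrs):
                yield attrs["include-src"], attrs["include-format"]
        # Keep walking, so nested Divs are visited too.
        for value in node.values():
            yield from find_inclusion_divs(value)


# Hypothetical sample: the second Div lacks include-format, so this
# sketch does not report it.
sample_doc = {
    "pandoc-api-version": [1, 23, 1],
    "meta": {},
    "blocks": [
        {"t": "Div", "c": [
            ["", ["include-doc"],
             [["include-format", "html"], ["include-src", "chap1.html"]]],
            [{"t": "Para", "c": [{"t": "Str", "c": "placeholder"}]}],
        ]},
        {"t": "Div", "c": [
            ["", ["include-doc"], [["include-src", "x.md"]]],
            [],
        ]},
    ],
}
inclusions = list(find_inclusion_divs(sample_doc))
```

The sketch only reports candidates; fetching the source and replacing the Div blocks is what the filter itself does through pandoc.read.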
You can specify inclusions in every format that lets you define a Pandoc Div with the include-doc class and the include-format and include-src attributes.
So the including documents are restricted to those formats.
The documents that can be included, on the other hand, are all the ones that can be imported in Pandoc.
The including Div block is kept; only its contents are replaced.
After a successful replacement:
- the included class is added to the Div's classes
- the include-sha1 attribute is added to the Div's attributes; its value is the SHA-1 of the imported document's contents (the attribute is not used now; future versions may use it to detect changes in the imported document)
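Since the include-sha1 value is described as the SHA-1 of the imported document contents, a change check built on it could look like the Python sketch below. It is only an illustration: it assumes the hash is computed over the UTF-8 bytes of the source text, which is not specified here, and sha1_of_text and has_changed are hypothetical names.

```python
import hashlib


def sha1_of_text(contents: str) -> str:
    """Return the hexadecimal SHA-1 digest of a text (assumption: UTF-8 bytes)."""
    return hashlib.sha1(contents.encode("utf-8")).hexdigest()


def has_changed(contents: str, recorded_sha1: str) -> bool:
    """Tell whether the current contents differ from the recorded digest."""
    return sha1_of_text(contents) != recorded_sha1
```

A tool could compare the digest stored in include-sha1 with the digest of the source as it is now, and re-run the assembly only when they differ.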
The filter is recursive, so the whole document structure can have an arbitrary depth.
It should also detect closed loops (circular references) in the document structure, e.g. document1 includes document2, document2 includes document3 and document3 includes document1, which would otherwise cause an infinite loop of inclusions. Once the filter detects a closed loop, it stops; the inclusions up to that point are still performed.
The filter does not mind if you include the same document more than once, as long as there are no closed loops.
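One common way to get this behaviour is to remember the chain of sources currently being expanded and refuse to re-enter any of them. The Python sketch below models that idea on a toy data structure; it is not the filter's actual Lua code, and expand, documents and the loop marker text are all hypothetical.

```python
def expand(src, documents, active=()):
    """Return the flattened parts of `src`, expanding nested inclusions.

    `documents` maps a source name to a list of parts, each being either
    ("text", string) or ("include", source_name).
    `active` is the chain of sources currently being expanded.
    """
    if src in active:
        # Closed loop: stop here instead of recursing forever.
        return [f"[loop detected at {src}]"]
    out = []
    for kind, value in documents[src]:
        if kind == "text":
            out.append(value)
        else:
            out.extend(expand(value, documents, active + (src,)))
    return out


# document1 -> document2 -> document3 -> document1 (closed loop)
looping = {
    "document1": [("text", "one"), ("include", "document2")],
    "document2": [("text", "two"), ("include", "document3")],
    "document3": [("text", "three"), ("include", "document1")],
}
# The same document included twice, with no loop.
repeated = {
    "main": [("include", "part"), ("include", "part")],
    "part": [("text", "shared")],
}
```

Because only the chain of ancestors is tracked, not every document seen so far, including the same document twice from different places is not flagged as a loop.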
Pandoc lets you apply more than one filter, so you may first apply the include-doc filter to assemble the whole document and then pass it to another filter.
The command:
pandoc -f html -t markdown -s -L include-doc.lua -L other-filter.lua -o whole-filtered-doc.md master.html
produces a whole-filtered-doc.md markdown document; master.html is the master document that specifies inclusions, and other-filter.lua is the filter to apply to the document once it's assembled.
This filter adds these metadata (they are all MetaString):
- root_id: the identifier of the root document
- root_format: the format of the root document
- root_src: the source of the root document
- root_sha1: the SHA-1 of the root document's contents
You can also include the sub-documents' metadata. There are three ways:
- adding the class include-meta to the Div elements used to include sub-documents
- adding include-sub-meta: true to the main document's metadata
- setting the include_sub_meta variable with --variable include_sub_meta in the pandoc command line
The first method lets you import metadata selectively for each sub-document.
The second one makes the filter store every sub-document's metadata in the resulting doc.
All the sub-documents' metadata are stored under the included-sub-meta key.
You need to specify the -s option of pandoc, otherwise you won't get any metadata (but this is not specific to this filter).
Here's an example (see master-include-all-meta.md in the test directory):
---
include-sub-meta: true
included-sub-meta:
- chap1:
    src: chap1.html
    title: Chapter 1
- chap2:
    src: chap2.json
    title: Chapter 2
- chap3:
    src: chap3.md
    title: Chapter 3
- chap4:
    src: chap4.md
    title: Fourth chapter
- chap5:
    src: chap5.md
    title: Chapter 5
- chap4s1:
    src: chap4s1.html
    title: Fourth chapter, section one
- chap4s2:
    src: chap4s2.json
    title: Fourth chapter, section two
title: Assembled document
---
# Title
This is a master document that includes some parts from other documents.
::: {.include-doc .included include-format="html" include-src="chap1.html" include-sha1="011822fbb02463dc05c2f35d8d7066f3ee320c5a" included-id="chap1"}
## Chapter 1
This is the first chapter.
:::
(the output is cut to show only the first part of the document)
As you can see, every sub-document's metadata is under a sub-key of included-sub-meta.
The sub-key is an identifier assigned by the filter to each imported document.
The same value is stored in the included-id attribute of the Div whose contents have been replaced by the sub-document.
The sub-document metadata are complemented with a src field reporting its source.
Since version 0.3, you can specify only the include-src attribute, and the Div will be considered an "inclusion Div", as if it had the include-doc class and the include-format attribute.
Unless you specify the format, it is guessed from the source path by calling from_path (available since version 3.1.2 of Pandoc).
If the format is not identified, the source contents will not be included in the output.
The include-doc class will be added, if not present.
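The decision described above can be modelled in a few lines. The following Python sketch is an illustration only, not the filter's Lua code: resolve_inclusion is a hypothetical name, and GUESSES is a deliberately tiny stand-in for Pandoc's own path-based format guessing, which covers many more extensions.

```python
import os

# Hypothetical, minimal stand-in for Pandoc's format guessing.
GUESSES = {".md": "markdown", ".html": "html", ".json": "json"}


def resolve_inclusion(attrs):
    """Return (src, format) for an inclusion Div, or None if it can't be resolved.

    `attrs` is a dict of the Div's attributes.
    """
    src = attrs.get("include-src")
    if not src:
        # Without include-src the Div is not an inclusion Div.
        return None
    # An explicit include-format wins; otherwise guess from the extension.
    fmt = attrs.get("include-format")
    if fmt is None:
        extension = os.path.splitext(src)[1].lower()
        fmt = GUESSES.get(extension)
    if fmt is None:
        # Format not identified: nothing gets included.
        return None
    return src, fmt
```

For example, a Div with only include-src="chap3.md" resolves to the markdown format in this sketch, while an unknown extension with no explicit include-format resolves to nothing.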
inclusion-tree.lua is a custom writer used to retrieve the tree structure of a document as JSON.
It's typically used in conjunction with include-doc.lua, e.g.:
pandoc -f markdown -t inclusion-tree.lua -L include-doc.lua document-including-other-docs.md
It outputs a JSON object like this:
{
"children": [
...
],
"format": "html",
"id": "root",
"sha1": "64b297779956f64503df8dccca191f76403462f0",
"src": "master.html"
}
The children field is an array of objects with the same format, id, sha1 and src fields.
If the main document's direct children include other documents as well, they'll have a children field; otherwise they won't.
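Because children is optional at every level, a consumer of this JSON has to treat a missing field as "no sub-documents". Here is a minimal Python sketch of such a traversal; iter_tree is a hypothetical name, and the sample tree below is made up for illustration (its sha1 values are placeholders, not real digests).

```python
import json


def iter_tree(node, depth=0):
    """Yield (depth, id, src) for a node and all of its descendants."""
    yield depth, node["id"], node["src"]
    # Leaf documents have no "children" field at all.
    for child in node.get("children", []):
        yield from iter_tree(child, depth + 1)


# Made-up sample in the shape described above.
sample_json = """
{
  "children": [
    {"format": "html", "id": "chap1", "sha1": "...", "src": "chap1.html"},
    {
      "children": [
        {"format": "html", "id": "chap4s1", "sha1": "...", "src": "chap4s1.html"}
      ],
      "format": "markdown", "id": "chap4", "sha1": "...", "src": "chap4.md"
    }
  ],
  "format": "html", "id": "root", "sha1": "...", "src": "master.html"
}
"""
tree = json.loads(sample_json)

# Print an indented outline of the inclusion tree.
for depth, doc_id, src in iter_tree(tree):
    print("  " * depth + f"{doc_id} ({src})")
```

In practice the JSON would come from the writer's output, for instance by redirecting the pandoc command above to a file and loading it with json.load.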
You can set the root document id with the -M root-id=... (or --metadata root-id=...) option, e.g.:
pandoc -f markdown -t inclusion-tree.lua -L include-doc.lua -M root-id=master-doc document-including-other-docs.md
You'll get a JSON object like this:
{
"children": [
...
],
"format": "markdown",
"id": "master-doc",
"sha1": "...",
"src": "document-including-other-docs.md"
}
The current version is 0.4.4 (2024, April 24th).
This software:
- is a filter for Pandoc;
- makes use of William Lupton's logging.lua module.