pandoc_filters
When converting a document with pandoc, filters can be used to transform elements of the input before the output is written. For example, rmarkdown within R uses filters to insert page breaks and create custom blocks.
The rmarkdown package provides an example of how Lua filters can be used:
vignette("lua-filters", package = "rmarkdown")
Pandoc also ships an example Lua file (a sample custom writer) that can be used out of the box in a call to pandoc or adapted for your own purposes:
pandoc --print-default-data-file sample.lua
Another example that may be more familiar is pandoc-citeproc, which is added to a typical command-line call to pandoc to create references within a document. Converting citation tags to the appropriate reference notation, given their type and location, is done with a filter. This filter has simply been used so heavily and refined so often that it has very few bugs, and users know it better than other filters.
When converting documents, pandoc builds an abstract syntax tree (AST), which can be written out as JSON. For example, a markdown document called ast.md with a single header and a single line of text will have one header and one paragraph in the AST. You can see this structure using JSON output, e.g.,
pandoc -f markdown -t json -o ast.json ast.md
and the resulting JSON file can be read into R using
xfun::tree(jsonlite::fromJSON('ast.json', simplifyVector = FALSE))
In R, the AST is represented as a recursive list. Knowing the AST helps when writing filters because filters modify the AST. Manipulating the AST is more robust than using regular expressions because it works on structured data rather than raw text. For example, a regex that searches for a hash symbol can match one in the middle of a sentence as well as one that marks a header, whereas a filter that targets header elements only sees headers, no matter where else the hash symbol occurs in the document.
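The difference between matching text and matching structure can be sketched in Python. The JSON string below is a hand-written stand-in for what `pandoc -t json` would produce for a small document (an assumption for illustration, not captured output), and `count_blocks` is a hypothetical helper.

```python
import json
import re

# Hand-written stand-in for the JSON AST of a document with one header and
# one paragraph that happens to contain a hash symbol (illustrative only).
AST_JSON = """
{"pandoc-api-version": [1, 23, 1],
 "meta": {},
 "blocks": [
   {"t": "Header", "c": [1, ["title", [], []], [{"t": "Str", "c": "Title"}]]},
   {"t": "Para", "c": [{"t": "Str", "c": "Use"}, {"t": "Space"},
                       {"t": "Str", "c": "#"}, {"t": "Space"},
                       {"t": "Str", "c": "sparingly"}]}
 ]}
"""

# The markdown source that the AST above is meant to correspond to.
MARKDOWN = "# Title\n\nUse # sparingly\n"


def count_blocks(ast, label):
    """Count top-level blocks whose "t" label equals `label`."""
    return sum(1 for block in ast["blocks"] if block["t"] == label)


ast = json.loads(AST_JSON)
headers_in_ast = count_blocks(ast, "Header")       # structure-based count
hashes_by_regex = len(re.findall(r"#", MARKDOWN))  # naive text-based count
print(headers_in_ast, hashes_by_regex)
```

The structural count only sees the element labelled "Header", while the naive pattern also matches the hash symbol inside the paragraph.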
Each element of the AST has a label. For example, a table is labelled "Table". Another common label is "RawBlock", whose structure has two elements: the format and the content (exposed in Lua filters as the fields format and text). Typical formats of a RawBlock include "latex" and "html". For more details about the structures available within the AST, see the AST pandoc documentation.
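As a rough illustration of these labels, the Python sketch below pulls the format and content out of every RawBlock in a list of blocks. The blocks are written by hand in the shape of pandoc's JSON AST (label under "t", contents under "c"), and `raw_blocks` is a hypothetical helper.

```python
import json

# Hand-written blocks in the shape of pandoc's JSON AST (illustrative only).
# A raw string is used so that the JSON escape \\ survives until json.loads.
BLOCKS_JSON = r"""
[
  {"t": "RawBlock", "c": ["latex", "\\newpage"]},
  {"t": "Para", "c": [{"t": "Str", "c": "Hello"}]},
  {"t": "RawBlock", "c": ["html", "<hr>"]}
]
"""


def raw_blocks(blocks):
    """Return (format, content) pairs for every RawBlock in `blocks`."""
    return [tuple(b["c"]) for b in blocks if b["t"] == "RawBlock"]


pairs = raw_blocks(json.loads(BLOCKS_JSON))
for fmt, content in pairs:
    print(fmt, content)
```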
Pandoc has built-in functions to manipulate these structures. For example, is_block will return TRUE if the structure is a block element.
Pandoc uses a scripting language called Lua for its filters. Because filters act on the AST, they apply to any output format, not only LaTeX. Lua is used within pandoc because it is fast and portable. Lua is distributed as a small package that builds out of the box on any platform with a standard C compiler, and it even runs on mobile devices. More notably, Lua can be embedded within applications and integrates well with other programming languages.
Everything needed to run filters comes with the installation of pandoc, but more functionality is available if you install Lua on your system yourself. For example, if your filter has syntax errors, pandoc will not help you fix them, but Lua will. Of the two commands below, the first reports the errors, whereas the second does not show these helpful messages.
lua sample.lua
pandoc -t sample.lua
Pandoc filters contain functions written in Lua that manipulate parts of the document through the AST. Each main function is therefore named after an element label available in the AST. For example, a function called Para will manipulate paragraphs and a function called Header will manipulate headers.
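The naming convention amounts to dispatching on the element label. The Python sketch below mimics that idea over blocks in the shape of pandoc's JSON AST; it is not a Lua filter, and the names `header`, `para`, `HANDLERS`, and `apply_filter` are hypothetical. In an actual Lua filter, pandoc performs this dispatch for functions named Header and Para.

```python
def header(block):
    """Demote a header by one level; "c" holds [level, attr, inlines]."""
    level, attr, inlines = block["c"]
    return {"t": "Header", "c": [level + 1, attr, inlines]}


def para(block):
    """Upper-case every Str inline in a paragraph; keep other inlines as-is."""
    inlines = [
        {"t": "Str", "c": inline["c"].upper()} if inline["t"] == "Str" else inline
        for inline in block["c"]
    ]
    return {"t": "Para", "c": inlines}


# One handler per element label, mirroring functions named after AST elements.
HANDLERS = {"Header": header, "Para": para}


def apply_filter(blocks):
    """Run the handler matching each block's label; pass other blocks through."""
    return [HANDLERS.get(b["t"], lambda unchanged: unchanged)(b) for b in blocks]


# Hand-written example blocks (illustrative only).
blocks = [
    {"t": "Header", "c": [1, ["title", [], []], [{"t": "Str", "c": "Title"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "some"}, {"t": "Space"},
                        {"t": "Str", "c": "text"}]},
    {"t": "HorizontalRule"},
]
filtered = apply_filter(blocks)
print(filtered)
```

Blocks without a matching handler, such as the horizontal rule here, pass through unchanged.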
Read this article for more information on why one should use Lua.