HTML processing filters and utilities. This module is a small framework for defining CSS-based content filters and applying them to user-provided content.
Although this project was started at GitHub, they no longer use it. This gem must be considered standalone and independent from GitHub.
Add this line to your application's Gemfile:
gem 'html-pipeline'
And then execute:
$ bundle
Or install it yourself:
$ gem install html-pipeline
This library provides a handful of chainable HTML filters to transform user content into HTML markup. Each filter does some work, and then hands off the results to the next filter. A pipeline has several kinds of filters available to use:
- Multiple TextFilters, which operate on a UTF-8 string
- A ConvertFilter, which turns text into HTML (e.g., Commonmark/Asciidoc -> HTML)
- A SanitizationFilter, which removes dangerous/unwanted HTML elements and attributes
- Multiple NodeFilters, which operate on a UTF-8 HTML document
You can assemble each sequence into a single pipeline, or choose to call each filter individually.
As an example, suppose we want to transform Commonmark source text into Markdown HTML:
Hey there, @gjtorikian
With the content, we also want to:
- change every instance of Hey to Hello
- strip undesired HTML
- linkify @mention
We can construct a pipeline to do all that like this:
require 'html_pipeline'

class HelloJohnnyFilter < HTMLPipeline::TextFilter
  def call
    text.gsub("Hey", "Hello")
  end
end

pipeline = HTMLPipeline.new(
  text_filters: [HelloJohnnyFilter.new],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  # note: next line is not needed as sanitization occurs by default;
  # see below for more info
  sanitization_config: HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG,
  node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text) # recommended: can call pipeline over and over
Filters can be custom ones you create (like HelloJohnnyFilter), and HTMLPipeline additionally provides several helpful ones (detailed below). If you only need a single filter, you can call one individually, too:
filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new
filter.call(text)
Filters combine into a sequential pipeline, and each filter hands its output to the next filter's input. Text filters are processed first, then the convert filter, sanitization filter, and finally, the node filters.
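The ordering described above can be illustrated with a small, gem-free sketch. Everything in it is a hypothetical stand-in (the `stage` helper, the lambdas, and the pass-through "sanitization" step are not part of HTMLPipeline); it only shows that each stage receives the previous stage's output, in the order text filters, convert filter, sanitization, node filters.

```ruby
# Hypothetical helper: wraps a block in a lambda that records its name
# in `trace` before doing its work. Not part of html-pipeline.
def stage(name, trace, &work)
  ->(value) { trace << name; work.call(value) }
end

trace = []

text_filters = [stage(:text_filter, trace) { |text| text.gsub("Hey", "Hello") }]
convert      = stage(:convert_filter, trace) { |text| "<p>#{text}</p>" }
sanitize     = stage(:sanitization, trace) { |html| html } # placeholder: performs no real sanitizing
node_filters = [stage(:node_filter, trace) { |html| html.gsub(/@(\w+)/, '<a href="/\1">@\1</a>') }]

# Run every stage in order, feeding each output into the next stage.
stages = text_filters + [convert, sanitize] + node_filters
output = stages.reduce("Hey there, @gjtorikian") { |value, s| s.call(value) }

puts trace.inspect
puts output
```

The real library parses and rewrites HTML rather than using string substitution; the sketch is only meant to make the hand-off order concrete.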
Some filters take optional context and/or result hash(es). These are used to pass around arguments and metadata between filters in a pipeline. For example, if you want to disable footnotes in the MarkdownFilter, you can pass an option in the context hash:
context = { markdown: { extensions: { footnotes: false } } }
filter = HTMLPipeline::ConvertFilter::MarkdownFilter.new(context: context)
filter.call("Hi **world**!")
Alternatively, you can construct a pipeline, and pass in a context during the call:
pipeline = HTMLPipeline.new(
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [HTMLPipeline::NodeFilter::MentionFilter.new]
)
pipeline.call(user_supplied_text, context: { markdown: { extensions: { footnotes: false } } })
Please refer to the documentation for each filter to understand what configuration options are available.
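Because the context is a plain nested Hash, a custom filter can read options from it defensively. The sketch below is an illustration only: the `option` helper and its `default:` fallback are hypothetical, and the defaults the bundled filters actually apply may differ.

```ruby
# Hypothetical helper: look up a nested option in a context hash,
# falling back to `default` when any level of the path is missing.
def option(context, *path, default: nil)
  value = context.dig(*path)
  value.nil? ? default : value
end

context = { markdown: { extensions: { footnotes: false } } }

explicit = option(context, :markdown, :extensions, :footnotes, default: true)
fallback = option({}, :markdown, :extensions, :footnotes, default: true)

puts explicit # the value set in the context hash
puts fallback # the assumed default, since the key is absent
```

Checking for `nil` rather than truthiness matters here, since `false` is a legitimate option value.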
Different pipelines can be defined for different parts of an app. Here are a few paraphrased snippets to get you started:
# The context hash is how you pass options between different filters.
# See individual filter source for explanation of options.
context = {
  asset_root: "http://your-domain.com/where/your/images/live/icons",
  base_url: "http://your-domain.com"
}

# Pipeline used for user provided content on the web
MarkdownPipeline = HTMLPipeline.new(
  text_filters: [HTMLPipeline::TextFilter::ImageFilter.new],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  node_filters: [
    HTMLPipeline::NodeFilter::HttpsFilter.new,
    HTMLPipeline::NodeFilter::MentionFilter.new
  ],
  context: context
)
# Pipelines aren't limited to the web. You can use them for email
# processing also.
HtmlEmailPipeline = HTMLPipeline.new(
  text_filters: [
    HTMLPipeline::TextFilter::PlainTextInputFilter.new,
    HTMLPipeline::TextFilter::ImageFilter.new
  ]
)
TextFilters must define a method named call which is called on the text. @text, @config, and @result are available to use, and any changes made to these ivars are passed on to the next filter.
- ImageFilter - converts an image url into an <img> tag
- PlainTextInputFilter - HTML-escapes text and wraps the result in a <div>
The ConvertFilter takes text and turns it into HTML. @text, @config, and @result are available to use. ConvertFilter must define a method named call, taking one argument, text. call must return a string representing the new HTML document.
- MarkdownFilter - creates HTML from text using Commonmarker
Because the web can be a scary place, HTML is automatically sanitized after the ConvertFilter runs and before the NodeFilters are processed. This is to prevent malicious or unexpected input from entering the pipeline.
The sanitization process takes a hash configuration of settings. See the Selma documentation for more information on how to configure these settings. Note that the sanitization config must be set up correctly if you expect it to work in conjunction with handlers that manipulate HTML.
A default sanitization config is provided by this library (HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG). A sample custom sanitization allowlist might look like this:
ALLOWLIST = {
  elements: ["p", "pre", "code"]
}

pipeline = HTMLPipeline.new \
  text_filters: [
    HTMLPipeline::TextFilter::ImageFilter.new,
  ],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: ALLOWLIST

result = pipeline.call <<-CODE
This is *great*:

    some_code(:first)

CODE
result[:output].to_s
This would print:
<p>This is great:</p>
<pre><code>some_code(:first)
</code></pre>
Sanitization can be disabled if and only if nil is explicitly passed as the config:
pipeline = HTMLPipeline.new \
  text_filters: [
    HTMLPipeline::TextFilter::ImageFilter.new,
  ],
  convert_filter: HTMLPipeline::ConvertFilter::MarkdownFilter.new,
  sanitization_config: nil
For more examples of customizing the sanitization process to include the tags you want, check out the tests and the FAQ.
NodeFilters can operate either on HTML elements or text nodes using CSS selectors. Each NodeFilter must define a method named selector which provides an instance of Selma::Selector. If elements are being manipulated, handle_element must be defined, taking one argument, element; if text nodes are being manipulated, handle_text_chunk must be defined, taking one argument, text_chunk. @config and @result are available to use, and any changes made to these ivars are passed on to the next filter.
NodeFilter also has an optional method, after_initialize, which is run after the filter initializes. This can be useful in setting up a fresh custom state for result to start from each time the pipeline is called.
Here's an example NodeFilter that adds a base url to images that are root relative:
require 'uri'

class RootRelativeFilter < HTMLPipeline::NodeFilter
  SELECTOR = Selma::Selector.new(match_element: "img")

  def selector
    SELECTOR
  end

  def handle_element(img)
    # use `return`, not `next`: this is a method body, not a block
    return if img['src'].nil?

    src = img['src'].strip
    if src.start_with?('/')
      img["src"] = URI.join(context[:base_url], src).to_s
    end
  end
end
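The URL handling inside a filter like this can be tried on its own, without the gem or Selma. The sketch below uses only Ruby's standard `uri` library; the `absolutize` helper and the sample paths are hypothetical, and the base URL is the same placeholder used in the context hash earlier in this README.

```ruby
require "uri"

# Hypothetical helper mirroring the filter's logic: only root-relative
# paths (those starting with "/") are joined onto the base URL.
def absolutize(src, base_url)
  src = src.strip
  src.start_with?("/") ? URI.join(base_url, src).to_s : src
end

base_url = "http://your-domain.com"

root_relative = absolutize("/images/logo.png", base_url)
padded        = absolutize("  /a.png ", base_url)
already_full  = absolutize("https://cdn.example.com/x.png", base_url)

puts root_relative # => http://your-domain.com/images/logo.png
puts padded        # => http://your-domain.com/a.png
puts already_full  # => https://cdn.example.com/x.png
```

Note that a protocol-relative URL such as `//cdn.example.com/x.png` also starts with `/`, so a production filter may want to treat that case separately.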
For more information on how to write effective NodeFilters, refer to the provided filters and to the underlying library, Selma.
- AbsoluteSourceFilter: replace relative image urls with fully qualified versions
- AssetProxyFilter: replace image links with an encoded link to an asset server
- EmojiFilter: converts :<emoji>: to emoji
  - (Note: the included MarkdownFilter will already convert emoji)
- HttpsFilter: replace http urls with https versions
- ImageMaxWidthFilter: link to full size image for large images
- MentionFilter: replace @user mentions with links
- SanitizationFilter: allowlist-based sanitization of user markup
- SyntaxHighlightFilter: applies syntax highlighting to pre blocks
  - (Note: the included MarkdownFilter will already apply highlighting)
- TableOfContentsFilter: anchor headings with name attributes and generate Table of Contents html unordered list linking headings
- TeamMentionFilter: replace @org/team mentions with links
Since filters can be customized to your heart's content, gem dependencies are not bundled; this project doesn't know which of the default filters you might use, and as such, you must bundle each filter's gem dependencies yourself.
For example, SyntaxHighlightFilter uses rouge to detect and highlight languages; to use the SyntaxHighlightFilter, you must add the following to your Gemfile:
gem "rouge"
Note: See the Gemfile :test group for any version requirements.
When developing a custom filter, call HTMLPipeline.require_dependency at the start to ensure that the local machine has the necessary dependency. You can also use HTMLPipeline.require_dependencies to provide a list of dependencies to check.
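The general idea behind such a check is to attempt the `require` and turn a failure into an actionable message. The following is a rough sketch of that pattern in plain Ruby; `require_optional` is a hypothetical helper written for illustration, and the gem's own HTMLPipeline.require_dependency may take different arguments and produce different messages.

```ruby
# Hypothetical stand-in for a dependency check; not the gem's implementation.
# Tries to load `name` and re-raises with a hint naming the filter that needs it.
def require_optional(name, needed_by)
  require name
rescue LoadError => e
  raise LoadError,
        "Missing dependency '#{name}' for #{needed_by}. " \
        "Add `gem \"#{name}\"` to your Gemfile. (#{e.message})"
end

# A library that ships with Ruby loads without complaint.
require_optional("json", "ExampleFilter")

# A missing library produces the friendlier error.
begin
  require_optional("definitely_not_a_real_gem_xyz", "ExampleFilter")
rescue LoadError => e
  puts e.message
end
```

`LoadError` is not a `StandardError`, so it has to be named explicitly in the `rescue` clause; a bare `rescue` would not catch it.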
On a similar note, you must manually require whichever filters you desire:
require "html_pipeline" # must be included
require "html_pipeline/convert_filter/markdown_filter" # included because you want to use this filter
require "html_pipeline/node_filter/mention_filter" # included because you want to use this filter
Full reference documentation can be found here.
Filters and Pipelines can be set up to be instrumented when called. The pipeline must be set up with an ActiveSupport::Notifications compatible service object and a name. New pipeline objects will default to the HTMLPipeline.default_instrumentation_service object.
# the AS::Notifications-compatible service object
service = ActiveSupport::Notifications
# instrument a specific pipeline
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline", service
# or set default instrumentation service for all new pipelines
HTMLPipeline.default_instrumentation_service = service
pipeline = HTMLPipeline.new [MarkdownFilter], context
pipeline.setup_instrumentation "MarkdownPipeline"
Filters are instrumented when they are run through the pipeline. A call_filter.html_pipeline event is published once any filter finishes; call_text_filters and call_node_filters are published when all of the text and node filters are finished, respectively. The payload should include the filter name. Each filter will trigger its own instrumentation call.
service.subscribe "call_filter.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filter] #=> "MarkdownFilter"
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end
The full pipeline is also instrumented:
service.subscribe "call_text_filters.html_pipeline" do |event, start, ending, transaction_id, payload|
  payload[:pipeline] #=> "MarkdownPipeline", set with `setup_instrumentation`
  payload[:filters] #=> ["MarkdownFilter"]
  payload[:doc] #=> HTML String
  payload[:context] #=> context Hash
  payload[:result] #=> instance of result class
  payload[:result][:output] #=> output HTML String
end
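If ActiveSupport is not available, a "compatible service object" can be approximated by anything that offers the same subscribe/instrument shape. The class below is a minimal sketch under that assumption: `TinyNotifier` is hypothetical, it matches event names exactly (no pattern matching), and it has not been checked against every method the pipeline may call on the service.

```ruby
# Hypothetical, minimal stand-in for an ActiveSupport::Notifications-style
# service. It only mimics the subscribe/instrument shape.
class TinyNotifier
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a block to be called whenever `event_name` is instrumented.
  def subscribe(event_name, &block)
    @subscribers[event_name] << block
  end

  # Run the given block, then notify subscribers with timing and payload.
  def instrument(event_name, payload = {})
    started = Time.now
    result = yield(payload) if block_given?
    finished = Time.now
    @subscribers[event_name].each do |subscriber|
      subscriber.call(event_name, started, finished, object_id.to_s, payload)
    end
    result
  end
end

service = TinyNotifier.new
seen = []

service.subscribe("call_filter.html_pipeline") do |_event, start, ending, _id, payload|
  seen << [payload[:filter], ending - start]
end

html = service.instrument("call_filter.html_pipeline",
                          { pipeline: "MarkdownPipeline", filter: "MarkdownFilter" }) do
  "<p>Hi</p>" # stands in for the work a filter would do
end

puts seen.inspect
puts html
```

For real applications, ActiveSupport::Notifications itself remains the safer choice, since it is the interface the README names.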
To make a pipeline work on a plain text document, put the PlainTextInputFilter at the end of your text_filters config. This will wrap the content in a div so the filters have a root element to work with. If you're passing in an HTML fragment that doesn't have a root element, you can wrap the content in a div yourself.
HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG is the default config used if no sanitization_config argument is given. The default is a good starting template for you to add additional elements. You can either modify the constant's value, or define your own config and pass that in, such as:
config = HTMLPipeline::SanitizationFilter::DEFAULT_CONFIG.deep_dup
config[:elements] << "iframe" # sure, whatever you want
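Note that `deep_dup` comes from ActiveSupport. Where ActiveSupport is not loaded, one possible alternative is a Marshal round-trip, which produces an independent, unfrozen copy of a hash made of plain strings, symbols, arrays, and hashes. The sketch below demonstrates this on a hypothetical stand-in hash (`DEFAULT`), not on the gem's real DEFAULT_CONFIG, whose exact contents are not reproduced here.

```ruby
# DEFAULT is a hypothetical stand-in for a frozen default config;
# it is not the gem's actual DEFAULT_CONFIG.
DEFAULT = {
  elements: ["p", "pre", "code"].freeze,
  attributes: { "a" => ["href"].freeze }.freeze
}.freeze

# Marshal round-trip: a deep copy that shares nothing with DEFAULT
# and does not carry over the frozen state.
config = Marshal.load(Marshal.dump(DEFAULT))
config[:elements] << "iframe"

puts DEFAULT[:elements].inspect # the original is unchanged
puts config[:elements].inspect  # the copy has the extra element
```

A shallow `dup` would not be enough here, because the nested arrays would still be the same (frozen) objects as in the original.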
Thanks to all of these contributors.
This project is a member of the OSS Manifesto.