FIX: Allow multiple inlined image data links in html clean
Add a lazy quantifier to the `_find_image_dataurls` regex
so that it matches as few characters as possible
and stops at the first occurrence of `;base64,`.

e.g.
```py
>>> import re
>>> _find_image_dataurls = re.compile(r'data:image/(.+);base64,', re.I).findall
>>> _find_image_dataurls('<div style="background: url(data:image/jpeg;base64,foo); background-image: url(data:image/jpeg;base64,foo);"></div>')
['jpeg;base64,foo); background-image: url(data:image/jpeg']
```

```py
>>> _find_image_dataurls = re.compile(r'data:image/(.+?);base64,', re.I).findall
>>> _find_image_dataurls('<div style="background: url(data:image/jpeg;base64,foo); background-image: url(data:image/jpeg;base64,foo);"></div>')
['jpeg', 'jpeg']
```

This allows multiple image data links on the same line,
which happens, for instance, in inline styles.

Without this change, `_has_javascript_scheme` returns `True`
because the count of safe image URLs is lower than the number of
possibly malicious schemes.
The whole style is then treated as malicious and dropped.
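
The counting problem described above can be reproduced with a minimal sketch. The helper `exceeds_safe_count` and the constant names are hypothetical and only mirror the comparison described in this message; the sketch also assumes every matched image type counts as safe, which the real `_has_javascript_scheme` may not do.

```python
import re

# The two regex variants from this commit, plus the scheme regex from clean.py.
GREEDY = re.compile(r'data:image/(.+);base64,', re.I).findall
LAZY = re.compile(r'data:image/(.+?);base64,', re.I).findall
SCHEMES = re.compile(
    r'(javascript|jscript|livescript|vbscript|data|about|mocha):',
    re.I).findall


def exceeds_safe_count(style, find_image_dataurls):
    """Hypothetical stand-in for the check described above.

    Assumption: every matched image type is counted as safe.
    """
    safe_image_urls = len(find_image_dataurls(style))
    return len(SCHEMES(style)) > safe_image_urls


style = ("background: url(data:image/jpeg;base64,MTIz); "
         "background-image: url(data:image/jpeg;base64,MTIz)")

# Greedy: one merged match against two `data:` schemes, so the style is flagged.
flagged_before = exceeds_safe_count(style, GREEDY)
# Lazy: two matches against two `data:` schemes, so the style is kept.
flagged_after = exceeds_safe_count(style, LAZY)
```

With the greedy pattern the two data URLs collapse into a single match, so the scheme count exceeds the safe count; with the lazy pattern the counts are equal.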

Co-authored-by: Christophe Simonis <chs@odoo.com>
2 people authored and frenzymadness committed Apr 5, 2024
1 parent 2dfd5ac commit 97402b5
Showing 2 changed files with 26 additions and 1 deletion.
2 changes: 1 addition & 1 deletion lxml_html_clean/clean.py
@@ -54,7 +54,7 @@
 # All kinds of schemes besides just javascript: that can cause
 # execution:
 _find_image_dataurls = re.compile(
-    r'data:image/(.+);base64,', re.I).findall
+    r'data:image/(.+?);base64,', re.I).findall
 _possibly_malicious_schemes = re.compile(
     r'(javascript|jscript|livescript|vbscript|data|about|mocha):',
     re.I).findall
25 changes: 25 additions & 0 deletions tests/test_clean.py
@@ -255,6 +255,31 @@ def test_image_data_links_in_style(self):
                 cleaned,
                 "%s -> %s" % (url, cleaned))
 
+    def test_image_data_links_in_inline_style(self):
+        safe_attrs = set(lxml.html.defs.safe_attrs)
+        safe_attrs.add('style')
+
+        cleaner = Cleaner(
+            safe_attrs_only=True,
+            safe_attrs=safe_attrs)
+
+        data = b'123'
+        data_b64 = base64.b64encode(data).decode('ASCII')
+        url = "url(data:image/jpeg;base64,%s)" % data_b64
+        styles = [
+            "background: %s" % url,
+            "background: %s; background-image: %s" % (url, url),
+        ]
+        for style in styles:
+            html = '<div style="%s"></div>' % style
+            s = lxml.html.fragment_fromstring(html)
+
+            cleaned = lxml.html.tostring(cleaner.clean_html(s))
+            self.assertEqual(
+                html.encode("UTF-8"),
+                cleaned,
+                "%s -> %s" % (style, cleaned))
+
     def test_formaction_attribute_in_button_input(self):
         # The formaction attribute overrides the form's action and should be
         # treated as a malicious link attribute