Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: Allow multiple inlined image data links in html clean #5

Conversation

beledouxdenis
Copy link
Contributor

Add a lazy quantifier in the regex _find_image_dataurls to match as few characters as possible,
to make it stop at the first occurence of ;base64,

e.g.

>>> _find_image_dataurls = re.compile(r'data:image/(.+);base64,', re.I).findall
>>> _find_image_dataurls('<div style="background: url(); background-image: url();"></div>')
['jpeg;base64,foo); background-image: url(data:image/jpeg']
>>> _find_image_dataurls = re.compile(r'data:image/(.+?);base64,', re.I).findall
>>> _find_image_dataurls('<div style="background: url(); background-image: url();"></div>')
['jpeg', 'jpeg']

This allows to have multiple image data links on the same line, which happens for instance in inline styles.

Without this change, _has_javascript_scheme returns True because the count of safe image urls is lower than the number of possible malicious scheme.
Then, the whole style is dropped as considered malicious.

Add a lazy quantifier in the regex `_find_image_dataurls`
to match as few characters as possible,
to make it stop at the first occurence of `;base64,`

e.g.
```py
>>> _find_image_dataurls = re.compile(r'data:image/(.+);base64,', re.I).findall
>>> _find_image_dataurls('<div style="background: url(); background-image: url();"></div>')
['jpeg;base64,foo); background-image: url(data:image/jpeg']
```

```py
>>> _find_image_dataurls = re.compile(r'data:image/(.+?);base64,', re.I).findall
>>> _find_image_dataurls('<div style="background: url(); background-image: url();"></div>')
['jpeg', 'jpeg']
```

This allows to have multiple image data links on the same line,
which happens for instance in inline styles.

Without this change, `_has_javascript_scheme` returns `True`
because the count of safe image urls is lower than the number of
possible malicious scheme.
Then, the whole style is dropped as considered malicious.

Co-authored-by: Christophe Simonis <chs@odoo.com>
@frenzymadness
Copy link
Member

Thank you for your contribution. I'll look at this closely but not sooner than next week.

@frenzymadness
Copy link
Member

I like this change. The problem in the CI is unrelated to this PR. I'm gonna try to fix it and then merge the PR. Do you need a new release?

@beledouxdenis
Copy link
Contributor Author

The problem in the CI is unrelated to this PR. I'm gonna try to fix it and then merge the PR.

No problem, take your time :)

Do you need a new release?

We do not need. We use lxml 4.8.0 and we cannot, in our stable releases, suddenly bump the version to a new release neither we can suddenly add a new dependency to this lxml_html_clean lib, so we are screwed anyway. We will most-likely monkey-patch this locally in our project.

@frenzymadness
Copy link
Member

In the end, no hotfix on our side is needed. Thank you once more.

@frenzymadness frenzymadness merged commit 97402b5 into fedora-python:main Apr 5, 2024
13 of 14 checks passed
@beledouxdenis
Copy link
Contributor Author

Thanks 😚

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants