Converting from bleach to nh3
Bleach is deprecated. Here's how to come close to replicating bleach.clean() using nh3.clean().
import nh3


def clean_string(string: str) -> str:
    # The arguments below being passed to `nh3.clean()` are
    # the default values of the `bleach.clean()` function.
    return nh3.clean(
        string,
        tags={
            "a",
            "abbr",
            "acronym",
            "b",
            "blockquote",
            "code",
            "em",
            "i",
            "li",
            "ol",
            "strong",
            "ul",
        },
        attributes={
            "a": {"href", "title"},
            "abbr": {"title"},
            "acronym": {"title"},
        },
        url_schemes={"http", "https", "mailto"},
        # nh3 adds rel="noopener noreferrer" to links by default;
        # None turns that off to match bleach.
        link_rel=None,
    )
The big difference is that bleach escapes ("safes") disallowed HTML so that it shows up as plain text, whereas nh3 removes the offending tags altogether. Read the comments in the results below to see what this means.
Results:
>>> input_from_user = """<b>
<img src="">
I\'m not trying to XSS you <a href="https://example.com">Link</a>
</b>"""
>>>
>>> # By default, bleach escapes ("safes") disallowed HTML
>>> # rather than removing the tags altogether.
>>> bleach.clean(input_from_user)
'<b>&lt;img src=""&gt;I\'m not trying to XSS you <a href="https://example.com">Link</a></b>'
>>>
>>> # In contrast, nh3 removes the offending tags entirely
>>> # while also preserving whitespace.
>>> clean_string(input_from_user)
'<b>\n\nI\'m not trying to XSS you <a href="https://example.com">Link</a>\n</b>'
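The "safing" that bleach performs amounts to HTML-escaping the disallowed markup so that a browser displays it as text. A minimal standard-library sketch of that idea follows. `escape_fragment` is a hypothetical helper used only for illustration, and this is not how bleach is implemented internally.

```python
import html


def escape_fragment(fragment: str) -> str:
    """Escape markup so it renders as literal text, not as an element.

    Hypothetical helper for illustration only; it is not a sanitizer.
    """
    # quote=False leaves double quotes alone; only &, < and > are escaped.
    return html.escape(fragment, quote=False)


print(escape_fragment('<img src="">'))
# → &lt;img src=""&gt;
```

An escaped tag stays visible to the reader as literal text, whereas nh3 drops the tag from the output entirely.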
Advantages of switching to nh3 are:
- nh3 is actively maintained, while bleach is officially deprecated.
- I believe the nh3 technique of stripping disallowed tags rather than safing them is more secure. The idea of safing is great, but I've always wondered if a creative attacker could find a way to exploit it. So I think it is better to remove the offending tags altogether.
- Preserving whitespace is really useful for content submitted in a textarea. This is especially true for Markdown content.
- nh3 is a binding to the rust-ammonia project, whose maintainers claim a 15x speed increase over bleach, which is built on the html5lib project. Even if that is a 3x exaggeration, that's still a 5x speed increase.
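The speed claim is easy to check on your own content. Below is a rough sketch using `timeit`. `time_cleaners` and `SAMPLE` are hypothetical names, the sample input is made up, and each library is timed only if it happens to be installed. Note that it calls each library's `clean()` with its own defaults, which are not identical allowlists, so treat the numbers as a ballpark rather than a rigorous benchmark.

```python
import timeit
from typing import Callable, Dict


def time_cleaners(
    cleaners: Dict[str, Callable[[str], str]],
    sample: str,
    number: int = 1000,
) -> Dict[str, float]:
    """Return the total seconds each cleaner needs for `number` calls."""
    return {
        # fn=fn binds the current callable to the lambda at definition time.
        name: timeit.timeit(lambda fn=fn: fn(sample), number=number)
        for name, fn in cleaners.items()
    }


# Hypothetical sample input; real user content will behave differently.
SAMPLE = '<b><img src="">Hello <a href="https://example.com">Link</a></b>' * 50

# Only time the libraries that are actually installed.
cleaners: Dict[str, Callable[[str], str]] = {}
try:
    import nh3
    cleaners["nh3"] = nh3.clean
except ImportError:
    pass
try:
    import bleach
    cleaners["bleach"] = bleach.clean
except ImportError:
    pass

for name, seconds in time_cleaners(cleaners, SAMPLE).items():
    print(f"{name}: {seconds:.4f}s for 1000 calls")
```

For a closer comparison, swap `nh3.clean` for the `clean_string()` function defined earlier so both libraries work from the same allowlist.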