Consider switching from lxml's clean_html for enhanced security (and possibly performance) #209
Comments
Thanks again @frenzymadness, it looks as if the only usage of cleaner here is in relation to the html_text library, so we could actually remove the usage of cleaner here and solve the issue in html_text, in TeamHG-Memex/html-text#30.
The usage in question: extruct/extruct/w3cmicrodata.py, lines 249 to 250 in 6053812
|
Could/should this be a separate component on PyPI that just wraps all of the methods with docs on the default? |
Who builds LXML and nh3? With e.g. cibuildwheel? And are there gpg signatures of the package archive and its manifest of installable package files' checksums? |
The HTML clean functionality is now a separate project on PyPI: https://pypi.org/project/lxml-html-clean/ Next step is to make lxml itself use it.
LXML uses cibuildwheel in its github workflows. nh3 also provides wheels but I'm not sure how they build them.
I don't know. |
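For projects that want to keep using the lxml cleaner while it moves to the separate package, a minimal sketch of an import fallback might look like the following. It assumes the new project installs the module `lxml_html_clean` and that older lxml releases still ship `lxml.html.clean`; the helper name `load_cleaner` is hypothetical and only for illustration.

```python
import importlib


def load_cleaner():
    """Return a Cleaner class from whichever package provides it, else None.

    Assumption: the standalone project exposes `lxml_html_clean.Cleaner`,
    and older lxml releases expose `lxml.html.clean.Cleaner`.
    """
    for module_name in ("lxml_html_clean", "lxml.html.clean"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Package missing (or lxml no longer bundles the cleaner).
            continue
        cleaner_class = getattr(module, "Cleaner", None)
        if cleaner_class is not None:
            return cleaner_class
    return None


Cleaner = load_cleaner()
if Cleaner is None:
    print("No lxml-based HTML cleaner is installed")
```

Callers can then check for `None` and either skip cleaning or fall back to another sanitizer, rather than failing at import time.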
Thx.
There's also https://github.com/mozilla/bleach
A comparison and merging of test cases might also be good.
|
Yes, but it's deprecated and in maintenance-only mode (which might be enough). |
Just an update on this. The latest version of If you want to continue using it, you can either depend on |
I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in this functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.
The main problem is in the design. Because lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with every new possibility in HTML and JS.
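To illustrate the design difference, here is a toy allowlist-based sanitizer built only on the standard library's `html.parser`. It is a sketch under stated assumptions, not a production sanitizer and not how lxml, bleach or nh3 are implemented: the tag, attribute and URL-prefix sets are hypothetical choices. The point is that anything not explicitly allowed is dropped, so new HTML or JS features are rejected by default instead of needing a blocklist update.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists, chosen only for illustration.
ALLOWED_TAGS = {"a", "b", "em", "i", "p", "strong"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
# Elements whose text content is dropped along with the tag itself.
DROP_WITH_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # e.g. javascript: URLs never match the allowlist
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.parts.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(sanitize('<p onclick="x()">hi <blink>there</blink><script>alert(1)</script></p>'))
# prints: <p>hi there</p>
```

A blocklist would have to know about `onclick`, `<blink>` and `javascript:` URLs individually; here none of them needs to be named.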
Two viable alternatives worth considering are bleach and nh3. Here's why:
bleach:
nh3:
We'll probably move the cleaning part of lxml to a distinct project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.
Let me know if we can help you with this transition in any way, and have a nice day.