Hi! I'm developing spam filters and have to convert HTML emails to plain text for analysis. I've used html2text and later my own simplified implementation, but inscriptis looks even better!
Is it possible to implement optional filtering/ignoring of hidden text parts? I mean text written in a very small font size, or in a font color equal (or close) to the background color. Sometimes this is defined in CSS/style tags, sometimes in a span tag's attributes.
This technique is often used on webpages and spam emails to fool search engines and spam filters with fake content not visible to human viewers.
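To make "font color equal (or close to) background color" concrete, here is a minimal sketch of one possible closeness test. It is only an illustration: the helper names `parse_hex_color` and `colors_are_close`, the Euclidean RGB distance, and the threshold of 40 are assumptions of mine, not anything inscriptis provides, and only `#rgb` / `#rrggbb` values are handled.

```python
import math


def parse_hex_color(value: str) -> tuple[int, int, int]:
    """Parse '#rgb' or '#rrggbb' into an (r, g, b) tuple (hypothetical helper)."""
    value = value.strip().lstrip("#")
    if len(value) == 3:
        # Expand shorthand: 'abc' -> 'aabbcc'
        value = "".join(ch * 2 for ch in value)
    if len(value) != 6:
        raise ValueError(f"unsupported color: {value!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def colors_are_close(foreground: str, background: str, threshold: float = 40.0) -> bool:
    """Return True if two colors are within `threshold` in RGB space.

    The threshold is an arbitrary assumption and would need tuning on real mail.
    """
    r1, g1, b1 = parse_hex_color(foreground)
    r2, g2, b2 = parse_hex_color(background)
    distance = math.sqrt((r1 - r2) ** 2 + (g1 - g2) ** 2 + (b1 - b2) ** 2)
    return distance <= threshold
```

Plain RGB distance is a crude stand-in for perceived contrast, so a real filter would probably want a perceptual measure and support for named colors and `rgb()` values.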
We could add functions that interpret stylesheet options prior to applying them (e.g., set the text to invisible if its size, or its color contrast with the background, is below a certain threshold).
Limitations:
We probably wouldn't activate these functions by default.
Spammers could adapt (e.g., by using stylesheets rather than the style attribute).
Would such a feature be helpful for your application?
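As a rough illustration of the idea above, here is a sketch that only uses Python's standard `html.parser` and is not based on inscriptis internals. The names `style_hides_text`, `VisibleTextExtractor`, `visible_text`, and the 4px threshold are all hypothetical. It looks at inline `style` attributes only, so it has exactly the limitation mentioned: rules from `<style>` blocks or external stylesheets are not seen. It also assumes well-formed HTML where every non-void tag is closed.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def style_hides_text(style: str, min_font_px: float = 4.0) -> bool:
    """Guess whether an inline style hides its text (threshold is an assumption)."""
    declarations = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            declarations[name.strip().lower()] = value.strip().lower()
    if declarations.get("display") == "none":
        return True
    if declarations.get("visibility") == "hidden":
        return True
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*(px|pt)",
                         declarations.get("font-size", ""))
    if match:
        size = float(match.group(1))
        if match.group(2) == "pt":
            size *= 4 / 3  # CSS defines 1pt as 4/3 px
        return size < min_font_px
    return False


class VisibleTextExtractor(HTMLParser):
    """Collects text, skipping elements whose inline style looks hidden."""

    def __init__(self):
        super().__init__()
        self.hidden_stack = []  # one bool per currently open element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = bool(self.hidden_stack and self.hidden_stack[-1])
        # Hidden state is inherited by all descendants.
        self.hidden_stack.append(parent_hidden or style_hides_text(style))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        if not (self.hidden_stack and self.hidden_stack[-1]):
            self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

For example, `visible_text('<p>Hello <span style="font-size:1px">cheap pills</span>world</p>')` drops the tiny-font span. Keeping this behind an explicit opt-in function matches the point that it would probably not be active by default.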
Here is a sample: http://thot.banki.hu/deepspam/poison.html