Remove ads and other uninteresting text from articles #25
Comments
Adblockplus is written in python:
Can you please provide some examples? In the majority of cases, advertisements are injected with JavaScript (which retrieves the content from another website), so if we look only at the HTML there is no problem...
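To illustrate the point about script-injected ads, here is a minimal sketch using only the Python standard library. It assumes the ad content lives inside `script`, `style`, `noscript`, or `iframe` elements; the names `VisibleTextExtractor` and `visible_text` are made up for this example and are not part of the project.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring elements that usually carry no article text."""

    # Assumption: ads and other noise arrive through these elements.
    SKIPPED_TAGS = {"script", "style", "noscript", "iframe"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(raw_html):
    """Return the text of raw_html with script/style/iframe content dropped."""
    parser = VisibleTextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

This only shows that script content never reaches the extracted text. It does nothing about navigation menus, footers, or ads written directly into the markup, so it is not a complete cleaning step.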
You are right. We need to look at the actual articles to define more precisely what we need to remove. I am hoping we will be able to browse the articles easily once we pull the visualizations Elisa is working on.
So let's put this issue on hold for a while and revisit it when we have more data. Tagged as "question".
Found a great solution. The Goose library can extract the article text from the raw HTML with high accuracy. https://github.com/grangier/python-goose
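For reference, a rough sketch of how Goose could be called on HTML we have already downloaded. It relies on the `Goose().extract(raw_html=...)` call and the `cleaned_text` attribute described in the python-goose README; the wrapper `extract_article_text` and its optional `goose` parameter are hypothetical, added here only so the extractor can be swapped out or stubbed.

```python
def extract_article_text(raw_html, goose=None):
    """Return the main article text from raw_html.

    `goose` is any object with an `extract(raw_html=...)` method whose
    result has a `cleaned_text` attribute (hypothetical parameter).
    """
    if goose is None:
        # Imported lazily so this module still loads without python-goose.
        from goose import Goose
        goose = Goose()
    article = goose.extract(raw_html=raw_html)
    return article.cleaned_text
```

With python-goose installed, `extract_article_text(html)` should return the body text that Goose identifies as the article. How well that works on our sources still needs to be checked against real pages.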
Article pages may have advertisements and other useless text which can generate noise in the NLP analyses. We need to find a way to remove them robustly.
Maybe inspecting the code of ad-blocking Firefox extensions can help.