Remove ads and other uninteresting text from articles #25

Open
fccoelho opened this issue Nov 2, 2013 · 5 comments

Comments

@fccoelho
Member

fccoelho commented Nov 2, 2013

Article pages may contain advertisements and other useless text, which can generate noise in the NLP analyses. We need to find a way to remove them robustly.
Maybe inspecting the code of ad-blocking Firefox extensions can help.

@fccoelho
Member Author

fccoelho commented Nov 2, 2013

Adblock Plus is written in Python:
https://adblockplus.org/en/source

@turicas
Contributor

turicas commented Nov 5, 2013

Can you please provide some examples? In the majority of cases, advertisements are inserted by JavaScript (which retrieves the content from another website), so if we are looking only at the HTML there is no problem...

@fccoelho
Member Author

fccoelho commented Nov 5, 2013

You are right. We need to look at the actual articles to define more precisely what we need to remove. I am hoping we will be able to browse the articles easily once we pull in the visualizations Elisa is working on.

@turicas
Contributor

turicas commented Nov 5, 2013

So let's put this issue on hold for a while and revisit it when we have more data about it. Tagged as "question".

@fccoelho
Member Author

Found a great solution. The Goose library can extract the article text from the raw HTML with high accuracy: https://github.com/grangier/python-goose
We can use it straight away in the downloader and generate new fields in the article collection named "main_text" and "main_html", while also keeping the other metadata extracted by Goose.
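
A minimal sketch of how the downloader step might look, not the project's actual code. The `raw_html` and `metadata` field names, the `add_main_content` helper, and the `goose_extract` adapter are all hypothetical; only `main_text` and `main_html` come from the proposal above. The extractor is passed in as a callable so the enrichment logic can be tried without Goose installed. The adapter assumes python-goose's `Goose().extract(raw_html=...)` call and its `cleaned_text`, `title`, `meta_description` and `top_node` attributes.

```python
def goose_extract(raw_html):
    """Hypothetical adapter around python-goose (must be installed separately)."""
    # Imported lazily so the rest of this sketch runs without the library.
    from goose import Goose
    from lxml import etree

    article = Goose().extract(raw_html=raw_html)
    top = article.top_node  # lxml element Goose picked as the article body
    main_html = etree.tostring(top, encoding="unicode") if top is not None else ""
    return {
        "main_text": article.cleaned_text,
        "main_html": main_html,
        "metadata": {
            "title": article.title,
            "meta_description": article.meta_description,
        },
    }


def add_main_content(doc, extract):
    """Return a copy of `doc` with the extracted fields added.

    doc     -- dict holding the page under "raw_html" (assumed field name)
    extract -- callable: raw HTML -> dict with "main_text", "main_html"
               and, optionally, "metadata"
    """
    result = extract(doc["raw_html"])
    enriched = dict(doc)  # shallow copy; the original document is left untouched
    enriched["main_text"] = result["main_text"]
    enriched["main_html"] = result["main_html"]
    enriched["metadata"] = result.get("metadata", {})
    return enriched
```

In the downloader this would be called as `add_main_content(doc, goose_extract)` before saving the document to the article collection. Keeping the raw HTML alongside the new fields means the extraction can be re-run later if a better extractor turns up.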


2 participants