Remove ads and other uninteresting text from articles #25

fccoelho · 2013-11-02T18:28:51Z

Article pages may contain advertisements and other useless text, which can generate noise in the NLP analyses. We need to find a way to remove them robustly.
Maybe inspecting the code of ad-blocking Firefox extensions can help.

fccoelho · 2013-11-02T18:34:55Z

The Adblock Plus source code is available here:
https://adblockplus.org/en/source

turicas · 2013-11-05T19:52:58Z

Can you please provide some examples? In the majority of cases the advertisements are inserted using JavaScript (which retrieves the content from another website), so if we are looking only at the HTML there is no problem...

fccoelho · 2013-11-05T19:57:17Z

You are right. We need to look at the actual articles to define more precisely what we need to remove. I am hoping we will be able to browse the articles easily once we pull in the visualizations Elisa is working on.

turicas · 2013-11-05T20:03:45Z

So let's put this issue on hold for a while and revisit it when we have more data. Tagged as "question".

fccoelho · 2014-08-31T13:43:07Z

Found a great solution. The Goose library can extract the article text from the raw HTML with high accuracy: https://github.com/grangier/python-goose
We can use it straight away in the downloader to generate new fields in the article collection named "main_text" and "main_html", while also keeping the other metadata extracted by Goose.
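
A rough sketch of what this step could look like in the downloader. The helper names (`extract_main_fields`, `update_article`) and the `html` field holding the raw page are assumptions for illustration, not existing project code. The extractor is passed in, so anything with a Goose-style `extract(raw_html=...)` method works, and which extra metadata to keep (here only the title and meta description) is still open.

```python
def extract_main_fields(raw_html, extractor):
    """Build the proposed new article fields from one raw HTML page.

    `extractor` is any object with a Goose-style `extract(raw_html=...)`
    method, e.g. `goose.Goose()`.
    """
    article = extractor.extract(raw_html=raw_html)

    main_html = None
    top_node = getattr(article, "top_node", None)
    if top_node is not None:
        # Goose exposes the main content node as an lxml element, so lxml
        # is only imported when there is a node to serialize.
        from lxml import etree
        main_html = etree.tostring(top_node, encoding="utf-8").decode("utf-8")

    return {
        "main_text": article.cleaned_text,
        "main_html": main_html,
        "title": article.title,
        "meta_description": article.meta_description,
    }


def update_article(doc, extractor, html_field="html"):
    """Add the extracted fields to an article document (a plain dict).

    `html_field` is a hypothetical name for the field holding the raw page.
    """
    doc.update(extract_main_fields(doc[html_field], extractor))
    return doc
```

With python-goose installed, the call would be something like `update_article(doc, Goose())` after `from goose import Goose`. The updated document can then be saved back to the article collection.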

fccoelho added enhancement and removed question labels Aug 31, 2014
