Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reuters' crawler #95

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
Open

Conversation

lucasmachadorj
Copy link
Contributor

@flavioamieiro please, review this code . It's necessary to run it with python 3.

As discussed with @fccoelho I'm porting the crawlers to python 3 and was also necessary to port goose-extractor, which is here: https://github.com/lucasmachadorj/python-goose
With this version, we can extract title and content without errors, but I will work to make the whole package compatible with python 2 and 3.
Furthermore, it's necessary to use this version of lxml: https://github.com/lucasmachadorj/lxml
lxml makes use of beautifulSoup, which is not compatible with python 3, but the above version fix this problem. The only script really used from lxml by goose is soupparser: https://github.com/lucasmachadorj/lxml/blob/master/src/lxml/html/soupparser.py

Install lxml from source is a bit complex and I don't know the best way we can do it.

@fccoelho
Copy link
Member

fccoelho commented Oct 6, 2015

@flavioamieiro and @lucasmachadorj, I think we shoud just move the soupparser.py from lxml into our goose fork and avoid having to fork lxml as well. What do you think?

@lucasmachadorj
Copy link
Contributor Author

@fccoelho I think it is a good idea. I've tested and it worked. If @flavioamieiro agree I'll commit this change.

@flavioamieiro
Copy link
Member

I think it's a good idea, but please leave it clear (as comments in the file) where it comes from and why we need it there.

In general, I'm not a big fan of maintaining our own forks of libraries. It's a lot of work, and it's very easy to lag behind (specially in bug fixes and features). But if we need to use python3 and there's no intention by the goose community of supporting this version of python in the foreseeable future, this may be our only option.

I'd just ask you to add goose on the requirements pointing to your fork (as we did in pypln.backend for wordcloud).

@fccoelho
Copy link
Member

fccoelho commented Oct 7, 2015

@flavioamieiro I agree with your observations on the documentation of the "pulled in" module.
Regarding forking goose, as far as I know we don't have an option because the development of the original goose seems to have stopped. Today there are many forks which go in many different directions. Instead of picking one of these I think it is best to maintain our own. I also suggest we use git submodule to keep Lucas' fork within our own, to facilitate the deploy of MC backend. What do you think?

@flavioamieiro
Copy link
Member

I personally don't have a lot of experience with git submodules, but I think it may be worth a try. It may be a good way to keep it up to date. But using pip's -e git+<url> option works pretty well also (we do it for our wordcloud fork and haven't had problems).

@lucasmachadorj
Copy link
Contributor Author

I added Pillow to requirements.txt because goose uses PIL and Pillow is its fork compatible with python 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants