-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reuters' crawler #95
base: master
Are you sure you want to change the base?
Reuters' crawler #95
Conversation
@flavioamieiro and @lucasmachadorj, I think we shoud just move the soupparser.py from lxml into our goose fork and avoid having to fork lxml as well. What do you think? |
@fccoelho I think it is a good idea. I've tested and it worked. If @flavioamieiro agree I'll commit this change. |
I think it's a good idea, but please leave it clear (as comments in the file) where it comes from and why we need it there. In general, I'm not a big fan of maintaining our own forks of libraries. It's a lot of work, and it's very easy to lag behind (specially in bug fixes and features). But if we need to use python3 and there's no intention by the goose community of supporting this version of python in the foreseeable future, this may be our only option. I'd just ask you to add goose on the requirements pointing to your fork (as we did in pypln.backend for wordcloud). |
@flavioamieiro I agree with your observations on the documentation of the "pulled in" module. |
I personally don't have a lot of experience with git submodules, but I think it may be worth a try. It may be a good way to keep it up to date. But using pip's |
I added Pillow to requirements.txt because goose uses PIL and Pillow is its fork compatible with python 3. |
@flavioamieiro please, review this code . It's necessary to run it with python 3.
As discussed with @fccoelho I'm porting the crawlers to python 3 and was also necessary to port goose-extractor, which is here: https://github.com/lucasmachadorj/python-goose
With this version, we can extract title and content without errors, but I will work to make the whole package compatible with python 2 and 3.
Furthermore, it's necessary to use this version of lxml: https://github.com/lucasmachadorj/lxml
lxml makes use of beautifulSoup, which is not compatible with python 3, but the above version fix this problem. The only script really used from lxml by goose is soupparser: https://github.com/lucasmachadorj/lxml/blob/master/src/lxml/html/soupparser.py
Install lxml from source is a bit complex and I don't know the best way we can do it.