LookupError when giving url as one that is already saved on the disk (file:///) #13

Open
sekon opened this issue Sep 28, 2013 · 1 comment

sekon commented Sep 28, 2013

Hello,
First, thank you for python-boilerpipe.
When I use wget to fetch the page http://www.flipkart.com/dell-xps-13-laptop-2nd-gen-ci7-4gb-256gb-ssd-win7-hp/p/itmdg387gmhzhx3m, save it to disk, and then try to open it with python-boilerpipe using the following code:

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='DefaultExtractor', url="file:///home/code/code/opinion/boilerpipe-binary/itmdg387gmhzhx3m")
#extractor = Extractor(extractor='DefaultExtractor', url="http://www.flipkart.com/dell-xps-13-laptop-2nd-gen-ci7-4gb-256gb-ssd-win7-hp/p/itmdg387gmhzhx3m")
extracted_text = extractor.getText()
extracted_html = extractor.getHTML()
print extracted_html

I get the following error:

Traceback (most recent call last):
  File "htmlExtractor.py", line 2, in <module>
    extractor = Extractor(extractor='DefaultExtractor', url="file:///home/code/code/opinion/boilerpipe-binary/itmdg387gmhzhx3m")
  File "/usr/local/lib/python2.7/dist-packages/boilerpipe/extract/__init__.py", line 41, in __init__
    self.data = unicode(self.data, encoding)
LookupError: unknown encoding: text/plain

I have already set up a spider with Scrapy, so processing files on disk is very important to me.

Warm regards,
Harish Badrinath

sekon commented Sep 29, 2013

Hello,
Just an update: renaming itmdg387gmhzhx3m to itmdg387gmhzhx3m.html gives me the expected output. The problem seems to lie in using connection.headers['content-type'] to determine the encoding (line 41 of src/boilerpipe/extract/__init__.py). A possible fix may be to use python-magic, but that only works for local file paths; the location can't be a URL like file:///.
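
For illustration, here is a rough sketch of one way the encoding could be derived from the header without ever passing a bare media type such as text/plain to the decoder. This is not the library's actual code: the names charset_from_content_type and decode_body are hypothetical, and the utf-8 fallback is an assumption (a detector such as chardet could be used there instead). It relies only on the standard library, reading the charset= parameter if present and checking that Python knows the codec.

```python
import codecs
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return a usable codec name for a Content-Type header value.

    Hypothetical helper: falls back to `default` (an assumed choice) when
    the header has no charset= parameter or names an unknown codec.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # None when there is no charset=
    if charset is None:
        return default
    try:
        codecs.lookup(charset)  # raises LookupError for unknown codecs
    except LookupError:
        return default
    return charset


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset implied by its header."""
    encoding = charset_from_content_type(content_type)
    # "replace" avoids a hard failure if the guessed encoding is wrong
    return raw_bytes.decode(encoding, "replace")
```

With something like this, a header of text/plain (as reported for the extensionless file:/// URL above) would fall back to the default instead of raising LookupError, while text/html; charset=ISO-8859-1 would still be honoured.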
