LookupError when giving url as one that is already saved on the disk (file:///) #13

sekon · 2013-09-28T08:23:58Z

Hello,
Firstly, thank you for python-boilerpipe.
When I use wget to fetch the page http://www.flipkart.com/dell-xps-13-laptop-2nd-gen-ci7-4gb-256gb-ssd-win7-hp/p/itmdg387gmhzhx3m, save it to disk, and then try to open it with python-boilerpipe using the following code:

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='DefaultExtractor', url="file:///home/code/code/opinion/boilerpipe-binary/itmdg387gmhzhx3m")
#extractor = Extractor(extractor='DefaultExtractor', url="http://www.flipkart.com/dell-xps-13-laptop-2nd-gen-ci7-4gb-256gb-ssd-win7-hp/p/itmdg387gmhzhx3m")
extracted_text = extractor.getText()
extracted_html = extractor.getHTML()
print extracted_html

I get the following error:

Traceback (most recent call last):
  File "htmlExtractor.py", line 2, in <module>
    extractor = Extractor(extractor='DefaultExtractor', url="file:///home/code/code/opinion/boilerpipe-binary/itmdg387gmhzhx3m")
  File "/usr/local/lib/python2.7/dist-packages/boilerpipe/extract/__init__.py", line 41, in __init__
    self.data = unicode(self.data, encoding)
LookupError: unknown encoding: text/plain

I have already set up a spider with Scrapy, so processing files on disk is very important for me.

Warm regards,
Harish Badrinath


sekon · 2013-09-29T12:45:05Z

Hello,
Just an update: renaming itmdg387gmhzhx3m to itmdg387gmhzhx3m.html gives me the expected output. The problem seems to lie in using connection.headers['content-type'] to determine the encoding (line 41 of src/boilerpipe/extract/__init__.py). Judging by the traceback, for the extensionless file that header is just text/plain, and that value ends up being used as the encoding name. A possible fix may be to use python-magic, but that only works for local files, and the file location can't be a URL like file:///.
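
For illustration, here is a minimal sketch of a more defensive way to derive the encoding from a Content-Type value. This is not python-boilerpipe's actual code: the helper name charset_from_content_type and the utf-8 default are my own assumptions. It only accepts a charset= parameter that Python recognises as a codec, and otherwise falls back to the default, so a bare text/plain would no longer be handed to unicode() as an encoding.

```python
import codecs


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper, not part of python-boilerpipe. The default of
    utf-8 is an assumption; pick whatever suits your data.
    """
    # Skip the media type itself ("text/html") and look only at parameters.
    for part in (content_type or "").split(";")[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset":
            candidate = value.strip().strip("\"'")
            try:
                # Only accept names Python actually knows as codecs.
                codecs.lookup(candidate)
                return candidate
            except LookupError:
                break
    return default


print(charset_from_content_type("text/plain"))
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
```

With a helper along these lines, the header value for the extensionless file would fall back to the default instead of raising LookupError. Whether utf-8 is the right guess for a given page is a separate question.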
