Multibyte non-utf-8 encoded pages are decoded incorrectly #7

arshaw · 2011-02-09T05:10:41Z

Reported by [email protected], Jan 30, 2010

What steps will reproduce the problem?

Scrape the <title> of http://www.sony.jp/

import scrapemark

res = scrapemark.scrape("<title>{{title}}</title>",
    url="http://www.sony.jp/")

Print the result

  print res['title']

What is the expected output? What do you see instead?
Expected result is 'ソニー製品情報 | ソニー'
Instead I get '\j[i | \j['

What version of the product are you using? On what operating system?
Version 0.9, tested on Mac OS X and Ubuntu Linux

bsidhom · 2012-12-18T20:55:48Z

This happens because the fetched data is automatically interpreted as if it were UTF-8 encoded, but this page uses SHIFT_JIS. chardet.detect (from the third-party chardet module) detects the encoding correctly. If you want, you could change scrapemark to use it internally, so that pages are detected and decoded automatically. Alternatively, call scrapemark's fetch_html directly and decode the result yourself:

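import chardet
import scrapemark

html = scrapemark.fetch_html('http://www.sony.jp/')
# Decode the raw bytes using the encoding chardet detects (SHIFT_JIS here)
text = html.decode(chardet.detect(html)['encoding'])
res = scrapemark.scrape('<title>{{title}}</title>', text)
print res['title']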
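
For anyone who wants to see the effect without hitting the network, here is a rough standard-library sketch (Python 3, no scrapemark or chardet needed). It encodes the expected title as Shift_JIS and then decodes those bytes as UTF-8 while dropping invalid bytes, which appears to reproduce the reported output. That is only a guess at what happens inside scrapemark, not something confirmed from its source. decode_with_fallback is a hypothetical helper made up for this example, and its candidate list is an assumption. It tries encodings strictly in order, a cruder alternative to chardet's statistical detection.

```python
# Python 3 sketch. decode_with_fallback is a hypothetical helper,
# not part of scrapemark or chardet.

def decode_with_fallback(raw, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return (text, encoding) for the first candidate that decodes strictly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails.
    return raw.decode("latin-1"), "latin-1"


title = "ソニー製品情報 | ソニー"
raw = title.encode("shift_jis")  # stand-in for the bytes the server sends

# Treating the bytes as UTF-8 and discarding invalid ones leaves only the
# ASCII-range trail bytes of the Shift_JIS pairs.
garbled = raw.decode("utf-8", errors="ignore")
print(repr(garbled))

# Strict UTF-8 decoding fails on these bytes, so the helper moves on to Shift_JIS.
text, encoding = decode_with_fallback(raw)
print(text, encoding)
```

Trying candidates in a fixed order only works when the wrong candidates fail to decode. Some byte sequences are valid in several encodings, so chardet.detect is probably the safer choice for arbitrary pages.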
