Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support other content-encodings, other then utf8 (support Swedish characters) #10

Open
arshaw opened this issue Feb 9, 2011 · 0 comments

Comments

@arshaw
Copy link
Owner

arshaw commented Feb 9, 2011

Reported by [email protected], Aug 13, 2010

When using Scrapemark to get text from Swedish websites or sites that are not using utf-8 as content encoding, which is common here. Scrapemark removes all special characters (åäö), the text "Hjälp" becomes "Hjlp".

What steps will reproduce the problem?

  1. Use a url of a homepage with content-encoding iso-8859-1 for example this Swedish homepage http://www.asciitabell.se/
  2. Scrape the <title>{{ }}</title>

What is the expected output? What do you see instead?
The output is "ASCII-tabellen (8 bitars utkad ASCII, enligt ISO 8859-1)" the expected result would be "ASCII-tabellen (8 bitars utökad ASCII, enligt ISO 8859-1)" (notice the o with two dots in the middle :))

What version of the product are you using? On what operating system?
Version 0.9, Mac OSX Snow Leopard

Please provide any additional information below.
I wrote a patch that fixes this by simple looking at the return header, if the header includes iso-8859 the result is decoded and then encoded to utf8 before sent to other functions. This could possible be done more generic to work with different content-encodings other then iso-8859.
I wrote a patch that fixes this by simple looking at the return header, if the header includes iso-8859 the result is decoded and then encoded to utf8 before sent to other functions. This could possible be done more generic to work with different content-encodings other then iso-8859.

diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
 def _substitute_entity(m):
    ent = m.group(2)
    if m.group(1) == "#":
-       return unichr(int(ent))
+       # Hex value
+       if ent[0] == 'x':
+           return unichr(int(ent[1:], 16))
+       else:
+           return unichr(int(ent))
    else:
        cp = name2codepoint.get(ent)
        if cp:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant