When using Scrapemark to get text from Swedish websites, or from any site that does not use UTF-8 as its content encoding (which is common here), Scrapemark drops all special characters (åäö); the text "Hjälp" becomes "Hjlp".
What steps will reproduce the problem?
Use the URL of a page served with content encoding ISO-8859-1, for example this Swedish page: http://www.asciitabell.se/
Scrape the <title>{{ }}</title>
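In code, the reproduction looks roughly like this (a sketch assuming scrapemark's scrape() helper, and that a single anonymous {{ }} capture returns the matched text directly):

import scrapemark

# Reproduce the reported behaviour: scrape the <title> of a page that
# the server declares as iso-8859-1.
title = scrapemark.scrape('<title>{{ }}</title>',
                          url='http://www.asciitabell.se/')
print title  # prints "... utkad ..." instead of "... utökad ..."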
What is the expected output? What do you see instead?
The output is "ASCII-tabellen (8 bitars utkad ASCII, enligt ISO 8859-1)"; the expected result is "ASCII-tabellen (8 bitars utökad ASCII, enligt ISO 8859-1)" (note the "ö" in "utökad").
What version of the product are you using? On what operating system?
Version 0.9, Mac OS X Snow Leopard
Please provide any additional information below.
I wrote a patch that fixes this by simply looking at the response headers: if the header declares an iso-8859 charset, the result is decoded and then encoded to UTF-8 before being passed on to the other functions. This could probably be made more generic to work with content encodings other than iso-8859.
diff --git a/scrapemark.py b/scrapemark.py
index 7b4cf72..be0327c 100644
--- a/scrapemark.py
+++ b/scrapemark.py
@@ -530,7 +530,11 @@ def _decode_entities(s):
def _substitute_entity(m):
ent = m.group(2)
if m.group(1) == "#":
- return unichr(int(ent))
+ # Hex value
+ if ent[0] == 'x':
+ return unichr(int(ent[1:], 16))
+ else:
+ return unichr(int(ent))
else:
cp = name2codepoint.get(ent)
if cp:
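For reference, the header-based re-encoding described above could look roughly like the following. This is only a sketch, not the actual patch; the function name fetch_as_utf8 and the way the response is read are illustrative assumptions:

import urllib2

def fetch_as_utf8(url):
    """Fetch a page and, if the server declares an iso-8859 charset,
    transcode the body to UTF-8 before handing it to other functions."""
    response = urllib2.urlopen(url)
    html = response.read()
    # e.g. "text/html; charset=iso-8859-1"
    content_type = response.info().get('Content-Type', '')
    if 'iso-8859' in content_type.lower():
        charset = content_type.lower().split('charset=')[-1].strip()
        # Decode from the declared legacy encoding, then re-encode as
        # UTF-8 so downstream string handling sees consistent data.
        html = html.decode(charset, 'replace').encode('utf-8')
    return html

The 'replace' error handler keeps scraping working even if the page contains bytes that are invalid in the declared charset.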
Reported by [email protected], Aug 13, 2010