Exception while parsing things like '<a href="">Some text</a>' #8

Open
arshaw opened this issue Feb 9, 2011 · 1 comment

arshaw commented Feb 9, 2011

Reported by [email protected], Apr 27, 2010

** What steps will reproduce the problem?

At the Python console, type
import scrapemark
scrapemark.scrape(
    '{* <a href="{{ [links].url }}">{{ [links].title }}</a> *}',
    html = '<a href="">Some text</a>'
)

** What is the expected output? What do you see instead?

Expected:
{'links': [{'title': u'Some text', 'url': u''}]}

Actual:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "scrapemark.py", line 35, in scrape
    return pattern.scrape(html, url, get, post, headers, cookie_jar)
  File "scrapemark.py", line 93, in scrape
    if _match(self._nodes, _remove_comments(html), 0, captures, url, cookie_jar) == -1:
  File "scrapemark.py", line 370, in _match
    if not _run_special_nodes(special, html[i:], captures, base_url, cookie_jar):
  File "scrapemark.py", line 391, in _run_special_nodes
    if not _run_special_node(node, s, captures, base_url, cookie_jar):
  File "scrapemark.py", line 403, in _run_special_node
    i = _match(node[1], s, i, nested_captures, base_url, cookie_jar)
  File "scrapemark.py", line 350, in _match
    attrs_matched = _match_attrs(node[4], attrs, nested_captures, base_url, cookie_jar)
  File "scrapemark.py", line 379, in _match_attrs
    m = attr_node[0].match(attrs[name])
TypeError: expected string or buffer

** What version of the product are you using? On what operating system?

scrapemark-0.9, from the source distribution
Mac OS X Version 10.6.3
Python 2.6.1 (r261:67515, Feb 11 2010, 00:51:29)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin

** Please provide any additional information below.

Below is a workaround:

diff -ub scrapemark.py.orig scrapemark.py
--- scrapemark.py.orig  2010-04-28 01:00:58.000000000 -0400
+++ scrapemark.py       2010-04-28 00:59:03.000000000 -0400
@@ -541,7 +541,10 @@
 def _parse_attrs(s):
        attrs = {}
        for m in _attr_re.finditer(s):
-               attrs[m.group(1)] = m.group(3) or m.group(4)
+               value = m.group(3)
+               if value is None:
+                       value = m.group(4)
+               attrs[m.group(1)] = value
        return attrs

 def _next_tag(s, i, tag_open_re, tag_close_re, depth=1): # returns (tag body, substring index after tag)

@timClicks

This seems to happen with any empty attribute.
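
For anyone wondering why an empty attribute triggers this, here is a minimal sketch of the likely cause. It assumes the attribute regex captures a double-quoted value in group 3 and a single-quoted value in group 4. The real `_attr_re` is not shown in this issue, so `ATTR_RE` and the two `parse_attrs_*` helpers below are hypothetical stand-ins, not scrapemark's actual code. An empty string is falsy in Python, so `m.group(3) or m.group(4)` skips the empty match and returns `None`, which the later `.match()` call then rejects.

```python
import re

# Hypothetical stand-in for scrapemark's _attr_re (the real pattern is not shown here).
# Group 1: attribute name, group 3: double-quoted value, group 4: single-quoted value.
ATTR_RE = re.compile(r'''([\w-]+)\s*=\s*("([^"]*)"|'([^']*)')''')


def parse_attrs_buggy(s):
    """Mirrors the original line: `or` treats an empty string as missing."""
    attrs = {}
    for m in ATTR_RE.finditer(s):
        # '' is falsy, so an empty double-quoted value falls through
        # to group 4, which did not participate in the match (None).
        attrs[m.group(1)] = m.group(3) or m.group(4)
    return attrs


def parse_attrs_fixed(s):
    """Mirrors the workaround: only fall back when group 3 is really None."""
    attrs = {}
    for m in ATTR_RE.finditer(s):
        value = m.group(3)
        if value is None:
            value = m.group(4)
        attrs[m.group(1)] = value
    return attrs


print(parse_attrs_buggy('href=""'))  # the value comes back as None
print(parse_attrs_fixed('href=""'))  # the value comes back as an empty string
```

Under these assumptions, the explicit `is None` check in the workaround keeps the empty string, so the attribute pattern receives a string to match against instead of `None`.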
