Nested loops are broken in scrapemark 0.9 #6

arshaw · 2011-02-09T05:09:23Z

Reported by [email protected], Dec 26, 2009

What steps will reproduce the problem?

Run the nested loops example from
http://arshaw.com/scrapemark/docs/examples/

What is the expected output? What do you see instead?

Expected output is:
{'days': [{'number': 1, 'points': [5.6, 24.5]},
{'number': 2, 'points': [1.1, 12.8]},
{'number': 3, 'points': [2.4, 5.67]}]}

Instead, you get:
{'days': [{'points': [5.6], 'number': 1},
{'points': [24.5]},
{'points': [1.1], 'number': 2},
{'points': [12.8]},
{'points': [2.4], 'number': 3},
{'points': [5.67], 'number': 0}]}
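
For anyone who wants to check a result programmatically, here is a minimal sketch that flags malformed day entries. The helper name `malformed_days` and its `points_per_day` parameter are made up for illustration and are not part of scrapemark; the two dicts are copied from the outputs above.

```python
# Outputs copied from the report above.
expected = {'days': [{'number': 1, 'points': [5.6, 24.5]},
                     {'number': 2, 'points': [1.1, 12.8]},
                     {'number': 3, 'points': [2.4, 5.67]}]}

actual = {'days': [{'points': [5.6], 'number': 1},
                   {'points': [24.5]},
                   {'points': [1.1], 'number': 2},
                   {'points': [12.8]},
                   {'points': [2.4], 'number': 3},
                   {'points': [5.67], 'number': 0}]}


def malformed_days(result, points_per_day=2):
    """Return indexes of 'days' entries missing a number or some points.

    Hypothetical helper: points_per_day=2 matches the example table only.
    """
    bad = []
    for index, day in enumerate(result.get('days', [])):
        if 'number' not in day or len(day.get('points', [])) != points_per_day:
            bad.append(index)
    return bad


print(malformed_days(expected))
print(malformed_days(actual))
```

With the broken output every entry is flagged, because each inner-loop match ends up in its own `days` entry with a single point.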

What version of the product are you using? On what operating system?

v0.9

Please provide any additional information below.

This is a regression from scrapemark.py r2, which works fine.

mtaran · 2011-03-27T06:00:52Z

scrapemark would be the absolute best template-based HTML scraper if it weren't for this bug! I really hope you get a chance to fix it soon. I tried my hand at it, but changing _merge_captures alone didn't seem to be enough: it looks like it gets called both when the master and slave dicts should be fully merged and when they shouldn't be.
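
To illustrate the two situations described above, here is a rough sketch of the distinction. It is not scrapemark's actual code: `merge_into_current` and `append_iteration` are hypothetical names, and the real `_merge_captures` may be structured quite differently. The idea is only that an inner-loop match should be fully merged into the current outer item, while a finished outer iteration should be appended as a new item.

```python
def merge_into_current(master, slave):
    """Fully merge slave into master (hypothetical helper).

    Lists under the same key are extended, nested dicts are merged
    recursively, and anything else is overwritten.
    """
    for key, value in slave.items():
        if isinstance(value, list) and isinstance(master.get(key), list):
            master[key].extend(value)
        elif isinstance(value, dict) and isinstance(master.get(key), dict):
            merge_into_current(master[key], value)
        else:
            master[key] = value
    return master


def append_iteration(master, list_key, item):
    """Add one completed outer-loop item as a new list element."""
    master.setdefault(list_key, []).append(item)
    return master


# Simulate the nested-loops example with the values from the report.
rows = [(1, [5.6, 24.5]), (2, [1.1, 12.8]), (3, [2.4, 5.67])]
result = {}
for number, points in rows:
    day = {'number': number}
    for point in points:
        # Inner loop: each match belongs to the SAME day, so merge fully.
        merge_into_current(day, {'points': [point]})
    # Outer loop: each row is a NEW day, so append instead of merging.
    append_iteration(result, 'days', day)

print(result)
```

The broken output looks like what you would get if the inner-loop matches went through the "append" path instead of the "merge" path.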

I also tried modifying the examples you had into doctest-compatible docstrings, like so:
'''

Scrape some text:

>>> scrape("""
...    <title>:: {{ page_title }}</title>
...    """,
...    html)
{'page_title': u'The Page Title'}


Scrape some text (quick version):

>>> scrape("""
...    <title>:: {{ }}</title>
...    """,
...    html)
u'The Page Title'

Loop over certain divs, scrape a list:

>>> scrape("""
...    <body>
...    {*
...        <div class='section' id='{{ [section_ids] }}' />
...    *}
...    </body>
...    """,
...    html)
{'section_ids': [u'content', u'footer']}


Scrape text before a certain element:

>>> scrape("""
...    <div id='content'>
...    {{ before_table }}
...    <table />
...    </div>
...    """,
...    html)
{'before_table': u'Look at these data points'}


Scrape a column from a table (as a list of ints):

>>> scrape("""
...    <table>
...    <tr />
...    {*
...        <tr>
...        <td>{{ [day_numbers]|int }}</td>
...        </tr>
...    *}
...    </table>
...    """,
...    html)
{'day_numbers': [1, 2, 3]}

Scrape the entire table with nested loops and dot-notation:

>>> scrape("""
...    <table>
...    <tr />
...    {*
...        <tr>
...        <td>{{ [days].number|int }}</td>
...        {*
...            <td>{{ [days].[points]|float }}</td>
...        *}
...        </tr>
...    *}
...    </table>
...    """,
...    html)
{'days': [{'number': 1, 'points': [1.0, 1.5]},
          {'number': 2, 'points': [2.0, 2.5]},
          {'number': 3, 'points': [3.0, 3.5]}]}

'''

which would hopefully make it easier to do regression tests...
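
As a sketch of how such docstrings could be run as a regression check, the snippet below uses only the standard `doctest` module. The function `nested_result` is a stand-in made up for this example (it just returns a fixed dict rather than calling `scrape`), and `run_docstring` is a hypothetical helper. It assumes Python 3.7+, where dict reprs follow insertion order; on older interpreters the key order in the printed dict is not guaranteed, so the comparison could fail for reasons unrelated to the bug. `NORMALIZE_WHITESPACE` is what allows the expected output to be wrapped over several lines.

```python
import doctest


def nested_result():
    """Stand-in for a scrape() call; returns a fixed nested structure.

    >>> nested_result()
    {'days': [{'number': 1, 'points': [5.6, 24.5]},
              {'number': 2, 'points': [1.1, 12.8]}]}
    """
    return {'days': [{'number': 1, 'points': [5.6, 24.5]},
                     {'number': 2, 'points': [1.1, 12.8]}]}


def run_docstring(func):
    """Run the doctest examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # globs gives the examples access to the function under test.
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False,
                                   optionflags=doctest.NORMALIZE_WHITESPACE)
    runner.run(test)
    return runner.summarize(verbose=False)


results = run_docstring(nested_result)
print(results.failed, results.attempted)
```

For a whole module of docstrings, `doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)` does the same thing in one call.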

Anyways, I'd be really happy if you could get this fixed sometime :D

Tell me if there's anything I could do to help!
