Lesserwrong.com page cannot be parsed #385

tinloaf · 2018-01-09T09:56:34Z

Hi,

I'd like to parse pages from www.lesserwrong.com. I've tried creating a site config based on this page:
https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

This is what my site config looks like:

title: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-title ')]//h1
body: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-html ')]
date: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-metadata-date ')]
author: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-author ')]
test_url: https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

As far as I can tell, these XPaths all point to the correct elements inside that page. However, the tool at https://f43.me/feed/test still fails to parse the page. Did I mess up the site config, or is this a bug in the parser (and if so, is this the right repository to report such a bug)?
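
For reference, the `contains(concat(' ',normalize-space(@class),' '),' name ')` idiom used in the config above matches an element when `name` is one of the whitespace-separated tokens in its class attribute. A minimal sketch of that check in Python, standard library only, might look like the following. The names `ClassTokenFinder` and `find_by_class` are made up for illustration, and this is not how the parser behind f43.me evaluates the config; it only helps to check whether a given class name appears in a saved copy of the HTML.

```python
from html.parser import HTMLParser


class ClassTokenFinder(HTMLParser):
    """Collect start tags whose class attribute contains a given token."""

    def __init__(self, token):
        super().__init__()
        self.token = token
        self.matches = []  # tag names of matching elements, in document order

    def handle_starttag(self, tag, attrs):
        # A missing or valueless class attribute is treated as empty.
        class_value = dict(attrs).get("class") or ""
        # split() with no arguments splits on runs of whitespace, which
        # approximates normalize-space() plus the padded ' token ' comparison.
        if self.token in class_value.split():
            self.matches.append(tag)


def find_by_class(html, token):
    """Return the tag names of all elements carrying the class token."""
    finder = ClassTokenFinder(token)
    finder.feed(html)
    return finder.matches
```

Calling `find_by_class(saved_html, "posts-page-content-body-html")` on the HTML the parser actually received should return a non-empty list if the body element is present. An empty list would suggest the parser was served a different page than the one seen in the browser.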


j0k3r · 2018-01-09T10:07:37Z

I think the problem isn't on your side but with lesserwrong.com, which uses Cloudflare, so I guess it might be the same issue as wallabag/wallabag#1399 (comment)

tinloaf · 2018-01-09T10:34:55Z

That might very well be it. Is there a way of seeing the HTML that the parser sees? Then I could verify that it's in fact the Cloudflare anti-bot page.

j0k3r · 2018-01-09T10:36:38Z

Without going into the code of wallabag/graby, no, you can't.
To see it, find this file and var_dump() the $html there: https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L203
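
As a rough approximation from outside graby, one could also fetch the page with a small script and look for signs of a Cloudflare challenge page. The sketch below is Python with the standard library only. The marker strings, the status codes, and the names `fetch` and `looks_like_challenge` are assumptions for illustration, not a documented Cloudflare interface. The response such a script receives may also differ from what graby receives, since the user agent and IP address differ.

```python
import urllib.error
import urllib.request

# Heuristic markers only: strings assumed to appear on Cloudflare's
# "checking your browser" page. They are not a stable or documented API.
CHALLENGE_MARKERS = (
    "Checking your browser before accessing",
    "cf-browser-verification",
    "jschl_vc",
)


def fetch(url, user_agent="Mozilla/5.0"):
    """Return (status, html) for url, including for HTTP error responses."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx, but the error object still carries the body.
        return error.code, error.read().decode("utf-8", "replace")


def looks_like_challenge(status, html):
    """Guess whether a response is an anti-bot page rather than the article."""
    # Assumption: the challenge is served with a 403 or 503 status.
    return status in (403, 503) and any(m in html for m in CHALLENGE_MARKERS)


# Example usage (performs a network request):
# status, html = fetch("https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality")
# print(status, looks_like_challenge(status, html))
```

A `True` result would be consistent with the Cloudflare explanation, but only dumping `$html` inside graby shows exactly what the parser saw.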
