Lesserwrong.com page cannot be parsed #385

Open
tinloaf opened this issue Jan 9, 2018 · 3 comments

@tinloaf
Contributor

tinloaf commented Jan 9, 2018

Hi,

I'd like to parse pages from www.lesserwrong.com. I've tried creating a site config based on this page:
https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

This is what my site config looks like:

title: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-title ')]//h1
body: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-html ')]
date: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-metadata-date ')]
author: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-author ')]
test_url: https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality
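The `contains(concat(' ', normalize-space(@class), ' '), ' name ')` idiom used in the config above matches a whole class token rather than a substring. A minimal Python sketch of that predicate, for illustration only: `has_class` and `naive_contains` are hypothetical helper names, not part of graby or any site-config tooling.

```python
import re


def has_class(class_attr: str, name: str) -> bool:
    """Emulate contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    # normalize-space() trims the ends and collapses runs of XPath whitespace
    # (space, tab, CR, LF) into single spaces.
    tokens = [t for t in re.split(r"[ \t\r\n]+", class_attr) if t]
    normalized = " ".join(tokens)
    # Padding both sides with a space lets every token be matched as ' token '.
    return f" {name} " in f" {normalized} "


def naive_contains(class_attr: str, name: str) -> bool:
    """Emulate the simpler contains(@class, 'name'), which matches substrings."""
    return name in class_attr


target = "posts-page-content-body-html"

# Exact token, surrounded by other classes and irregular whitespace: matches.
print(has_class("  foo\n posts-page-content-body-html  bar", target))
# A longer class name that merely starts with the target: no match...
print(has_class("posts-page-content-body-html-wrapper", target))
# ...whereas the naive substring test would wrongly accept it.
print(naive_contains("posts-page-content-body-html-wrapper", target))
```

This is why the longer expression is usually preferred over a plain `contains(@class, '...')` when class names share prefixes, as the `posts-page-content-*` names here do.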

As far as I can tell, these XPaths all point to the correct elements on that page. However, the tool at https://f43.me/feed/test still fails to parse the page. Did I mess up the site config, or is this a bug in the parser? If it is a bug, is this the right repository to report it?

@j0k3r
Collaborator

j0k3r commented Jan 9, 2018

I think the problem isn't on your side but with lesserwrong.com, which uses Cloudflare, so I guess it might be the same issue as wallabag/wallabag#1399 (comment)

@tinloaf
Contributor Author

tinloaf commented Jan 9, 2018

That might very well be it. Is there a way of seeing the HTML that the parser sees? Then I could verify that it's in fact the Cloudflare anti-bot page.

@j0k3r
Collaborator

j0k3r commented Jan 9, 2018

Not without going into the code of wallabag/graby.
Find that file and var_dump() the $html: https://github.com/j0k3r/graby/blob/master/src/Extractor/ContentExtractor.php#L203
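As a rough check from outside the parser, one could fetch the page separately and look for signs of a Cloudflare interstitial. The sketch below is a heuristic under stated assumptions: the marker strings are guesses based on what Cloudflare challenge pages have typically contained, `fetch_html` and `looks_like_cloudflare_challenge` are hypothetical names, and the request headers here are not the ones graby sends, so the response may differ from what the parser actually receives.

```python
import urllib.error
import urllib.request

# Assumed markers of a Cloudflare challenge page; not an official or complete list.
CHALLENGE_MARKERS = (
    "just a moment...",
    "checking your browser before accessing",
    "cf-browser-verification",
)


def fetch_html(url: str, user_agent: str = "Mozilla/5.0 (compatible; debug-fetch)") -> str:
    """Fetch a page body, also returning the body of HTTP error responses."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A challenge page may arrive with an error status; its body is still readable.
        return err.read().decode("utf-8", errors="replace")


def looks_like_cloudflare_challenge(html: str) -> bool:
    """Return True if the HTML contains any of the assumed challenge markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

For example, `looks_like_cloudflare_challenge(fetch_html("https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality"))` would give a first hint, though dumping `$html` inside graby as described above remains the only way to see exactly what the parser received.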
