Invalid internal link error raised for template within <script> tag #149
Comments
@satyamtg thank you for reporting this issue. Can you please provide the problematic zimfile? |
@MiguelRocha you can find a ZIM here - https://farm.openzim.org/pipeline/5f3cf02c5684802b46244c2a |
For future reference: phzh_core-english-one_en_2020-08.zim |
@rgaudin @MiguelRocha @satyamtg It seems difficult from the
|
@kelson42 this is scraped directly from the instance during fetching of all CSS/JS (i.e. all |
How is this feature handled in zimcheck? Is this through an actual DOM parser or via regexp? Might be real tricky if the latter. |
@rgaudin This is a regex parser |
Agree with @rgaudin, here the core of the problem is that we have a regex based parser and we should have a DOM based. |
@maneeshpm Might be a good candidate for you, replacing the functions which retrieve the link with a DOM (pugixml) parser. |
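To illustrate the regex versus parser point made above, here is a minimal sketch. It is not zimcheck's code (zimcheck is C++); Python's standard `html.parser` stands in for a tag-aware parser, and the page content, the `<%= image_url %>` template syntax, the regex pattern, and the helper names `links_by_regex` / `links_by_parser` are all hypothetical. It only shows why a pattern match over raw text reports a template placeholder inside `<script>` as a link, while a parser that treats script content as raw text does not.

```python
import re
from html.parser import HTMLParser

# Hypothetical page: the <img> inside <script> is template text, not markup.
PAGE = """
<html><body>
<a href="real/page.html">A real link</a>
<script type="text/template" id="tpl">
  <img src="<%= image_url %>" alt="">
</script>
</body></html>
"""

# Illustrative pattern only; this is an assumption, not zimcheck's actual regex.
LINK_RE = re.compile(r'(?:href|src)\s*=\s*"([^"]*)"')


def links_by_regex(html):
    """Collect href/src values by scanning the raw text, ignoring structure."""
    return LINK_RE.findall(html)


class LinkCollector(HTMLParser):
    """Collect href/src values only from tags the parser actually recognises."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value is not None:
                self.links.append(value)


def links_by_parser(html):
    """Parse the page; <script> content is handed over as raw data, not tags."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("regex :", links_by_regex(PAGE))
    print("parser:", links_by_parser(PAGE))
```

In this sketch the regex version picks up the template placeholder as if it were a link, while the parser version only sees the real `<a href>`. Whether a given C++ parser behaves the same way depends on how it handles `<script>` content.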
Thanks @kelson42! Looking into this issue. |
@maneeshpm Thank you very much. If you have other tasks ongoing, please try to finish them first. |
@kelson42 Sure, they are almost done (pending final review and merge). I will start this as soon as they are closed. |
@kelson42 To make sure I understand the issue correctly, the expected behavior is to extract all |
@kelson42 Parsing with
If we can get rid of these, pugixml works well in tree_walker traverse mode. Exploring workarounds. |
@maneeshpm This is a really pertinent remark. pugixml indeed does not seem to be the proper tool. Not sure for the moment how to proceed. |
Got this problem again with a different, simpler use case. When using Actually, AFAIK, zimcheck raises also for |
Gumbo (https://github.com/google/gumbo-parser) is probably a better parser for HTML. |
This looks interesting; an HTML5-specific parser like Gumbo is better suited to our needs than an XML-based parser like pugixml. |
This ticket is clearly blocked by #331 |
zimcheck raises Invalid internal link errors for an image link that is actually a template inside a <script> tag in PHZH ZIM files created by openedx2zim. The error looks something like this:
The script tag containing this template is as follows:
This results in many false errors, as this template is present in nearly all HTML files in PHZH ZIMs.