Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid internal link error raised for template within <script> tag #149

Open
satyamtg opened this issue Aug 19, 2020 · 19 comments
Open

Invalid internal link error raised for template within <script> tag #149

satyamtg opened this issue Aug 19, 2020 · 19 comments

Comments

@satyamtg
Copy link

zimcheck raises Invalid internal link errors as follows for a image link which is actually a template inside a <script> tag in PHZH ZIM files created by openedx2zim. The error looks something like this -

- <%= largeSRC %>
(A/course/core-english-01/topic-5-my-feelings-and-myself/unit-1-dealing-with-different-emotions/step-2/</largeSRC /) were not found in article A/course/core-english-01/topic-5-my-feelings-and-myself/unit-1-dealing-with-different-emotions/step-2/index.html

The script tag within which this template is as follows -

<script id="image-modal-tpl" type="text/template">
    <div class="wrapper-modal wrapper-modal-image">
  <section class="image-link">
    <%= smallHTML%>
    <a href="#" class="modal-ui-icon action-fullscreen" role="button">
      <span class="label">
        <span class="icon fa fa-arrows-alt fa-large" aria-hidden="true"></span> <%- gettext("Fullscreen") %>
      </span>
    </a>
  </section>

  <section class="image-modal">
    <section class="image-content">
      <div class="image-wrapper">
        <img alt="<%= largeALT %>, <%- gettext('Large') %>" src="<%= largeSRC %>" />
      </div>

      <a href="#" class="modal-ui-icon action-close" role="button">
        <span class="label">
          <span class="icon fa fa-remove fa-large" aria-hidden="true"></span> <%- gettext("Close") %>
        </span>
      </a>

      <ul class="image-controls">
        <li class="image-control">
          <a href="#" class="modal-ui-icon action-zoom-in" role="button">
            <span class="label">
              <span class="icon fa fa fa-search-plus fa-large" aria-hidden="true"></span> <%- gettext("Zoom In") %>
            </span>
          </a>
        </li>

        <li class="image-control">
          <a href="#" class="modal-ui-icon action-zoom-out is-disabled" aria-disabled="true" role="button">
            <span class="label">
              <span class="icon fa fa fa-search-minus fa-large" aria-hidden="true"></span> <%- gettext("Zoom Out") %>
            </span>
          </a>
        </li>
      </ul>
    </section>
  </section>
</div>

</script>

This results in many false errors as this is present in nearly all HTML files in PHZH ZIMs

@MiguelRocha
Copy link
Contributor

@satyamtg thank you for reporting this issue. Can you please provide the problematic zimfile?

@kelson42 kelson42 changed the title Invalid innternal link error raised for template within <script> tag Invalid internal link error raised for template within <script> tag Aug 19, 2020
@satyamtg
Copy link
Author

@MiguelRocha
Copy link
Contributor

For future reference: phzh_core-english-one_en_2020-08.zim

@kelson42
Copy link
Contributor

@rgaudin @MiguelRocha @satyamtg It seems difficult from the zimcheck perspective to easily make the difference beetween a template and a "noirmal" HTML page. IMO We are missing here some kind of meta information or something else is wrong. I wonder for example:

  • Is this a js template?
  • why we have template at all here?
  • why this is in considered as a normal HTML article? in A/? With the std text/html mime-type?

@satyamtg
Copy link
Author

@rgaudin @MiguelRocha @satyamtg It seems difficult from the zimcheck perspective to easily make the difference beetween a template and a "noirmal" HTML page. IMO We are missing here some kind of meta information or something else is wrong. I wonder for example:

  • Is this a js template?
  • why we have template at all here?
  • why this is in considered as a normal HTML article? in A/? With the std text/html mime-type?

@kelson42 this is scraped directltly from the instance during fetching of all CSS/JS (i.e. all script, style and link (with rel="stylesheet") tags from the HTML) to apply the instance style on the offline version. It's considered as a normal HTML and is in the A/ namespace because its written within the HTML in a <script> tag, and is not in a separate file.

@rgaudin
Copy link
Member

rgaudin commented Aug 28, 2020

How is this feature handled in zimcheck? Is this through an actual DOM parser or via regexp? Might be real tricky if the latter.

@kelson42
Copy link
Contributor

@rgaudin This is a regex parser

@kelson42 kelson42 pinned this issue Nov 17, 2020
@kelson42
Copy link
Contributor

Agree with @rgaudin, here the core of the problem is that we have a regex based parser and we should have a DOM based.

@kelson42
Copy link
Contributor

kelson42 commented Jan 29, 2021

@maneeshpm Might be a good candidate for you, replacing the functions which retrieve the link with a DOM (pugixml) parser.

@maneeshpm
Copy link
Contributor

Thanks @kelson42! Looking into this issue.

@kelson42
Copy link
Contributor

@maneeshpm Thank you very much. Ifyou have other tasks ongoing, please try to finish them first.

@maneeshpm
Copy link
Contributor

@kelson42 Sure, they are almost done(pending final review and merge). I will start this as soon as they are closed.

@maneeshpm
Copy link
Contributor

maneeshpm commented Jan 31, 2021

@kelson42 To make sure I understand the issue correctly, the expected behavior is to extract all src and href attributes. The current function generic_getLinks(), while doing this also gathers some src inside the script tags which causes failure in zimcheck.
So we want to rewrite this function using pugixml such that all src and href attributes of nodes that are not child of a script node(but can be an attribute of script node itself) are collected.
Is this correct?

@maneeshpm
Copy link
Contributor

@kelson42 Parsing with pugixml also causes some errors. Since the file we are trying to parse is not exactly XML, some elements inside the <script> node causes this failure. Those elements are

  • non closed tags: <%= smallHTML%>, <%- gettext("Fullscreen") %>
  • invalid attributes:
    "dismiss": '<a aria-label="dismiss cookie message" id="dismiss" role=button tabindex="2" class="cc-btn cc-dismiss:focus">{{dismiss}}</a>', button should be inside quotes.

If we can get rid of these, pugixml works well in tree_walker traverse mode. Exploring workarounds.

@kelson42
Copy link
Contributor

kelson42 commented Feb 4, 2021

@maneeshpm This is a really pertinent remark. pugixml seems indeed not the properly tool. Not sure for the moment how to proceed.

@rgaudin
Copy link
Member

rgaudin commented Jul 1, 2021

Got this problem again with a different, simpler use case.
The ZIM is at http://tmp.kiwix.org/beer_meta.zim

When using zimcheck -U on it, it will raise a Not Found Article for static/css/mobile.css. You shouldn't worry about the rest of the errors nor the reason it's not finding the article. The fact that this is mentioned only within a <script /> is the problem here.

Actually, AFAIK, zimcheck raises also for <blockquote />, <pre /> and <code /> all of those should be excluded.

@mgautierfr
Copy link
Collaborator

@maneeshpm This is a really pertinent remark. pugixml seems indeed not the properly tool. Not sure for the moment how to proceed.

Gumbo (https://github.com/google/gumbo-parser) is probably a better parser to parse html.
Fortunately, it is already a dependencies of zim-tools (through zimwriterfs), so we just have to use it :)

@maneeshpm
Copy link
Contributor

This looks interesting, an html5 specific parser like Gumbo is more suited for our need than depending on an XML based parser like Pugixml.

@kelson42 kelson42 unpinned this issue Feb 16, 2023
@kelson42
Copy link
Contributor

This ticket is clearly blocked by #331

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants