What to test with zimcheck and how ? #340

Open
mgautierfr opened this issue Mar 27, 2023 · 2 comments

mgautierfr commented Mar 27, 2023

This issue was created so that the interesting discussion about what to test, started in #339, does not grow too big and cannibalize the PR review.

Initially from @rgaudin:

I have mixed feelings about this. On one hand, this mainly highlights the shortcomings of such an approach but on the other hand, simple checks are better than none.

Couple of comments (already identified):

  • I don't think a language regexp makes sense. If you want to ensure validity, you can only work off a valid list IMO. Making sure that it's 3 lowercase letters (or a proper list of such) would already be beneficial.
  • A date regexp won't catch invalid dates such as 2023-13-01. Going the extra mile (as suggested) still would not guarantee a valid date in cases like Feb 29th… that would require a date dependency.
  • A min length for all those string metadata looks problematic. Max length as well, as long as the spec is silent about it. We could raise a warning above a certain threshold though.
  • I am not a fan of the “LongDescription shouldn't be shorter than Description” check because that's not in the spec.
  • “Description must be in the language of the ZIM file” is an interesting one that deserves a discussion and a decision. I am in favor of updating the spec on this to specify whatever we decide.
  • Illustrations are not checked despite it being the ticket's motto.
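To make the date point concrete, here is a minimal Python sketch (not zimcheck's C++ code) contrasting a shape-only regexp with a real calendar check. The function names are hypothetical; the calendar check relies on Python's standard `datetime` module, which plays the role of the "date dependency" mentioned above.

```python
import re
from datetime import date

# Shape only: four digits, two digits, two digits. Says nothing about validity.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def date_shape_ok(value: str) -> bool:
    """True if the value merely looks like YYYY-MM-DD."""
    return DATE_RE.fullmatch(value) is not None


def date_is_real(value: str) -> bool:
    """True only if the value is an actual calendar date."""
    if not date_shape_ok(value):
        return False
    try:
        date.fromisoformat(value)  # raises ValueError for month 13, Feb 29 in a non-leap year, ...
    except ValueError:
        return False
    return True
```

With this sketch, `date_shape_ok("2023-13-01")` passes while `date_is_real("2023-13-01")` and `date_is_real("2023-02-29")` both fail, which is the gap described in the second bullet.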

Once again we fall short on setting clear goals for our tools. zimcheck's description is “zimcheck checks the quality of a ZIM file.”. Does that mean that whenever zimcheck doesn't report an issue, the ZIM is guaranteed to be valid?

I join @kelson42 in thinking we want basic checks for now that we could extend in the future.

  • Simple date format regex
  • Simple format regexp for language
  • magic number check for Illustrations
  • len >= 1 for mandatory texts
  • len <= max for bounded texts

And that's it. The rest can be discussed and extended in separate tickets, raised by actual needs.
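As a rough illustration of how small that list of basic checks is, here is a Python sketch (again, not zimcheck's actual implementation). The metadata key names, the set of mandatory texts, and the length limits are assumptions chosen for the example and should be checked against the ZIM metadata spec; `basic_metadata_issues` is a hypothetical helper.

```python
import re

# Assumed names and limits, for illustration only; verify against the spec.
MANDATORY_TEXTS = ("Name", "Title", "Creator", "Publisher", "Description")
MAX_LENGTHS = {"Title": 30, "Description": 80}
ILLUSTRATION_KEY = "Illustration_48x48@1"

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")     # simple date format, shape only
LANG_RE = re.compile(r"[a-z]{3}(,[a-z]{3})*")  # one or more 3-letter lowercase codes
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"               # 8-byte PNG file signature


def basic_metadata_issues(metadata):
    """Return a list of issues; an empty list means every basic check passed."""
    issues = []
    for key in MANDATORY_TEXTS:  # len >= 1 for mandatory texts
        if len(metadata.get(key, "")) < 1:
            issues.append(f"{key}: missing or empty")
    for key, limit in MAX_LENGTHS.items():  # len <= max for bounded texts
        if len(metadata.get(key, "")) > limit:
            issues.append(f"{key}: longer than {limit} characters")
    if not DATE_RE.fullmatch(metadata.get("Date", "")):
        issues.append("Date: expected YYYY-MM-DD")
    if not LANG_RE.fullmatch(metadata.get("Language", "")):
        issues.append("Language: expected 3 lowercase letters (comma-separated if several)")
    if not metadata.get(ILLUSTRATION_KEY, b"").startswith(PNG_MAGIC):
        issues.append(f"{ILLUSTRATION_KEY}: not a PNG (bad magic number)")
    return issues
```

The magic number check only confirms the PNG signature; it does not verify the image size or that the file decodes, which is the kind of more elaborate check left for later tickets.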

Although it serves a different purpose, scraperlib now enforces correct metadata with more elaborate checks (actual language code, proper PNG with correct sizes, etc.), although this is not being used yet, so most of what we produce should be valid in this regard.

@mgautierfr (Collaborator, Author)

Then by @holta

Very thoughtful response from @rgaudin, and a big thank you to all :building_construction:

I strongly support organic and free-form metadata standards (what's needed are strong norms and strong guidelines not bureaucratic rules) that allow grassroots initiatives to collaborate & innovate efficiently.

In fact even semi-structured data sometimes has an extremely valuable place along the way — thereby empowering regional and specialized communities to build their own ZIM files, with the metadata that their region/profession/culture truly needs.

For this reason I very strongly support allowing "free-form metadata fields" that not only permit but encourage grassroots (not centralized) community innovation to truly flourish.

Then later on, as strong community norms are independently nurtured + demonstrated + proven year-by-year-by-year, the world should honor those great grassroots practices — as they become more official metadata standards.

Central authorities (Kiwix) should provide basic guardrails & guidelines of course, but that's sufficient +1

Thank you to everyone including @veloman-yunkan and @kelson42 and @mgautierfr working very hard on this critical question, helping it to evolve quickly in coming years, and every step of the way.

rgaudin changed the title from "What to test with zim-test and how ?" to "What to test with zimcheck and how ?" on Mar 27, 2023

@mgautierfr (Collaborator, Author)

This is an interesting question.

I mainly see two kinds of testing:

The first one groups all tests that are technically mandatory. The exact definition is subject to discussion, but at first glance, I would say that a failing test in this category would make libzim raise an exception at some point. I can think of:

  • All invalid data breaking the zim format (pointer pointing outside of the zim file, blob number greater than the number of blobs in the cluster, ...)
  • Wrong redirection (redirection pointing to nothing, or redirection loop)
  • Missing namespaces (C at least, others ?)
  • To a certain extent, missing metadata. (Libzim itself doesn't depend on them, but if libkiwix expects, for example, "Publisher" to be present, we will have an exception if it is not)
  • other ?
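The two redirection failures in this list (pointing to nothing, or looping) can be sketched in a few lines of Python. This is a simplified model, not libzim's data structures: redirects are assumed to be a plain mapping from path to target path, and `find_bad_redirects` is a hypothetical name.

```python
def find_bad_redirects(redirects: dict[str, str], content_paths: set[str]) -> dict[str, str]:
    """Map each faulty redirect entry to "loop" or "dangling".

    redirects:     path of a redirect entry -> path it points to
    content_paths: paths of real (non-redirect) entries
    """
    problems = {}
    for start in redirects:
        seen = {start}
        current = start
        # Follow the chain until it leaves the redirect table or revisits a path.
        while current in redirects:
            current = redirects[current]
            if current in seen:
                problems[start] = "loop"
                break
            seen.add(current)
        else:
            # Chain ended on a non-redirect path; it must be a real entry.
            if current not in content_paths:
                problems[start] = "dangling"
    return problems
```

Note that an entry which redirects into a loop without being part of it is also reported as "loop", since following it never terminates either.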

In the second group, I would put all other tests that may be good to have (for better quality) but not mandatory:

  • Metadata length : An overly long (or empty) metadata value may break the display style, make search too complex, ... but it will not "crash" the software.
  • Wrong tags or lang : Same here, a wrong tag/lang will only mix up the UI, but the zim file is still "valid".
  • Missing checksum ?
  • other ?

I would say that failures in the first group are errors, while those in the second group are warnings.
But nothing prevents us from having an option such as -Werror to treat all warnings as errors when the user wants to be pedantic (for example us, in zimfarm)
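A minimal Python sketch of that error/warning split could look as follows. The `-Werror` behaviour is the proposal from this comment, not an existing zimcheck flag as far as this thread shows, and `Severity`, `Finding` and `exit_code` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # second group: quality issues, file still usable
    ERROR = "error"      # first group: technically mandatory


@dataclass(frozen=True)
class Finding:
    check: str
    severity: Severity
    message: str


def exit_code(findings, warnings_as_errors: bool = False) -> int:
    """Return 1 if the run should fail, 0 otherwise.

    warnings_as_errors models the proposed -Werror option.
    """
    for finding in findings:
        if finding.severity is Severity.ERROR:
            return 1
        if warnings_as_errors and finding.severity is Severity.WARNING:
            return 1
    return 0
```

Under this model a file with only an overly long Title passes by default but fails when warnings are promoted, which is the pedantic mode suggested for zimfarm.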
