What to test with zimcheck and how ? #340

mgautierfr · 2023-03-27T15:40:14Z

This issue is created so that the interesting discussion about what to test, started in #339, does not become too big and cannibalize the PR review.

Initially from @rgaudin:

I have mixed feelings about this. On one hand, this mainly highlights the shortcomings of such an approach but on the other hand, simple checks are better than none.

Couple of comments (already identified):

I don't think a language regexp makes sense. If you want to ensure validity, you can only work off a valid list IMO. Making sure that it's 3 lowercase letters (or a proper list of such) would already be beneficial.
The date regexp won't reject invalid dates such as 2023-13-01. Going the extra mile (as suggested) would still not guarantee a valid date in cases like Feb 29th… that would require a date dependency.
A min length for all those string metadata looks problematic. So does a max length, as long as the spec is silent about it. We could raise a warning above a certain threshold though.
I am not a fan of the “LongDescription shouldn't be shorter than Description” check because that's not in the spec.
“Description must be in the language of the ZIM file” is an interesting one that deserves a discussion and a decision. I am in favor of updating the spec to specify whatever we decide.
Illustrations are not checked despite it being the ticket's motto.
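
To illustrate the date point above, here is a minimal Python sketch. It assumes the date is written as YYYY-MM-DD (as the 2023-13-01 example suggests); the names `DATE_RE`, `matches_date_format` and `is_real_date` are made up for illustration and are not part of zimcheck or scraperlib. The regexp alone accepts an impossible month, while a calendar-aware parser (here Python's standard `datetime`, standing in for the "date dependency") rejects it, along with Feb 29th in a non-leap year.

```python
import re
from datetime import date

# Format-only check: four digits, two digits, two digits (assumed YYYY-MM-DD).
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_date_format(value: str) -> bool:
    """Return True if the string merely looks like YYYY-MM-DD."""
    return DATE_RE.fullmatch(value) is not None


def is_real_date(value: str) -> bool:
    """Return True only if the string is an existing calendar date."""
    if not matches_date_format(value):
        return False
    try:
        date.fromisoformat(value)  # raises ValueError for month 13, Feb 29 2023, ...
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("2023-03-27", "2023-13-01", "2023-02-29", "2024-02-29"):
        print(candidate, matches_date_format(candidate), is_real_date(candidate))
```

The format check is cheap and dependency-free; the second function shows what the "extra mile" would cost in a language where calendar logic is not built in.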

Once again we fall short on setting clear goals for our tools. zimcheck's description is “zimcheck checks the quality of a ZIM file.” Does that mean that whenever zimcheck doesn't report an issue, the ZIM is guaranteed to be valid?

I join @kelson42 in thinking we want basic checks for now that we could extend in the future.

Simple date format regex
Simple format regexp for language
magic number check for Illustrations
len >= 1 for mandatory texts
len <= max for bounded texts
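
The five basic checks listed above could look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the metadata names in `MANDATORY_TEXTS`, the limits in `MAX_LENGTHS`, the comma-separated form of a language list, and the function name `basic_checks`. It only shows how small such checks are, not how zimcheck implements them.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")       # simple date format
LANG_RE = re.compile(r"[a-z]{3}(,[a-z]{3})*")    # 3 lowercase letters, or a list of such
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"                 # the 8-byte PNG signature

# Hypothetical values, chosen only for this example.
MANDATORY_TEXTS = ("Title", "Description")
MAX_LENGTHS = {"Title": 30, "Description": 80}


def basic_checks(texts: dict[str, str], illustrations: dict[str, bytes]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if "Date" in texts and not DATE_RE.fullmatch(texts["Date"]):
        problems.append("Date: expected YYYY-MM-DD")
    if "Language" in texts and not LANG_RE.fullmatch(texts["Language"]):
        problems.append("Language: expected 3 lowercase letters (or a list of such)")
    for name, data in illustrations.items():
        if not data.startswith(PNG_MAGIC):
            problems.append(f"{name}: bad magic number, not a PNG")
    for name in MANDATORY_TEXTS:
        if len(texts.get(name, "")) < 1:
            problems.append(f"{name}: must not be empty")
    for name, limit in MAX_LENGTHS.items():
        if len(texts.get(name, "")) > limit:
            problems.append(f"{name}: longer than {limit} characters")
    return problems
```

Note that the language check only validates the shape of the code, not that the code exists, which matches the "work off a valid list" caveat above.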

And that's it. The rest can be discussed and extended in separate tickets, raised by actual needs.

Although it serves a different purpose, scraperlib now (not being used yet) enforces correct metadata with more elaborate checks (actual language code, proper PNG with correct sizes, etc.) so most of what we produce should be valid in this regard.

mgautierfr · 2023-03-27T15:40:50Z

Then by @holta

Very thoughtful response from @rgaudin, and a big thank you to all :building_construction:

I strongly support organic and free-form metadata standards (what's needed are strong norms and strong guidelines not bureaucratic rules) that allow grassroots initiatives to collaborate & innovate efficiently.

In fact even semi-structured data sometimes has an extremely valuable place along the way — thereby empowering regional and specialized communities to build their own ZIM files, with the metadata that their region/profession/culture truly needs.

For this reason I very strongly support allowing "free-form metadata fields" that not only permit but encourage grassroots (not centralized) community innovation to truly flourish.

Then later on, as strong community norms are independently nurtured + demonstrated + proven year-by-year-by-year, the world should honor those great grassroots practices — as they become more official metadata standards.

Central authorities (Kiwix) should provide basic guardrails & guidelines of course, but that's sufficient +1

Thank you to everyone including @veloman-yunkan and @kelson42 and @mgautierfr working very hard on this critical question, helping it to evolve quickly in coming years, and every step of the way.

mgautierfr · 2023-03-27T15:59:40Z

This is an interesting question.

I mainly see two kinds of testing:

The first one groups all tests that are technically mandatory. The exact definition is subject to discussion, but at first glance, I would say that a failing test in this category would make libzim raise an exception at some point. I can think of:

All invalid data breaking the zim format (pointer pointing outside of the zim file, blob number greater than the number of blobs in the cluster, ...)
Wrong redirection (redirection pointing to nothing, or redirection loop)
Missing namespaces (C at least, others ?)
To a certain extent, missing metadata. (Libzim itself doesn't depend on them, but if libkiwix expects, for example, "Publisher" to be present, we will get an exception if it is not)
other ?
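
As an example of a first-group test, the "wrong redirection" check can be sketched as below. This is a toy model, not libzim's API: the archive is reduced to a hypothetical `redirects` mapping (path to target path) and a set of `entries` holding the paths of real content, and `check_redirects` is a name invented for this sketch.

```python
def check_redirects(redirects: dict[str, str], entries: set[str]) -> list[str]:
    """Follow every redirect chain and report loops and dangling targets."""
    problems = []
    for start in redirects:
        seen = {start}
        current = start
        while current in redirects:
            current = redirects[current]
            if current in seen:
                # We came back to a path already visited: the chain never ends.
                problems.append(f"{start}: redirection loop")
                break
            seen.add(current)
        else:
            # The chain ended normally; its final target must be a real entry.
            if current not in entries:
                problems.append(f"{start}: redirects to missing entry {current}")
    return problems


if __name__ == "__main__":
    redirects = {"a": "b", "b": "c", "x": "y", "y": "x", "d": "missing"}
    print(check_redirects(redirects, entries={"c"}))
```

Each chain is walked independently, which is quadratic in the worst case but keeps the sketch short; a real checker would probably cache already-resolved paths.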

In the second group, I would put all other tests that may be good to have (for better quality) but are not mandatory:

Metadata length : A metadata value that is too long (or empty) may break the display style, make search too complex, ... but it will not "crash" the software.
Wrong tags or lang : Same here, a wrong tag/lang will only mix up the UI, but the zim file is still "valid".
Missing checksum ?
other ?

I would say that failures in the first group are errors while failures in the second group are warnings.
But nothing prevents us from having an option such as -Werror to treat all warnings as errors when the user wants to be pedantic (us in zimfarm, for example).
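
The error/warning split with a -Werror-like switch could be modelled as in the sketch below. `Severity`, `Finding`, `exit_code` and the `warnings_as_errors` parameter are hypothetical names used only here; the thread proposes the option but does not say how (or whether) zimcheck exposes it.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # second group: quality issues, the file still works
    ERROR = "error"      # first group: the file is technically broken


@dataclass
class Finding:
    severity: Severity
    message: str


def exit_code(findings: list[Finding], warnings_as_errors: bool = False) -> int:
    """Return 1 if the run should fail, 0 otherwise."""
    for finding in findings:
        if finding.severity is Severity.ERROR:
            return 1
        if warnings_as_errors and finding.severity is Severity.WARNING:
            return 1
    return 0


if __name__ == "__main__":
    findings = [Finding(Severity.WARNING, "Title is very long")]
    print(exit_code(findings))                           # lenient run
    print(exit_code(findings, warnings_as_errors=True))  # pedantic run
```

Keeping the severity on each finding, rather than deciding at check time, lets the same set of checks serve both a lenient default and a pedantic mode.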

mgautierfr mentioned this issue Mar 27, 2023

Centralized metadata constraints #339

Merged

rgaudin changed the title ~~What to test with zim-test and how ?~~ What to test with zimcheck and how ? Mar 27, 2023

kelson42 added question zimcheck enhancement labels Mar 27, 2023
