-
-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make zimcheck report its findings without accumulating them all in memory #311
Comments
@kelson42 Alternatively, we may limit the number of errors/warnings - if the purpose of zimcheck is to identify bugs in the ZIM creation software, then we don't have to report all issues as most of them will have the same root cause. That workaround will be easier to implement. |
@veloman-yunkan Question: might that be that the problem is worse because we don't have error codes and store therefore a lot of free text, see #239 ? |
@kelson42 Not exactly. In-memory representation of errors is definitely not optimal. However there is a lower bound to how much we can reduce it. A real solution is to either limit the error count or not accumulate the errors in RAM at all. |
@veloman-yunkan I would prefer to just fix the software elegantly that the memory consumption does not go wild. I understand that following current approach (in RAM), there will be always a "lower bound". 6GB is too high... but from the Kiwix point-of-view, we could probably live with worse case scenarios up to 2GB. Therefore, I wonder if just handling the errors smartly in memory would not be good enough? Would that not be worth an attempt? At least doing that can not be a bad move! |
@kelson42 That can be a waste of time. What if the best we can get after implementing that optimization is still above 2GB for stackoverflow.com_en_all_2022-05.zim? |
I agree that we could change the output behavior to print the errors as soon as we have found them and not wait the end to be able to regroup them by category. We can somehow assume that:
|
zimcheck's traditional (non-JSON) report format enforces an implementation where all findings (errors/warnings) are accumulated in memory so that they can be output grouped by the check type. When support for JSON output format was added the in-memory representation of each finding became more space consuming. As a result on huge ZIM-files with lots of errors zimcheck's memory consumption can be prohibitively large. For example, on a machine with 8GB of RAM the following checks run on stackoverflow.com_en_all_2022-05.zim (which has more than 64M entries) ended like follows:
zimcheck --redirect_loop
completed in 8.5 minutes using 1097 MiB of RAM (there were no errors)zimcheck --url_external
completed in ~2 hours using 1219 MiB of RAM and producing a 79MiB outputzimcheck --url_internal --url_external
reached 6GiB of RAM usage in ~8 hours and was still far from completion. Forcing (with the help of a debugger) premature output produced a 745MiB report. The run was then interrupted.I don't see why we shouldn't change the output format (and the implementation) so that errors/warnings can be output as soon as they are discovered.
The text was updated successfully, but these errors were encountered: