Proposal: New API to replace existing arrays in npz files #68

zhihaoy · 2020-02-27T00:22:05Z

Currently, dump_npz either destroys all existing arrays in an npz file or append arrays with names that exist in the npz file as duplicated entries. This is a rather strange semantics, especially when loading individual arrays with load_npz doesn't follow those semantics.

A reasonable semantics would be replacing existing arrays. This requires a context object, with a role similar to HighFive::File.

NumPy doesn't support this. They only overwrite all arrays at once.

I checked out libzippp and libzip++, neither work with streams. It's primarily because dump_npy_stream and libzip are both "push" interfaces (thus, both need to run the main loop), so they cannot work together without writing a special stream class that serves as a pipe so that libzip can pull data from it...

I think there is a rather simple way to support replacing semantics given a context object. When an npz_file is opened for update, append as usual while keeping the central directory as a data structure in memory. When closing, write a temporary file with only the up-to-date arrays, finish the file with the central directory entry, do an atomic move to replace the old file. (It's possible to shrink a file in-place, but I guess that would invalidate too much I/O buffer).

The text was updated successfully, but these errors were encountered:

JohanMabille · 2020-02-27T09:43:23Z

Thanks for your proposal, I think having this feature would be really nice! Regarding the API, the boolean parameter append_to_existing_file should be replaced with a mode parameter (ideally a simple enum containing overwrite and replace), so it is clear the change is backward incompatible and upgrading to the last version will break the code (instead of having a failure at runtime only).

zhihaoy · 2020-02-27T16:33:22Z

Regarding the API, the boolean parameter append_to_existing_file should be replaced with a mode parameter (ideally a simple enum containing overwrite and replace), so it is clear the change is backward incompatible and upgrading to the last version will break the code (instead of having a failure at runtime only).

I may fail to explain this part well: I was not proposing to change the existing dump_npz. I suggest to add a context class named xt::npz_file, then add a dump member function or an free-function overload of xt::dump that takes xt::npz_file&. Because shrinking the file after replacing each array would be inefficient, but doing that at context closing time is fine.

After doing that, we can consider whether to make a change as you suggested on dump_npz. Although it's not as efficient as the xt::dump API when updating many arrays, the user may want to update only one array.

(I'm going to start implement the necessary bits.)

PS: Switching to a mode enum has an additional benefit: we can deprecate append_to_existing_file with a migration suggestion: https://godbolt.org/z/rTPgbo

JohanMabille · 2020-03-02T20:24:00Z

Thanks for the clarification. Let's proceed in two steps then, first adding an overload to dump_npz and then see how we can improve the existing API.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: New API to replace existing arrays in npz files #68

Proposal: New API to replace existing arrays in npz files #68

zhihaoy commented Feb 27, 2020

JohanMabille commented Feb 27, 2020

zhihaoy commented Feb 27, 2020

JohanMabille commented Mar 2, 2020

Proposal: New API to replace existing arrays in npz files #68

Proposal: New API to replace existing arrays in npz files #68

Comments

zhihaoy commented Feb 27, 2020

JohanMabille commented Feb 27, 2020

zhihaoy commented Feb 27, 2020

JohanMabille commented Mar 2, 2020