Common serialisation format #726
-
I'll be repeating a little bit of what I said in the email thread, but I want to get it on the record in public here. First point: histogram serialization can be split into two levels of abstraction. The high level is what fields you want to store, what their types are, etc. for a reasonably broad ontology of histograms. Since the boost-histogram Python library is limited to a finite set of instantiated C++ templates (an infinite set of histogram types, generated by a finite set of generators), the ontology can and should at least cover everything in the Python library.

Whether an ontology can cover all possible C++ Boost::Histograms is unclear, because these can include axis and storage types defined by users, and it might need to cover the entire C++ language to do that. Boost::Serialization and ROOT dictionaries do cover almost all of C++, so one possible solution would be to just use Boost or ROOT I/O, though that makes things difficult on the Python side. Python users would want to deserialize histograms in a way that is usable in Python, which means identifying the finite set of axis and storage types that the boost-histogram library can use. So if C++ users go for full generality with Boost or ROOT I/O, Python users might end up wanting their own serialization to identify the parts they can use, which splits it into two formats.

The low-level format is how the set of typed field values map to bytes on disk (or bytes over the network, etc.). For ROOT I/O, this is already specified: TStreamerInfo. For Aghast, the high level was the Flatbuffers schema and the low level was a Flatbuffers serialization. (I have to use the indefinite article, "a Flatbuffers serialization," because it's less determined by Flatbuffers than you might think. Flatbuffers readers are automatic, but Flatbuffers writers have to choose which data's bytes are adjacent to which; that's more freedom than I wanted while writing Aghast.) The low-level format could even be JSON. Case in point: even ROOT objects can be serialized as JSON, which I used to test the high-level logic of ROOT I/O independently of the low-level.

The more important part, which will take more deliberation, is the high-level ontology. If you decide on a high-level description of histograms that can be presented as a JSON schema, Avro schema, Thrift schema, Flatbuffers schema, etc., then there's nothing preventing you from encoding those high-level descriptions in both a binary format and human-readable text (JSON, YAML, FITS, ASDF, etc.). We like to avoid proliferation, of course, but the worst part of proliferation is when two schemas spell a field name different ways, not when one process uses Avro's binary format and another uses Avro's JSON format, for instance. These are both recognized modes of the format and there are converters between them.

Second point: histograms have a natural split between metadata and numerical data. The metadata encodes the types of axes, types of storage, names, axis ranges, etc. The numerical data are the bin contents. It's only unclear how to categorize a few pieces of information, such as bin borders of irregular axes or which bins exist in sparse axes. For the sake of definitions, let's say that everything that scales as O(1) in the number of bins is "metadata" and everything that does not (usually O(n), where n is the number of bins, or O(n1 × n2 × ...), where n1, n2, etc. are the numbers of bins along each axis) is "numerical data." (Thus, bin borders of irregular axes and which bins exist in sparse axes are numerical data, not metadata.)
The metadata is the more complex part, but (in the limit of large n) its size is negligible. It is also the part that users might want to access in a human-readable way. In the Aghast specification, the numerical data were the *Buffer objects and the metadata were everything else. (The separation could have been better: that's a criticism, in retrospect.) If you try to put both the metadata and numerical data into a JSON format, the best you can do for the numerical part is a compressed, Base64-encoded string, which (as you've pointed out, @HDembinski) increases its size by 33%. Moreover, I'd add that including the numerical data in the human-readable stream obscures the human-readable part by adding a lot of noise that has to be scrolled past (if an editor can even load the file). Mixing metadata and numerical data in JSON is not a good option, not something I'd recommend.

In NHEP, we're accustomed to all data being binary, but the astronomers I've talked to wouldn't even consider all-binary files, despite the fact that they routinely work with large images and simulations. The FITS (old) and ASDF (new) formats handle this by putting the human-readable metadata at the top of the file, where it can be easily inspected, and the binary data (usually images or tabulated objects) down below. I'm not saying that we should adopt FITS or ASDF, but the principle of separating metadata and numerical data is a good one.

With axis and storage specifiers separated from the bin contents, comprising an O(1) header on a big dataset of size n, the choice of metadata encoding becomes irrelevant for performance, but making it human-readable adds a useful feature for interpretation. There will be standard readers, of course, but you don't have to rely on standard readers. Thus, you can have your cake ("fast, efficient, and opaque") and eat it, too ("slow, inefficient, and transparent"), as long as you make the O(n) numerical data fast and efficient and the metadata inefficient and transparent.

This is, incidentally, exactly how Awkward Array works: the metadata and all O(1) operations are in Python, because that makes it easy and the inefficiency of Python is irrelevant for O(1) things; the numerical data and all O(n) and O(n log n) operations are in precompiled routines, because that's where the performance matters. It is a useful division to make across applications. GPU programming works this way, too: only the "flat," scaling part of the problem is sent to the GPU. It's also the reason why columnar formats are easily interconvertible: Awkward Array and Apache Arrow have totally different metadata, but nearly the same numerical data. Converting the metadata is O(1), and most of the numerical columns can be zero-copy views. The Parquet file format uses Thrift for its metadata and custom binary for its numerical data. In all of these applications, the separation of the data description into complex but small metadata and simple but large numerical data is essential.

You can put this principle to work for histogram serialization by keeping the O(1) metadata and the O(n) bin contents as separate, independently accessible pieces.
In figuring out the high-level description, you can use JSON and NumPy arrays as a placeholder. Whatever you use in the low-level encoding would be equivalent to these two. For instance, if you end up putting these in HDF5 files, the JSON is the Dataset metadata (a string) and the array is the Dataset itself. If you end up using a binary encoding for the metadata anyway, at least use something standard, like Thrift (as Parquet does) or Avro (my favorite; it has an alternate JSON form). I've mentioned that, as a low-level encoding, the ZIP file format has all the features you need: histograms would be individually readable binary blobs, and ZIP is lightweight and ubiquitous. But get the high-level description figured out, with a good separation between metadata and data, and the preferred low-level encoding(s) can be decided later.
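To make that placeholder concrete, here is a minimal sketch of the JSON-plus-NumPy idea using a ZIP file as the low-level container. The field names are hypothetical, not a proposed schema; the point is only that the metadata stays human-readable while the arrays are written bit-for-bit.

```python
import io
import json
import zipfile

import numpy as np

# Hypothetical metadata for a single histogram (illustration only, no agreed schema).
metadata = {
    "axes": [{"type": "regular", "bins": 10, "start": 0.0, "stop": 1.0}],
    "storage": {"type": "double"},
}
values = np.zeros(12)  # 10 bins plus underflow/overflow

with zipfile.ZipFile("hist.zip", "w") as zf:
    zf.writestr("myhist/metadata.json", json.dumps(metadata, indent=2))
    buf = io.BytesIO()
    np.save(buf, values)                     # raw bits in .npy format
    zf.writestr("myhist/values.npy", buf.getvalue())

# Reading back: the metadata is inspectable with any unzip tool and text editor;
# only the arrays need a binary reader.
with zipfile.ZipFile("hist.zip") as zf:
    meta = json.loads(zf.read("myhist/metadata.json"))
    values = np.load(io.BytesIO(zf.read("myhist/values.npy")))
```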
It would have been a thousand times easier if the data were not binary, and Uproot still doesn't handle every case correctly: people still find ROOT files in which a byte of header here and there needs to be skipped for reasons I don't understand. Having a more complete specification would also have helped, but seriously: if the metadata parts, like TKey headers, histogram metadata, TTree metadata, etc., had been text or a standard encoding like XDR ("Thrift of the 90's"), it would have been much, much easier, regardless of whether the format was well specified or not. Reverse-engineering, or even following a spec, in a human-readable language is totally different from detective work on individual bytes. Also, Sebastien Binet did most of that work: I mostly read the go-hep code.
-
I'm going to put in my comments from the thread, as well. Hans mentioned Flatbuffers; they look rather nice - they have the option to output JSON too (at least from C++), so as long as we make sure that's readable too, it might be a nice dual-purpose format.

There are two competing groups here - one group wants fast, performant storage, and doesn't care if it's opaque. The other wants highly portable, user-inspectable storage for archival and transfer. I'd argue we already have fast and performant storage, it's just C++ only or Python only - the only "new" thing is allowing those two platforms to share stored histograms. Which is nice. People in this group are likely generating large numbers of (or large) histograms.

The thing we don't have at all today is what the other group is asking for - some clearly documented format that all our tools can read and write, and other tools can read and write too without having to depend on Boost.Histogram or boost-histogram. Thanks to the amount of interest in JSON, tools like simdjson can read it really quickly, and nobody really cares about the uncompressed size of files much anyway - any binary format still needs to be compressed, and text files & binary formats compress down to something not that horribly different anyway. Textual formats are really attractive to groups interested in archival storage, cross-tool histogram transfer, HEPData, Web APIs - places where small numbers of (or small) histograms are important.

But if done carefully, I think Flatbuffers might be able to appease both camps. Will have to play with them. The other solution that might appease both camps is some binary format that is easy to inspect via tools, like HDF5 (which can also internally compress, which is nice). Or we could simply provide two solutions - a binary format and a standardized interchange format (which is sort of the Flatbuffers idea too, you would choose a binary format vs. JSON for output).

Currently the main demand is for the interchange format, by the way - the only thing we don't have for the first group with the current format is the ability to get boost-histogram and Boost.Histogram to share files. We have nothing for the second group except limited conversion to a ROOT file via uproot.

To be clear, a binary format can be highly inspectable; examples include HDF5 and SQLite. Extensive tooling has been developed to make those transparent. The same is (now) true for ROOT, but it's been a tremendous amount of work. If we provide a binary format that is supposed to appease the second group, it needs:
This is why people still use CSV for pandas data frames when there are better formats available - it's often enough, and it's completely trivial to implement. HEPData supports JSON, YAML, CSV, ROOT, and YODA - the only binary format there is ROOT. A possible solution could be to support an existing standard (YODA?) plus a Flatbuffers binary format. Or something like that.
-
I've found pickle to be pretty unreliable when it comes to larger file sizes, and I have tried other solutions like https://docs.python.org/3/library/shelve.html. What are the considerations with these options? E.g., ATLAS has typically limited ROOT file sizes to no more than 10 GB. For a proof of concept, I think it's fine to start with boost-histogram, but it would be wise to make sure anything developed doesn't get tied to a specific choice of serialization library.
-
It seems we can narrow down what our requirements are. I think we agree on the following.

**Serialization between Boost.Histogram in C++ and Python needs to be possible**

We need a serialization format that supports the subset of Boost.Histogram classes that are implemented in both C++ and Python. The format should naturally allow users of C++ Boost.Histogram to write additional axes, storages, and accumulators and serialize them. They won't be readable in Python until those classes are also implemented in Python. This is fine. Most users just use the standard built-in histogram components, which are almost all shared between C++ and Python.

**The format should be fully specified so that the low-level representation can be read by other libraries**

The spec can then be included in UHI, as Henry said. If the format is fully specified, it does not need to be ASCII. A good compromise is to use a binary format with some metadata attached that allows one to interpret the low-level representation as numbers, strings, arrays of numbers, etc. ROOT and HDF5 are suitable candidates for storing such a well-specified low-level representation.

**We want one primary serialization format, optionally several secondary formats**

The low-level representation of the histograms could be stored in different formats, including ASCII. Basically, any format that allows writing floats, integers, and arrays of those is a possible backend. Nevertheless, we need to converge on a default format that fits the standard use case of our users. This format should be binary rather than ASCII. ASCII formats are good for configuration files and other data that needs to be written and edited frequently by humans. Histograms are not configuration data and should not be edited by humans. Boost.Histogram was designed around maximum performance, which you get by using clever algorithms, but primarily by avoiding unnecessary work. The conversion from binary to ASCII and back is expensive. A binary format allows one to write arrays of numbers directly to disk without converting the bits. The performance is not important for the histogram metadata, which is small, but it is important for the arrays. These can be large: Boost.Histogram supports and encourages the use of very high-dimensional histograms, and those are regularly used in HEP analyses. Such histograms can reach sizes of a GB or more.
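As a rough illustration of that cost (the sizes are approximate and the timings depend on hardware; this is not a benchmark of any proposed format):

```python
import numpy as np

values = np.random.default_rng(0).normal(size=10_000_000)

# Binary: the float64 bits are written as-is, roughly 80 MB, limited mostly by I/O speed.
values.tofile("bins.raw")

# ASCII: every float is converted to decimal text, roughly 250 MB and much slower to write and parse.
np.savetxt("bins.txt", values)
```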
-
I have some additional requirements.

**Keep the serialization code that we already have**

Boost.Histogram in C++ and Python currently uses Boost.Serialization, which consists of a protocol to turn the actual objects of the library into a sequence of low-level primitives (strings, numbers, arrays thereof). This protocol part is the most difficult thing to write, and Boost.Serialization does a good job: it supports versioning and it is extensible. The protocol part is purely templated code, so it is agnostic to the library that does something with the low-level primitives. I exploit this in boost-histogram in Python, where I wrote a backend from scratch that simply converts the low-level primitives into Python objects, which are then finally serialized with Python's pickle framework. Important consequence: in the Boost.Serialisation framework, one cannot use two different formats for the low-level primitives. It is not possible to store the histogram metadata in JSON and the large arrays in binary with this approach.

**Do not use two formats when one suffices**

In addition to the previous point, it goes against the principle of simplicity to use two backend formats when one suffices. We should use the same low-level format to write axis data and storage data, not two different formats.
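A conceptual sketch of that arrangement (this is not boost-histogram's actual implementation; the class and fields are made up): the object reduces itself to low-level primitives, and pickle is just the backend that writes them.

```python
import pickle

import numpy as np

class ToyHistogram:
    """Stand-in for a histogram class; only the serialization hooks matter here."""

    def __init__(self, edges, values):
        self.edges = np.asarray(edges)
        self.values = np.asarray(values)

    def __getstate__(self):
        # "Front-end": reduce the object to low-level primitives
        # (numbers, strings, arrays), independent of the backend.
        return {"version": 1, "edges": self.edges, "values": self.values}

    def __setstate__(self, state):
        # Restore the object from the same primitives.
        self.edges = state["edges"]
        self.values = state["values"]

h = ToyHistogram([0.0, 0.5, 1.0], [3, 4])
restored = pickle.loads(pickle.dumps(h))  # pickle is merely the low-level backend
```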
-
**My non-starters**
**Things on which I am undecided**

I don't have a strong preference of binary format, so I think we need to discuss the different options and then narrow them down based on our requirements. I brought up Flatbuffers, but perhaps the simplest for our field is to store the low-level representation in ROOT. ROOT brings a lot of features to the table that we want: transparent compression, endianness conversion, a browser. And it is a standard format in our field already. I don't like the ROOT framework for obvious reasons, but the ROOT data format is pretty good. HDF5 is also a good option; it also has a browser and was designed with performance in mind. I would slightly prefer ROOT, since in HEP we are stuck with ROOT as a data format anyway. The principle of economy suggests that we therefore use the same format and not another one. We can read/write ROOT files without the ROOT framework now, which fixes the main caveat of the ROOT format.

**Flexibility for evolution**

I want to keep the Boost.Serialisation protocol, but I am not fixated on the current way the library objects are converted into low-level primitives. Boost.Serialisation supports the evolution of the serialised format; we can change it for every class as we see fit. The only price to pay is that the old code has to be kept around, so that a newer version of the library can still read older versions of the serialised format. To minimize these costs, we should not change the current format unless there is a good reason, but I am open to evolution. The current protocol was not designed so that the low-level representation looks pretty; it was designed to be as direct a representation of the in-memory objects as possible.
-
**How Boost.Serialisation works**

There is a front-end, which I also call the protocol, and a backend. The front-end is responsible for turning the library objects into low-level primitives. The backend is responsible for turning low-level primitives into bit streams or other formats. This separation is great because it allows for great flexibility on the backend. Boost.Serialisation has built-in support to write binary and XML, for example. Cereal is a Boost.Serialization fork/rewrite which supports binary, XML, and JSON. It should be straightforward to make a backend that writes low-level primitives into a ROOT file or into an HDF5 file. Someone made an HDF5 archive a while ago. For boost-histogram, I wrote a backend that converts low-level primitives into Python primitives. This backend was written without actually using the Boost.Serialisation library; I implemented the parts of the library that we needed from scratch.

The front-end optionally annotates the low-level primitives with names. These names are used by more complex backends like JSON or XML to store name-value pairs instead of just a sequence of values. Boost.Histogram uses this feature to annotate most data. Every class in Boost.Histogram that needs to be serialised has a templated `serialize` function:

```cpp
template <class Archive>
void serialize(Archive& ar, unsigned /* version */) {
  detail::axes_serialize(ar, axes_);   // calls into another function to serialize the axis objects
  ar& make_nvp("storage", storage_);   // implicitly calls storage_.serialize(ar, ...);
  if (Archive::is_loading::value) {    // compute a value on load which only exists in memory
    offset_ = detail::offset(axes_);
    detail::throw_if_axes_is_too_large(axes_);  // as the name says
  }
}
```
As the author of a backend, one has to write two classes, a reader and a writer. Both need to implement reading and writing the various primitives: numbers, strings, arrays of numbers, etc. They don't need to know anything else. For example, they don't need to know how to serialise an axis or a storage object.

In other words, on write, the protocol effectively generates a tree of primitives, which are then written sequentially by the writer backend. This tree is defined at compile time by the code and is never instantiated in memory. If the backend requires a conversion, only one converted primitive is alive at any given time, which is the most efficient solution. On read, the protocol restores the high-level object from the tree of primitives. That fields are read from the backend in the right order is guaranteed by the front-end.

The order in which low-level primitives are written to the backend matters and cannot be changed by a user, unless all leaves are tagged as name-value pairs. In that case, and if the backend is actually written to support look-up by name, named values can also be reordered on disk by external code. The fact that the same framework allows one to use both approaches is an advantage of Boost.Serialisation: people who want to remove every unnecessary bit from the serialisation format can do so by ignoring the names of name-value pairs and relying on the strict writing/reading order, while people who want name-value pairs for better readability of the serialisation format can have that as well.
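A toy sketch of that choice in Python (this is not the real Boost.Serialization API; the function and names are invented): the same front-end traversal can feed a compact backend that ignores the names and relies on strict order, or a backend that keeps the name-value pairs.

```python
import json
import struct

def serialize_axis(emit, axis):
    # Front-end for a hypothetical regular axis: it fixes the names and the strict order.
    emit("bins", axis["bins"])
    emit("start", axis["start"])
    emit("stop", axis["stop"])

def to_binary(axis):
    chunks = []
    def emit(name, value):
        fmt = "<i" if isinstance(value, int) else "<d"
        chunks.append(struct.pack(fmt, value))   # the name is ignored
    serialize_axis(emit, axis)
    return b"".join(chunks)                      # compact, meaningless without the front-end's order

def to_json(axis):
    fields = {}
    serialize_axis(lambda name, value: fields.__setitem__(name, value), axis)
    return json.dumps(fields)                    # readable, addressable by name

axis = {"bins": 10, "start": 0.0, "stop": 1.0}
print(to_binary(axis))   # 20 bytes
print(to_json(axis))     # {"bins": 10, "start": 0.0, "stop": 1.0}
```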
-
Hello, I've recently been experimenting with Boost.Histogram and how it integrates into HEP analyses. The Pythia8 + Boost.Histogram + Boost.MPI combination looks really good and I love being as far as I can from ROOT, but when I finally have to dump my histograms into a presentation, I only have ROOT as an option in the C++ realm. I've read the discussion and looked at the custom serialization backend mentioned by @HDembinski, but I can't figure out how exactly to enable C++-to-Python data exchange. Maybe I'm missing something important here. In the past I've just dumped more or less raw data from the event generator and done the filling completely in Python, and I'm fine with that thanks to libraries like this, but if I only need the histogram it seems a waste to store intermediate data. It would be great if I could generate my events and fill some histograms in C++, and continue from there in Python.
-
Thanks for this discussion; we have also become quite interested in the serialization of histograms. PDFs and parameters are clear, but the data becomes trickier. Storing a histogram would be an essential part of it.
This would actually be the format that we would look for: JSON (pure) or with binary mixed in (as outlined by @jpivarski; ASDF seems to be a great candidate, and we're already using it with unbinned data). Could you imagine some mixture? Say, having a JSON-like output as well as a BSON one?
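For what it's worth, a minimal sketch of that kind of mixture with the Python asdf library (the field names are made up, not an agreed histogram schema): the tree is written as human-readable YAML at the top of the file, and the NumPy arrays become binary blocks below it.

```python
import asdf
import numpy as np

tree = {
    "histogram": {
        "axes": [{"type": "regular", "bins": 100, "start": 0.0, "stop": 1.0}],
        "values": np.zeros(102),  # bin contents incl. under/overflow -> stored as a binary block
    }
}
asdf.AsdfFile(tree).write_to("histogram.asdf")

with asdf.open("histogram.asdf") as af:
    meta = af.tree["histogram"]["axes"]                  # human-readable YAML part
    values = np.asarray(af.tree["histogram"]["values"])  # read from the binary section
```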
-
I'll make my suggestions as a top-level comment, all in one place, rather than piecemeal.

No one should be suggesting all-JSON formats for histograms. As @HDembinski points out, people who make such suggestions just don't know how large histograms can become. The absolute best you can do to pack binary data into a single JSON string is Base-122, which throws away 12.5% efficiency but isn't standard, or Base-64, which throws away 25% efficiency but is standard. Moreover, if you have encoded (probably also compressed) data in strings, the human-readability argument goes out the window (unless you're reading it in a GUI that unfolds the JSON as needed). On the other hand, the metadata for a histogram is complex and would have rich structure that I'd want to explore by eye and programmatically as a data structure.

I think the astronomy community has the right idea about the value of mixed text/binary formats (text at the top, where it's easy to scroll through in an editor; binary at the bottom). FITS is an old format with this feature. Its kluginess is a product of its age (43 years!), but the astronomers I've talked to say that the text/binary mixture has been a very good experience and consider it an absolute requirement for new formats, such as ASDF, which is YAML at the top, binary at the bottom. The only reason I'd be hesitant about using ASDF for histograms is that its C++ development has apparently stalled.

A pure binary format means that you'd always need a reader, which can be okay if they're easy to come by. Now the problem is that we have too many choices: there are a lot of binary, JSON-like formats, one for each combination of needing a schema, whether the schema is shipped with the data, how much is dynamically typed and how much is encoded in the schema, and how much can be read without reading everything. For example, BSON does not have a schema, so the keys in key-value pair mappings are all encoded as strings (no benefit from being binary). Using BSON, one would need to be careful not to define

```
"storage": {"type": "Weight", "data": [{"value": 3.14, "weight": 1.1}, ...]}
```

because the key strings would be repeated for every bin; the columnar form

```
"storage": {"type": "Weight", "data": {"values": [3.14, 2.71, ...], "weights": [1.1, 0.9, ...]}}
```

avoids that. (Of course, that's doable. I'm just pointing out the gotcha.) I have a personal favorite binary, JSON-like format, Avro, which has a schema that gets embedded in the file, and every data object has both a binary form and a JSON form, so people with small histograms still have a way to dump them and look at them by eye in a text editor. (When my primary intention is to answer some question, rather than develop a software library, I would do stuff like that, too, even if it means dumping GBs in my /tmp directory.) So we'd have to choose a binary serialization format. Maybe we could put that to a random number generator, just to get unstuck from the choice paralysis.

However, beyond that, there's still another thing that I would consider desirable: being able to access some fields in the metadata without reading, decompressing, or interpreting all of the bin data. Histograms can be big, but there can also be a lot of them. It would be very common to want to scan through a million histograms to verify that they all have the same binning and can therefore be merged, before loading any of the bin values.
Aghast tried to address this with Flatbuffers, which use indirection to allow everything to be load-on-demand (you can load any field without loading all of the fields), but my experience with that is that Flatbuffers underspecifies how data are laid out in the buffer. The programmer using Flatbuffers has to make a lot of decisions (perhaps to allow microoptimizations of which fields are in the same CPU cache lines...). The fine granularity of reading each field independently of every other field in Flatbuffers is overkill for this purpose. It would be sufficient to have two reading stages: (1) read all the metadata and (2) read all the bin contents (and maybe put irregular edges in the second stage, too). This is where the idea of having two blobs comes in.

I should point out that `ak.to_buffers`/`ak.from_buffers` works this way, and it has been a great abstraction, in the sense that it has solved problems we didn't originally intend it for. In general, any Awkward Array can be broken down into one JSON object and a set of named binary buffers:

```python
>>> import awkward as ak
>>> array = ak.Array([
... [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
... [],
... [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
... ] * 1000000)
>>> form, length, blobs = ak.to_buffers(array)
>>> print(f'{{"length": {length}, "form": {form.to_json()}}}')
{"length": 3000000, "form": {"class": "ListOffsetArray", "offsets": "i64", "content": {"class": "RecordArray",
"fields": ["x", "y"], "contents": [{"class": "NumpyArray", "primitive": "float64", "inner_shape": [], "parameters":
{}, "form_key": "node2"}, {"class": "ListOffsetArray", "offsets": "i64", "content": {"class": "NumpyArray",
"primitive": "int64", "inner_shape": [], "parameters": {}, "form_key": "node4"}, "parameters": {}, "form_key":
"node3"}], "parameters": {}, "form_key": "node1"}, "parameters": {}, "form_key": "node0"}}
>>> blobs
{
'node0-offsets': array([ 0, 3, 3, ..., 4999998, 4999998, 5000000]),
'node2-data': array([1.1, 2.2, 3.3, ..., 3.3, 4.4, 5.5]),
'node3-offsets': array([ 0, 1, 3, ..., 14999991, 14999995, 15000000]),
'node4-data': array([1, 1, 2, ..., 3, 4, 5]),
}
>>> ak.from_buffers(form, length, blobs)
<Array [[{x: 1.1, y: [1]}, ..., {...}], ...] type='3000000 * var * {x: floa...'>
```

Being able to separate the small, richly-typed metadata from the large, simply-typed binary blobs is a great building block for making formats. For instance, we can put these things in HDF5, use them as a pickle format, and use them as a way to hand data over the border between C and Python: you only need a string (JSON) for the small, richly-typed metadata and a mapping from names to pointers for the large, simply-typed binaries. But this is a building block, not a whole format.

Another desirable feature is the ability to put these things in existing container formats, with more than one histogram per container. ROOT users have benefited from the ability to make collections of histograms, sometimes organized in directories, sometimes accompanying non-histogram data, all in a single file that can be emailed (or CERNBoxed, or whatever). For containers, ROOT, HDF5, and ZIP all seem like good candidates, and if the metadata/blobs split is well defined, the choice of container can even be left open. So it's a two-step serialization:

```mermaid
graph TD;
    A(histogram in memory)-->B(metadata object and named binary blobs);
    B-->C(physical file: ROOT, HDF5, ZIP);
```
For ROOT, the natural choice for a binary blob is a TBasket, and the "name" in this case would be a seek location. The rich metadata would be more natural as class instances than JSON, since ROOT has a lot of tools for inspecting them (and Uproot should automatically be able to read them, if they're not too complicated). A ROOT user would see a histogram as an object in a directory and (unlike TH*) be able to read the metadata of many histograms without the cost of loading and decompressing all the bin contents. Just as a TTree delays the reading of its TBaskets, the boost-histograms would delay the reading of their binary data.

For HDF5, the natural choice to hold everything for one histogram is a Group, with the rich metadata as a JSON-valued attribute for that Group (a sketch of this is below).

For ZIP, all of the binary blobs, including the metadata, would be files within the ZIP archive. All the data for one histogram can be grouped in a directory (the directory name is the histogram name), and the metadata can have a special name within the directory.

To wrap this up, it's precisely because histograms are big, and also because there are often many of them, that I'm suggesting a split between the metadata and the bins/edges. Once you have that split, it becomes reasonable to consider multiple container formats, just as the metadata-plus-buffers building block in Awkward Array can be carried by several containers.
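A minimal sketch of that HDF5 mapping with h5py (the group, dataset, and attribute names are placeholders, not a finalized layout):

```python
import json

import h5py
import numpy as np

metadata = {"axes": [{"type": "variable"}], "storage": {"type": "double"}}  # hypothetical fields

with h5py.File("histograms.h5", "w") as f:
    g = f.create_group("myhist")
    g.attrs["metadata"] = json.dumps(metadata)                        # rich metadata as a JSON attribute
    g.create_dataset("edges", data=np.array([0.0, 0.3, 1.0, 10.0]))   # O(n) numerical data as Datasets
    g.create_dataset("values", data=np.zeros(5), compression="gzip")

# Scanning many histograms only touches the attributes, not the (compressed) bin contents.
with h5py.File("histograms.h5", "r") as f:
    for name, group in f.items():
        meta = json.loads(group.attrs["metadata"])
```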
-
See scikit-hep/uhi#105 for the proposed standard.
-
There are two competing requirements for a common serialization format.
a) fast, efficient, opaque
b) slow, inefficient, transparent (human-readable)
The common serialization format should allow Boost.Histogram in C++ to exchange data with boost-histogram in Python, among other things.
TODO: Add options which we have discussed so far with pros and cons.
TODO for Hans: Explain how the Boost.Serialization protocol works; we already use and share it between C++ and Python and should continue using it.