Common serialisation format #726
-
I'll be repeating a little bit of what I said in the email thread, but I want to get it on the record in public here. First point: histogram serialization can be split into two levels of abstraction. The high level is what fields you want to store, what their types are, etc. for a reasonably broad ontology of histograms. Since the boost-histogram Python library is limited to a finite set of instantiated C++ templates (an infinite set of histogram types, generated by a finite set of generators), the ontology can and should at least cover everything in the Python library.

Whether an ontology can cover all possible C++ Boost::Histograms is unclear, because these can include axis and storage types defined by users, and it might need to cover the entire C++ language to do that. Boost::Serialization and ROOT dictionaries do cover almost all of C++, so one possible solution would be to just use Boost or ROOT I/O, though that makes things difficult on the Python side. Python users would want to deserialize histograms in a way that is usable in Python, which means identifying the finite set of axis and storage types that the boost-histogram library can use. So if C++ users go for full generality with Boost or ROOT I/O, Python users might end up wanting their own serialization to identify the parts they can use, which splits it into two formats.

The low-level format is how the set of typed field values map to bytes on disk (or bytes over the network, etc.). For ROOT I/O, this is already specified: TStreamerInfo. For Aghast, the high level was the Flatbuffers schema and the low level was a Flatbuffers serialization. (I have to use the indefinite article, "a Flatbuffers serialization," because it's less determined by Flatbuffers than you might think. Flatbuffers readers are automatic, but Flatbuffers writers have to choose which data's bytes are adjacent to which; that's more freedom than I wanted while writing Aghast.) The low-level format could even be JSON. Case in point: even ROOT objects can be serialized as JSON, which I used to test the high-level logic of ROOT I/O independently of the low-level.

The more important part, which will take more deliberation, is the high-level ontology. If you decide on a high-level description of histograms that can be presented as a JSON schema, Avro schema, Thrift schema, Flatbuffers schema, etc., then there's nothing preventing you from encoding those high-level descriptions in both a binary format and human-readable text (JSON, YAML, FITS, ASDF, etc.). We like to avoid proliferation, of course, but the worst part of proliferation is when two schemas spell a field name different ways, not when one process uses Avro's binary format and another uses Avro's JSON format, for instance. These are both recognized modes of the format and there are converters between them.

Second point: histograms have a natural split between metadata and numerical data. The metadata encodes the types of axes, types of storage, names, axis ranges, etc. The numerical data are the bin contents. It's only unclear how to categorize a few pieces of information, such as bin borders of irregular axes or which bins exist in sparse axes. For the sake of definitions, let's say that everything that scales as O(1) in the number of bins is "metadata" and everything that does not (usually O(n), where n is the number of bins, or O(n1 × n2 × ...), where n1, n2, etc. are the numbers of bins along each axis) is "numerical data." (Thus, bin borders of irregular axes and which bins exist in sparse axes are numerical data, not metadata.)
The metadata is the more complex part, but (in the limit of large n) its size is negligible. It is also the part that users might want to access in a human-readable way. In the Aghast specification, the numerical data were the *Buffer objects and the metadata were everything else. (The separation could have been better: that's a criticism, in retrospect.) If you try to put both the metadata and numerical data into a JSON format, the best you can do for the numerical part is a compressed, Base64-encoded string, which (as you've pointed out, @HDembinski) increases its size by 33%. Moreover, I'd add that including the numerical data in the human-readable stream obscures the human-readable part by adding a lot of noise that has to be scrolled past (if an editor can even load the file). Mixing metadata and numerical data in JSON is not a good option, not something I'd recommend.

In NHEP, we're accustomed to all data being binary, but the astronomers I've talked to wouldn't even consider all-binary files, despite the fact that they routinely work with large images and simulations. The FITS (old) and ASDF (new) formats handle this by putting the human-readable metadata at the top of the file, where it can be easily inspected, and the binary data (usually images or tabulated objects) down below. I'm not saying that we should adopt FITS or ASDF, but the principle of separating metadata and numerical data is a good one.

With axis and storage specifiers separated from the bin contents, comprising an O(1) header on a big dataset of size n, the choice of metadata encoding becomes irrelevant for performance, but making it human-readable adds a useful feature for interpretation. There will be standard readers, of course, but you don't have to rely on standard readers. Thus, you can have your cake ("fast, efficient, and opaque") and eat it, too ("slow, inefficient, and transparent"), as long as you make the O(n) numerical data fast and efficient and the metadata inefficient and transparent.

This is, incidentally, exactly how Awkward Array works: the metadata and all O(1) operations are in Python, because that makes it easy and the inefficiency of Python is irrelevant for O(1) things; the numerical data and all O(n) and O(n log n) operations are in precompiled routines, because that's where the performance matters. It is a useful division to make across applications. GPU programming works this way, too: only the "flat," scaling part of the problem is sent to the GPU. It's also the reason why columnar formats are easily interconvertible: Awkward Array and Apache Arrow have totally different metadata, but nearly the same numerical data. Converting the metadata is O(1), and most of the numerical columns can be zero-copy views. The Parquet file format uses Thrift for its metadata and custom binary for its numerical data. In all of these applications, the separation of the data description into complex but small metadata and simple but large numerical data is essential.

You can put this principle to work for histogram serialization by keeping the O(1) metadata and the O(n) bin contents as separate, independently accessible pieces.
In figuring out the high-level description, you can use JSON and NumPy arrays as a placeholder. Whatever you use in the low-level encoding would be equivalent to these two. For instance, if you end up putting these in HDF5 files, the JSON is the Dataset metadata (a string) and the array is the Dataset itself. If you end up using a binary encoding for the metadata anyway, at least use something standard, like Thrift (as Parquet does) or Avro (my favorite; it has an alternate JSON form). I've mentioned that, as a low-level encoding, the ZIP file format has all the features you need: histograms would be individually readable binary blobs, and ZIP is lightweight and ubiquitous. But get the high-level description figured out, with a good separation between metadata and data, and the preferred low-level encoding(s) can be decided later.
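To make that placeholder concrete, here is a minimal sketch of the JSON-plus-NumPy idea using a ZIP file as the low-level container. The field names are hypothetical, not a proposed schema; the point is only that the metadata stays human-readable while the arrays are written bit-for-bit.

```python
import io
import json
import zipfile

import numpy as np

# Hypothetical metadata for a single histogram (illustration only, no agreed schema).
metadata = {
    "axes": [{"type": "regular", "bins": 10, "start": 0.0, "stop": 1.0}],
    "storage": {"type": "double"},
}
values = np.zeros(12)  # 10 bins plus underflow/overflow

with zipfile.ZipFile("hist.zip", "w") as zf:
    zf.writestr("myhist/metadata.json", json.dumps(metadata, indent=2))
    buf = io.BytesIO()
    np.save(buf, values)                     # raw bits in .npy format
    zf.writestr("myhist/values.npy", buf.getvalue())

# Reading back: the metadata is inspectable with any unzip tool and text editor;
# only the arrays need a binary reader.
with zipfile.ZipFile("hist.zip") as zf:
    meta = json.loads(zf.read("myhist/metadata.json"))
    values = np.load(io.BytesIO(zf.read("myhist/values.npy")))
```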
It would have been a thousand times easier if the data were not binary, and Uproot still doesn't handle every case correctly: people still find ROOT files in which a byte of header here and there needs to be skipped for reasons I don't understand. Having a more complete specification would also have helped, but seriously: if the metadata parts, like TKey headers, histogram metadata, TTree metadata, etc., had been text or a standard encoding like XDR ("Thrift of the 90's"), it would have been much, much easier, regardless of whether the format was well specified or not. Reverse-engineering, or even following a spec, in a human-readable language is totally different from detective work on individual bytes. Also, Sebastien Binet did most of that work: I mostly read the go-hep code.
-
I'm going to put in my comments from the thread, as well. Hans mentioned Flatbuffers; they look rather nice - they have the option to output JSON too (at least from C++), so as long as we make sure that's readable too, it might be a nice dual-purpose format.

There are two competing groups here - one group wants fast, performant storage, and doesn't care if it's opaque. The other wants highly portable, user-inspectable storage for archival and transfer. I'd argue we already have fast and performant storage, it's just C++ only or Python only - the only "new" thing is allowing those two platforms to share stored histograms. Which is nice. People in this group are likely generating large numbers of (or large) histograms.

The thing we don't have at all today is what the other group is asking for - some clearly documented format that all our tools can read and write, and other tools can read and write too without having to depend on Boost.Histogram or boost-histogram. Thanks to the amount of interest in JSON, tools like simdjson can read it really quickly, and nobody really cares about the uncompressed size of files much anyway - any binary format still needs to be compressed, and text files & binary formats compress down to something not that horribly different anyway. Textual formats are really attractive to groups interested in archival storage, cross-tool histogram transfer, HEPData, Web APIs - places where small numbers of (or small) histograms are important.

But if done carefully, I think Flatbuffers might be able to appease both camps. Will have to play with them. The other solution that might appease both camps is some binary format that is easy to inspect via tools, like HDF5 (which can also internally compress, which is nice). Or we could simply provide two solutions - a binary format and a standardized interchange format (which is sort of the Flatbuffers idea too, you would choose a binary format vs. JSON for output).

Currently the main demand is for the interchange format, by the way - the only thing we don't have for the first group with the current format is the ability to get boost-histogram and Boost.Histogram to share files. We have nothing for the second group except limited conversion to a ROOT file via uproot.

To be clear, a binary format can be highly inspectable; examples include HDF5 and SQLite. Extensive tooling has been developed to make those transparent. The same is (now) true for ROOT, but it's been a tremendous amount of work. If we provide a binary format that is supposed to appease the second group, it needs:
This is why people still use CSV for pandas data frames when there are better formats available - it's often enough, and it's completely trivial to implement. HEPData supports JSON, YAML, CSV, ROOT, and YODA - the only binary format there is ROOT. A possible solution could be to support an existing standard (YODA?) plus a Flatbuffers binary format. Or something like that.
-
I've found pickle to be pretty unreliable when it comes to larger file sizes, and I have tried other solutions like https://docs.python.org/3/library/shelve.html. What are the considerations with these options? E.g., ATLAS has typically limited ROOT file sizes to no more than 10 GB. For a proof of concept, I think it's fine to start with boost-histogram, but it would be wise to make sure anything developed doesn't get tied to a specific choice of serialization library.
-
It seems we can narrow down what our requirements are. I think we agree on the following.

**Serialization between Boost.Histogram in C++ and Python needs to be possible**

We need a serialization format that supports the subset of Boost.Histogram classes that are implemented in both C++ and Python. The format should naturally allow users of C++ Boost.Histogram to write additional axes, storages, and accumulators and serialize them. They won't be readable in Python until those classes are also implemented in Python. This is fine. Most users just use the standard built-in histogram components, which are almost all shared between C++ and Python.

**The format should be fully specified so that the low-level representation can be read by other libraries**

The spec can then be included in UHI, as Henry said. If the format is fully specified, it does not need to be ASCII. A good compromise is to use a binary format with some metadata attached that allows one to interpret the low-level representation as numbers, strings, arrays of numbers, etc. ROOT and HDF5 are suitable candidates for storing such a well-specified low-level representation.

**We want one primary serialization format, optionally several secondary formats**

The low-level representation of the histograms could be stored in different formats, including ASCII. Basically, any format that allows writing floats, integers, and arrays of those is a possible backend. Nevertheless, we need to converge on a default format that fits the standard use case of our users. This format should be binary rather than ASCII. ASCII formats are good for configuration files and other data that needs to be written and edited frequently by humans. Histograms are not configuration data and should not be edited by humans. Boost.Histogram was designed around maximum performance, which you get by using clever algorithms, but primarily by avoiding unnecessary work. The conversion from binary to ASCII and back is expensive. A binary format allows one to write arrays of numbers directly to disk without converting the bits. The performance is not important for the histogram metadata, which is small, but it is important for the arrays. These can be large: Boost.Histogram supports and encourages the use of very high-dimensional histograms, and those are regularly used in HEP analyses. Such histograms can reach sizes of a GB or more.
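As a rough illustration of that cost (the sizes are approximate and the timings depend on hardware; this is not a benchmark of any proposed format):

```python
import numpy as np

values = np.random.default_rng(0).normal(size=10_000_000)

# Binary: the float64 bits are written as-is, roughly 80 MB, limited mostly by I/O speed.
values.tofile("bins.raw")

# ASCII: every float is converted to decimal text, roughly 250 MB and much slower to write and parse.
np.savetxt("bins.txt", values)
```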
-
I have some additional requirements.

**Keep the serialization code that we already have**

Boost.Histogram in C++ and Python currently uses Boost.Serialization, which consists of a protocol to turn the actual objects of the library into a sequence of low-level primitives (strings, numbers, arrays thereof). This protocol part is the most difficult thing to write, and Boost.Serialization does a good job: it supports versioning and it is extensible. The protocol part is purely templated code, so it is agnostic to the library that does something with the low-level primitives. I exploit this in boost-histogram in Python, where I wrote a backend from scratch that simply converts the low-level primitives into Python objects, which are then finally serialized with Python's pickle framework. Important consequence: in the Boost.Serialisation framework, one cannot use two different formats for the low-level primitives. It is not possible to store the histogram metadata in JSON and the large arrays in binary with this approach.

**Do not use two formats when one suffices**

In addition to the previous point, it goes against the principle of simplicity to use two backend formats when one suffices. We should use the same low-level format to write axis data and storage data, not two different formats.
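A conceptual sketch of that arrangement (this is not boost-histogram's actual implementation; the class and fields are made up): the object reduces itself to low-level primitives, and pickle is just the backend that writes them.

```python
import pickle

import numpy as np

class ToyHistogram:
    """Stand-in for a histogram class; only the serialization hooks matter here."""

    def __init__(self, edges, values):
        self.edges = np.asarray(edges)
        self.values = np.asarray(values)

    def __getstate__(self):
        # "Front-end": reduce the object to low-level primitives
        # (numbers, strings, arrays), independent of the backend.
        return {"version": 1, "edges": self.edges, "values": self.values}

    def __setstate__(self, state):
        # Restore the object from the same primitives.
        self.edges = state["edges"]
        self.values = state["values"]

h = ToyHistogram([0.0, 0.5, 1.0], [3, 4])
restored = pickle.loads(pickle.dumps(h))  # pickle is merely the low-level backend
```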
-
**My non-starters**
**Things on which I am undecided**

I don't have a strong preference of binary format, so I think we need to discuss the different options and then narrow them down based on our requirements. I brought up Flatbuffers, but perhaps the simplest for our field is to store the low-level representation in ROOT. ROOT brings a lot of features to the table that we want: transparent compression, endianness conversion, a browser. And it is a standard format in our field already. I don't like the ROOT framework for obvious reasons, but the ROOT data format is pretty good. HDF5 is also a good option; it also has a browser and was designed with performance in mind. I would slightly prefer ROOT, since in HEP we are stuck with ROOT as a data format anyway. The principle of economy suggests that we therefore use the same format and not another one. We can read/write ROOT files without the ROOT framework now, which fixes the main caveat of the ROOT format.

**Flexibility for evolution**

I want to keep the Boost.Serialisation protocol, but I am not fixated on the current way the library objects are converted into low-level primitives. Boost.Serialisation supports the evolution of the serialised format; we can change it for every class as we see fit. The only price to pay is that the old code has to be kept around, so that a newer version of the library can still read older versions of the serialised format. To minimize these costs, we should not change the current format unless there is a good reason, but I am open to evolution. The current protocol was not designed so that the low-level representation looks pretty; it was designed to be as direct a representation of the in-memory objects as possible.
-
**How Boost.Serialisation works**

There is a front-end, which I also call the protocol, and a backend. The front-end is responsible for turning the library objects into low-level primitives. The backend is responsible for turning low-level primitives into bit streams or other formats. This separation is great because it allows for great flexibility on the backend. Boost.Serialisation has built-in support to write binary and XML, for example. Cereal is a Boost.Serialization fork/rewrite which supports binary, XML, and JSON. It should be straightforward to make a backend that writes low-level primitives into a ROOT file or into an HDF5 file. Someone made an HDF5 archive a while ago. For boost-histogram, I wrote a backend that converts low-level primitives into Python primitives. This backend was written without actually using the Boost.Serialisation library; I implemented the parts of the library that we needed from scratch.

The front-end optionally annotates the low-level primitives with names. These names are used by more complex backends like JSON or XML to store name-value pairs instead of just a sequence of values. Boost.Histogram uses this feature to annotate most data. Every class in Boost.Histogram that needs to be serialised has a templated `serialize` function:

```cpp
template <class Archive>
void serialize(Archive& ar, unsigned /* version */) {
  detail::axes_serialize(ar, axes_);   // calls into another function to serialize the axis objects
  ar& make_nvp("storage", storage_);   // implicitly calls storage_.serialize(ar, ...);
  if (Archive::is_loading::value) {    // compute a value on load which only exists in memory
    offset_ = detail::offset(axes_);
    detail::throw_if_axes_is_too_large(axes_);  // as the name says
  }
}
```
As the author of a backend, one has to write two classes, a reader and a writer. Both need to implement reading and writing the various primitives: numbers, strings, arrays of numbers, etc. They don't need to know anything else. For example, they don't need to know how to serialise an axis or a storage object.

In other words, on write, the protocol effectively generates a tree of primitives, which are then written sequentially by the writer backend. This tree is defined at compile time by the code and is never instantiated in memory. If the backend requires a conversion, only one converted primitive is alive at any given time, which is the most efficient solution. On read, the protocol restores the high-level object from the tree of primitives. That fields are read from the backend in the right order is guaranteed by the front-end.

The order in which low-level primitives are written to the backend matters and cannot be changed by a user, unless all leaves are tagged as name-value pairs. In that case, and if the backend is actually written to support look-up by name, named values can also be reordered on disk by external code. The fact that the same framework allows one to use both approaches is an advantage of Boost.Serialisation: people who want to remove every unnecessary bit from the serialisation format can do so by ignoring the names of name-value pairs and relying on the strict writing/reading order, while people who want name-value pairs for better readability of the serialisation format can have that as well.
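A toy sketch of that choice in Python (this is not the real Boost.Serialization API; the function and names are invented): the same front-end traversal can feed a compact backend that ignores the names and relies on strict order, or a backend that keeps the name-value pairs.

```python
import json
import struct

def serialize_axis(emit, axis):
    # Front-end for a hypothetical regular axis: it fixes the names and the strict order.
    emit("bins", axis["bins"])
    emit("start", axis["start"])
    emit("stop", axis["stop"])

def to_binary(axis):
    chunks = []
    def emit(name, value):
        fmt = "<i" if isinstance(value, int) else "<d"
        chunks.append(struct.pack(fmt, value))   # the name is ignored
    serialize_axis(emit, axis)
    return b"".join(chunks)                      # compact, meaningless without the front-end's order

def to_json(axis):
    fields = {}
    serialize_axis(lambda name, value: fields.__setitem__(name, value), axis)
    return json.dumps(fields)                    # readable, addressable by name

axis = {"bins": 10, "start": 0.0, "stop": 1.0}
print(to_binary(axis))   # 20 bytes
print(to_json(axis))     # {"bins": 10, "start": 0.0, "stop": 1.0}
```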
-
Hello, I've recently been experimenting with Boost.Histogram and how it integrates into HEP analyses. The Pythia8 + Boost.Histogram + Boost.MPI combination looks really good and I love being as far as I can from ROOT, but when I finally have to dump my histograms into a presentation, I only have ROOT as an option in the C++ realm. I've read the discussion and looked at the custom serialization backend mentioned by @HDembinski, but I can't figure out how exactly to enable C++-to-Python data exchange. Maybe I'm missing something important here. In the past I've just dumped more or less raw data from the event generator and done the filling completely in Python, and I'm fine with that thanks to libraries like this, but if I only need the histogram it seems a waste to store intermediate data. It would be great if I could generate my events and fill some histograms in C++, and continue from there in Python.
-
Thanks for this discussion; we have also become quite interested in the serialization of histograms. PDFs and parameters are clear, but the data becomes trickier. Storing a histogram would be an essential part of it.
This would actually be the format that we would look for: JSON (pure) or with binary mixed in (as outlined by @jpivarski; ASDF seems to be a great candidate, and we're already using it with unbinned data). Could you imagine some mixture? Say, having a JSON-like output as well as a BSON one?
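For what it's worth, a minimal sketch of that kind of mixture with the Python asdf library (the field names are made up, not an agreed histogram schema): the tree is written as human-readable YAML at the top of the file, and the NumPy arrays become binary blocks below it.

```python
import asdf
import numpy as np

tree = {
    "histogram": {
        "axes": [{"type": "regular", "bins": 100, "start": 0.0, "stop": 1.0}],
        "values": np.zeros(102),  # bin contents incl. under/overflow -> stored as a binary block
    }
}
asdf.AsdfFile(tree).write_to("histogram.asdf")

with asdf.open("histogram.asdf") as af:
    meta = af.tree["histogram"]["axes"]                  # human-readable YAML part
    values = np.asarray(af.tree["histogram"]["values"])  # read from the binary section
```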
-
I'll make my suggestions as a top-level comment, all in one place, rather than piecemeal.

No one should be suggesting all-JSON formats for histograms. As @HDembinski points out, people who make such suggestions just don't know how large histograms can become. The absolute best you can do to pack binary data into a single JSON string is Base-122, which throws away 12.5% efficiency but isn't standard, or Base-64, which throws away 25% efficiency but is standard. Moreover, if you have encoded (probably also compressed) data in strings, the human-readability argument goes out the window (unless you're reading it in a GUI that unfolds the JSON as needed). On the other hand, the metadata for a histogram is complex and would have rich structure that I'd want to explore by eye and programmatically as a data structure.

I think the astronomy community has the right idea about the value of mixed text/binary formats (text at the top, where it's easy to scroll through in an editor; binary at the bottom). FITS is an old format with this feature. Its kluginess is a product of its age (43 years!), but the astronomers I've talked to say that the text/binary mixture has been a very good experience and consider it an absolute requirement for new formats, such as ASDF, which is YAML at the top, binary at the bottom. The only reason I'd be hesitant about using ASDF for histograms is that its C++ development has apparently stalled.

A pure binary format means that you'd always need a reader, which can be okay if they're easy to come by. Now the problem is that we have too many choices: there are a lot of binary, JSON-like formats, one for each combination of needing a schema, whether the schema is shipped with the data, how much is dynamically typed and how much is encoded in the schema, and how much can be read without reading everything. For example, BSON does not have a schema, so the keys in key-value pair mappings are all encoded as strings (no benefit from being binary). Using BSON, one would need to be careful not to define

```
"storage": {"type": "Weight", "data": [{"value": 3.14, "weight": 1.1}, ...]}
```

because the key strings would be repeated for every bin; the columnar form

```
"storage": {"type": "Weight", "data": {"values": [3.14, 2.71, ...], "weights": [1.1, 0.9, ...]}}
```

avoids that. (Of course, that's doable. I'm just pointing out the gotcha.) I have a personal favorite binary, JSON-like format, Avro, which has a schema that gets embedded in the file, and every data object has both a binary form and a JSON form, so people with small histograms still have a way to dump them and look at them by eye in a text editor. (When my primary intention is to answer some question, rather than develop a software library, I would do stuff like that, too, even if it means dumping GBs in my /tmp directory.) So we'd have to choose a binary serialization format. Maybe we could put that to a random number generator, just to get unstuck from the choice paralysis.

However, beyond that, there's still another thing that I would consider desirable: being able to access some fields in the metadata without reading, decompressing, or interpreting all of the bin data. Histograms can be big, but there can also be a lot of them. It would be very common to want to scan through a million histograms to verify that they all have the same binning and can therefore be merged, before loading any of the bin values.
Aghast tried to address this with Flatbuffers, which use indirection to allow everything to be load-on-demand (you can load any field without loading all of the fields), but my experience with that is that Flatbuffers underspecifies how data are laid out in the buffer. The programmer using Flatbuffers has to make a lot of decisions (perhaps to allow microoptimizations of which fields are in the same CPU cache lines...). The fine granularity of reading each field independently of every other field in Flatbuffers is overkill for this purpose. It would be sufficient to have two reading stages: (1) read all the metadata and (2) read all the bin contents (and maybe put irregular edges in the second stage, too). This is where the idea of having two blobs comes in.

I should point out that `ak.to_buffers`/`ak.from_buffers` works this way, and it has been a great abstraction, in the sense that it has solved problems we didn't originally intend it for. In general, any Awkward Array can be broken down into one JSON object and a set of named binary buffers:

```python
>>> import awkward as ak
>>> array = ak.Array([
... [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
... [],
... [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
... ] * 1000000)
>>> form, length, blobs = ak.to_buffers(array)
>>> print(f'{{"length": {length}, "form": {form.to_json()}}}')
{"length": 3000000, "form": {"class": "ListOffsetArray", "offsets": "i64", "content": {"class": "RecordArray",
"fields": ["x", "y"], "contents": [{"class": "NumpyArray", "primitive": "float64", "inner_shape": [], "parameters":
{}, "form_key": "node2"}, {"class": "ListOffsetArray", "offsets": "i64", "content": {"class": "NumpyArray",
"primitive": "int64", "inner_shape": [], "parameters": {}, "form_key": "node4"}, "parameters": {}, "form_key":
"node3"}], "parameters": {}, "form_key": "node1"}, "parameters": {}, "form_key": "node0"}}
>>> blobs
{
'node0-offsets': array([ 0, 3, 3, ..., 4999998, 4999998, 5000000]),
'node2-data': array([1.1, 2.2, 3.3, ..., 3.3, 4.4, 5.5]),
'node3-offsets': array([ 0, 1, 3, ..., 14999991, 14999995, 15000000]),
'node4-data': array([1, 1, 2, ..., 3, 4, 5]),
}
>>> ak.from_buffers(form, length, blobs)
<Array [[{x: 1.1, y: [1]}, ..., {...}], ...] type='3000000 * var * {x: floa...'>
```

Being able to separate the small, richly-typed metadata from the large, simply-typed binary blobs is a great building block for making formats. For instance, we can put these things in HDF5, use them as a pickle format, and use them as a way to hand data over the border between C and Python: you only need a string (JSON) for the small, richly-typed metadata and a mapping from names to pointers for the large, simply-typed binaries. But this is a building block, not a whole format.

Another desirable feature is the ability to put these things in existing container formats, with more than one histogram per container. ROOT users have benefited from the ability to make collections of histograms, sometimes organized in directories, sometimes accompanying non-histogram data, all in a single file that can be emailed (or CERNBoxed, or whatever). For containers, ROOT, HDF5, and ZIP all seem like good candidates, and if the metadata/blobs split is well defined, the choice of container can even be left open. So it's a two-step serialization:

```mermaid
graph TD;
    A(histogram in memory)-->B(metadata object and named binary blobs);
    B-->C(physical file: ROOT, HDF5, ZIP);
```
For ROOT, the natural choice for a binary blob is a TBasket, and the "name" in this case would be a seek location. The rich metadata would be more natural as class instances than JSON, since ROOT has a lot of tools for inspecting them (and Uproot should automatically be able to read them, if they're not too complicated). A ROOT user would see a histogram as an object in a directory and (unlike TH*) be able to read the metadata of many histograms without the cost of loading and decompressing all the bin contents. Just as a TTree delays the reading of its TBaskets, the boost-histograms would delay the reading of their binary data.

For HDF5, the natural choice to hold everything for one histogram is a Group, with the rich metadata as a JSON-valued attribute for that Group (a sketch of this is below).

For ZIP, all of the binary blobs, including the metadata, would be files within the ZIP archive. All the data for one histogram can be grouped in a directory (the directory name is the histogram name), and the metadata can have a special name within the directory.

To wrap this up, it's precisely because histograms are big, and also because there are often many of them, that I'm suggesting a split between the metadata and the bins/edges. Once you have that split, it becomes reasonable to consider multiple container formats, just as the metadata-plus-buffers building block in Awkward Array can be carried by several containers.
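A minimal sketch of that HDF5 mapping with h5py (the group, dataset, and attribute names are placeholders, not a finalized layout):

```python
import json

import h5py
import numpy as np

metadata = {"axes": [{"type": "variable"}], "storage": {"type": "double"}}  # hypothetical fields

with h5py.File("histograms.h5", "w") as f:
    g = f.create_group("myhist")
    g.attrs["metadata"] = json.dumps(metadata)                        # rich metadata as a JSON attribute
    g.create_dataset("edges", data=np.array([0.0, 0.3, 1.0, 10.0]))   # O(n) numerical data as Datasets
    g.create_dataset("values", data=np.zeros(5), compression="gzip")

# Scanning many histograms only touches the attributes, not the (compressed) bin contents.
with h5py.File("histograms.h5", "r") as f:
    for name, group in f.items():
        meta = json.loads(group.attrs["metadata"])
```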
-
See scikit-hep/uhi#105 for the proposed standard.
-
There are two competing requirements for a common serialization format.
a) fast, efficient, opaque
b) slow, inefficient, transparent (human-readable)
The common serialization format should allow Boost.Histogram in C++ to exchange data with boost-histogram in Python, among other things.
TODO: Add options which we have discussed so far with pros and cons.
TODO for Hans: Explain how the Boost.Serialization protocol works; we already use and share it between C++ and Python and should continue using it.