Kaitai and scientific data #711

pibion · 2020-03-02T20:20:13Z

The data-format issue that Kaitai addresses is incredibly important for scientific data. But most of the Kaitai tools don’t handle GB-scale files efficiently, or use data structures that are efficient for typical queries on these datasets.

I’m interested in applying for grants to fund work to make Kaitai fit more scientific-data use cases. For example, right now a student is working on adding scikit-hep’s awkward1.0 as a target Kaitai language.

Does a potential influx of two to three scientists working on developing some aspects of Kaitai fit into how you’d like to see Kaitai developed?

Would the Kaitai project leaders be interested in meeting and having a discussion about possible collaboration on grants?

KOLANICH · 2020-03-02T20:40:59Z

Possibly relevant to the task: #65, #277, #44 and kaitai-io/kaitai_struct_python_runtime#25, and probably also finding and solving the issues with qtproject/installer-framework#7. There, parsing a small struct in a 3 GiB file consumes more than 12 GiB of RAM, which is clearly unacceptable: the struct is just metadata containing the boundaries of the 7z files inside a big file, and the actual data occupying the majority of the file is not read.

IDK exactly, since no one has implemented it and tested it in reality, but to me "laying out structures over memory-mapped files" sounds like a prerequisite for "storing data on the medium, not in RAM", which in turn sounds like an absolute prerequisite for "handling large data files". (Of course, we could load the chunks ourselves using read, but that feels more complex, more fragile and less performant, though it may give a somewhat smaller memory footprint in some cases.)
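To make "laying out structures over memory-mapped files" concrete, here is a minimal sketch using only the Python standard library. It is not how any Kaitai runtime works today. The record layout (`RECORD`) and the helper names `write_demo_file` and `read_record` are hypothetical and exist only for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-size record: little-endian u4 id followed by an f8 value.
RECORD = struct.Struct("<Id")


def write_demo_file(count):
    """Write `count` demo records to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for i in range(count):
            f.write(RECORD.pack(i, i * 0.5))
    return path


def read_record(path, index):
    """Decode record number `index` straight from a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # unpack_from decodes from the mapped buffer at the given offset;
            # the file is never copied into a Python bytes object as a whole.
            return RECORD.unpack_from(mm, index * RECORD.size)
```

With a layout like this, fetching one record from the end of a large file does not require reading the records before it; which pages actually get loaded is left to the operating system.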

pibion · 2020-03-05T18:48:26Z

@KOLANICH I suspect that you're exactly right that some of the features we need require memory-mapping files. In some cases we want to read only select portions of a file and memory mapping (or at least partial mapping) seems like a prerequisite.

Several physics libraries do manage this by providing a read function that allows loading chunks into memory. It would be nice to avoid but it is a somewhat-workable option for our community.
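As a rough illustration of that chunked-read approach (not taken from any particular physics library), a reader can pull a bounded number of fixed-size records into memory at a time. The record layout and the name `iter_records` are hypothetical. The sketch assumes a buffered binary stream, such as a file opened with `open(path, "rb")` or an `io.BytesIO`, where a short read only happens at the end of the data.

```python
import io
import struct

# Hypothetical fixed-size record: little-endian u4 id followed by an f8 value.
RECORD = struct.Struct("<Id")


def iter_records(stream, records_per_chunk=4096):
    """Yield decoded records while holding at most one chunk in memory."""
    chunk_bytes = RECORD.size * records_per_chunk
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        # Only whole records are decoded; a truncated tail is ignored.
        usable = len(chunk) - len(chunk) % RECORD.size
        yield from RECORD.iter_unpack(chunk[:usable])
        if usable != len(chunk):
            break
```

Peak memory is bounded by `records_per_chunk`, but the caller has to think about chunk boundaries, which is the extra complexity mentioned above.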

My group is starting out by adding a columnar data store target (scikit-hep’s awkward1.0) to Kaitai. This iteration reads the entire file into memory, which is okay for some scientific data.
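For readers unfamiliar with the term, "columnar" here means one array per field rather than one object per record. The sketch below shows the idea with the standard library only. It does not use awkward1.0 or its API, and the record layout and the name `to_columns` are hypothetical.

```python
import struct
from array import array

# Hypothetical fixed-size record: little-endian u4 id followed by an f8 value.
RECORD = struct.Struct("<Id")


def to_columns(buf):
    """Decode a buffer of whole records into one typed array per field."""
    ids = array("L")     # unsigned long, at least 4 bytes wide
    values = array("d")  # C double
    # iter_unpack requires len(buf) to be a multiple of RECORD.size.
    for rec_id, value in RECORD.iter_unpack(buf):
        ids.append(rec_id)
        values.append(value)
    return {"id": ids, "value": values}
```

A query such as "sum of all values" then touches a single contiguous array (`sum(columns["value"])`) instead of walking a list of per-record objects.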

But there are lots of GB-scale files out there that we’d like to be able to read (or read selectively) in Python and scan through with the web and Ruby viewers, so memory mapping seems like something we’ll have to do.

GreyCat · 2020-03-11T12:14:16Z

@pibion Apologies for the late reply; unfortunately, I get very little time to spend on KS nowadays.

> But most of the Kaitai tools don’t handle GB-scale files efficiently,

That is true, and for some tools (like WebIDE) it is not likely to change due to how browsers' local storage works. Probably we can plan some of the relevant work for more desktop-based tools (e.g. ksv, kaitai_struct_gui, etc.).

> or use data structures that are efficient for typical queries on these datasets.

Overall, the idea of KS is to describe the structure of the data, not to generate the nicest API that will be "easiest to use", "most performant", etc., as the majority of these asks are very relevant to a particular task, and not to the structure of the format itself. That said, I agree that there's massive room for improvement in terms of adding certain hints to the compiler to generate more optimal code.

> Does a potential influx of two to three scientists working on developing some aspects of Kaitai fit into how you’d like to see Kaitai developed?

Any kind of contribution would be great; the only problem I see is that I personally won't be able to spend a lot of time reviewing / curating these contributions. We had some previous contribution attempts from unmotivated students, and these, unfortunately, didn't go so well.

> Would the Kaitai project leaders be interested in meeting and having a discussion about possible collaboration on grants?

We can plan a voice chat if you want. Please contact me at [email protected] if you want to arrange that.

pibion · 2020-03-12T00:34:53Z

@GreyCat this is a very prompt reply for my community :)

> That is true, and for some tools (like WebIDE) it is not likely to change due to how browsers' local storage works. Probably we can plan some of the relevant work for more desktop-based tools (e.g. ksv, kaitai_struct_gui, etc.).

Targeting desktop tools is exactly what we had in mind. A GB-enabled WebIDE would be amazing, but that's beyond our immediate scope.

> Overall, the idea of KS is to describe the structure of the data, not to generate the nicest API that will be "easiest to use", "most performant", etc., as the majority of these asks are very relevant to a particular task, and not to the structure of the format itself.

This is exactly what drew me to KS initially. The only other descriptive data format I've encountered is DFDL, which uses XML rather than YAML. I find that DFDL isn't as readable as Kaitai (although it's not so bad once you get used to it). But the real issue is that DFDL isn't designed to support multiple languages like Kaitai, and we physicists love our C++ and Python.

Currently I have a student who's working to write a new compiler that builds C++ code that stores data in scikit-hep's AwkwardArrays rather than in C++ objects. As you mention, this is probably useful for only some people - people with data that represents many discrete "events" might find it useful.

generalmimon added the question label Jun 15, 2022
