Kaitai and scientific data #711
Comments
#65, #277, #44 and kaitai-io/kaitai_struct_python_runtime#25 can be relevant to the task, and probably so can finding and solving the issues with qtproject/installer-framework#7 (parsing a small struct in a 3 GiB file consumes more than 12 GiB of RAM, which is clearly unacceptable; the struct is just metadata containing the boundaries of 7z files in a big file, and the actual data occupying the majority of the file is not read). I don't know exactly, since no one has implemented it and tested it in reality, but to me "laying out structures over memory-mapped files" sounds like a prerequisite for "storing data on the medium, not in RAM", which sounds like an absolute prerequisite for "handling large data files" (of course we can load the chunks ourselves using
@KOLANICH I suspect you're exactly right that some of the features we need require memory-mapping files. In some cases we want to read only select portions of a file, and memory mapping (or at least partial mapping) seems like a prerequisite. Several physics libraries do manage this by providing a My group is starting out by adding a columnar data store target (scikit-hep’s awkward1.0) to Kaitai. This iteration reads the entire file into memory, which is okay for some scientific data. But there are lots of GB-scale files out there that we’d like to be able to read (or read selectively) in Python and scan through with the web and Ruby viewers, so memory mapping seems like something we’ll have to do.
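To make the memory-mapping idea above concrete, here is a minimal Python sketch using only the standard library. It is not generated by the Kaitai compiler and does not use any Kaitai runtime API. The file layout (a 4-byte magic followed by a payload offset and length), the `HEADER` format, and the helper names `read_header`, `read_payload_slice`, and `make_demo_file` are all hypothetical, invented for illustration. The point is that parsing the header touches only the mapped pages that are actually accessed, rather than reading the whole file into RAM.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout, invented for this sketch: a 4-byte magic,
# then a little-endian u4 payload offset and a u4 payload length.
HEADER = struct.Struct("<4sII")


def read_header(path):
    """Parse only the header through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # unpack_from reads straight from the mapped buffer.
            return HEADER.unpack_from(mm, 0)


def read_payload_slice(path, offset, length):
    """Copy out just the requested byte range, not the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[offset:offset + length])


def make_demo_file():
    """Write a small demo file in the hypothetical layout above."""
    payload = b"event-data"
    padding = b"\x00" * 1024  # stands in for data we never want to read
    offset = HEADER.size + len(padding)
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(HEADER.pack(b"DEMO", offset, len(payload)))
        f.write(padding)
        f.write(payload)
    return path
```

A caller would first use `read_header` to get the boundaries and then `read_payload_slice` to fetch only the region of interest; whether a generated parser could work this way depends on runtime support that this thread describes as not yet implemented.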
@pibion Apologies for the late reply; unfortunately, I get very little time to spend on KS nowadays.
That is true, and for some tools (like WebIDE) it is not likely to change due to how browsers' local storage works. We can probably plan some of the relevant work for the more desktop-based tools (e.g. ksv, kaitai_struct_gui).
Overall, the idea of KS is to describe the structure of the data, not to generate the nicest API that will be "easiest to use", "most performant", etc., as the majority of these asks are very relevant to a particular task, and not to the structure of the format itself. That said, I agree that there's massive room for improvement in terms of adding certain hints to the compiler to generate more optimal code.
Any kind of contribution would be great; the only problem I see is that I personally won't be able to spend a lot of time reviewing and curating these contributions. We had some previous contribution attempts from unmotivated students, and these, unfortunately, didn't go so well.
We can plan a voice chat if you want. Please contact me at [email protected] if you want to arrange that.
@GreyCat this is a very prompt reply for my community :)
Targeting desktop tools is exactly what we had in mind. A GB-enabled WebIDE would be amazing, but that's beyond our immediate scope.
This is exactly what drew me to KS initially. The only other descriptive data format I've encountered is DFDL, which uses XML rather than YAML. I find that DFDL isn't as readable as Kaitai (although it's not so bad once you get used to it). But the real issue is that DFDL isn't designed to support multiple languages like Kaitai, and we physicists love our C++ and Python. Currently I have a student who's working to write a new compiler that builds C++ code that stores data in scikit-hep's AwkwardArrays rather than in C++ objects. As you mention, this is probably useful for only some people: those with data that represents many discrete "events" might find it useful.
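The difference between per-event objects and a columnar store for discrete "events" can be sketched in plain Python. This is only an illustration of the general offsets-plus-content layout used by columnar formats. It is not the Awkward Array API and not output of the compiler mentioned above. The event fields (`timestamp`, `samples`) and the helpers `to_columns` and `samples_of` are hypothetical.

```python
from array import array

# Hypothetical events: each has a timestamp and a variable number of samples.
events = [
    {"timestamp": 10, "samples": [1.0, 2.0]},
    {"timestamp": 20, "samples": []},
    {"timestamp": 30, "samples": [3.0, 4.0, 5.0]},
]


def to_columns(events):
    """Flatten per-event records into contiguous typed columns."""
    timestamps = array("q")      # one entry per event
    offsets = array("q", [0])    # offsets[i]:offsets[i+1] bounds event i
    content = array("d")         # all samples from all events, back to back
    for ev in events:
        timestamps.append(ev["timestamp"])
        content.extend(ev["samples"])
        offsets.append(len(content))
    return timestamps, offsets, content


def samples_of(i, offsets, content):
    """Recover the samples of event i from the flat content column."""
    return content[offsets[i]:offsets[i + 1]]
```

With this layout, a query such as "all timestamps" scans one contiguous array instead of walking every event object, which is the kind of access pattern the columnar target is aimed at.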
The data-format issue that Kaitai addresses is incredibly important for scientific data. But most of the Kaitai tools don’t handle GB-scale files efficiently, nor do they use data structures that are efficient for typical queries on these datasets.
I’m interested in applying for grants to fund work to make Kaitai fit more scientific-data use cases. For example, right now a student is working on adding scikit-hep’s awkward1.0 as a Kaitai target language.
Does a potential influx of two to three scientists working on developing some aspects of Kaitai fit into how you’d like to see Kaitai developed?
Would the Kaitai project leaders be interested in meeting and having a discussion about possible collaboration on grants?