Arrow/Feather #37

I'd like to propose that we evaluate the feasibility of supporting the faster Arrow <https://arrow.apache.org/>-based data format. - Elizabeth Sall

Proposal based on:
I second this proposal! Performance is the main problem with OMX as written. Coincidentally, I was thinking of building an Arrow-based Python proof of concept for this just last week. Did your proposal come out of some other conversations recently?
..b
I would love for the next iteration of OMX to be based on Arrow, but is the objective of OMX to be used in production now?
That's a good question for the organizing group (which is who, these days?). In practice, it is being used in production.
I also use it in production and made AequilibraE capable of using it as well. However, if the OMX mission changes, then I would say it would be worth it to explore other data formats to make sure we get it right. Also, would we ask software providers to switch to the new format? Or will we support both? - Pedro Camargo
I don't see improved performance of OMX as being a change of mission! Our tech should be useful and frictionless, to help spur adoption. Existing OMX files have a "VERSION 1" key embedded in them, precisely because we wanted the format to be changeable if the need arose. We always knew that the performance of HDF5 is not great because of its slow compression library. There just weren't better alternatives at the time.
I think that supporting an Arrow-based format and other formats in the future is probably necessary if OMX is to endure as anything more than an exchange format. The spec would have to become more abstract. One issue will be how specific the spec should be about data structure. For example, it is my (limited) understanding that Arrow supports storage of tabular data in columnar format, where each column can store a different data type. This is the approach that VisionEval takes. OMX stores matrix data in a matrix format. So what should the spec say in that regard? There might need to be a part of the specification to deal with each type of backend that is supported: if HDF5, how structured; if Arrow, how structured; etc. Or maybe it is entirely functional, identifying functions that must be supported.
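(For illustration, a minimal sketch of the two layouts contrasted above, using pyarrow; the file names, column names, and zone count are made up here, not part of any spec:)

```python
import numpy as np
import pyarrow as pa
import pyarrow.feather as feather

zones = 4
skim = np.random.rand(zones, zones).astype(np.float32)

# Columnar/tabular layout (the VisionEval-style approach): one row per O-D
# pair, and each column can have its own data type.
long_table = pa.table({
    "origin": np.repeat(np.arange(zones), zones),
    "destination": np.tile(np.arange(zones), zones),
    "DIST": skim.ravel(),
})
feather.write_feather(long_table, "skims_long.feather")

# Matrix layout (the OMX-style approach): each named matrix is one flattened
# column; the zone count is needed to restore the square shape on read.
wide_table = pa.table({"DIST": skim.ravel()})
feather.write_feather(wide_table, "skims_wide.feather")
restored = (feather.read_table("skims_wide.feather")["DIST"]
            .to_numpy().reshape(zones, zones))
```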
@billyc, I was referring to using OMX in production and not as a common format for transfer between platforms (the latter was my understanding of the mission, but I am probably mistaken and remember only part of it).
I like this idea and I like the idea of discussing this. Anyone interested in discussing, please comment on this thread and then we can brainstorm next steps - maybe a meeting to discuss, maybe a prototype, etc. Thanks!
If we're thinking about a next version, let's include other potential ideas as well - more flexibility, more data types, better API conformity, CI for testing APIs, better viewers, etc.
@bstabler - perhaps:
Apparently I was not "watching" and didn't see this conversation initially. Count me in 👍
How interesting! HDF5 is primarily a disk storage format, with an option to force in-memory. Arrow is exclusively an in-memory format, right? So the two are complementary. I've never been a big fan of HDF5, but I don't see Arrow as a way to get away from HDF5. Arrow sure would be nice to enable us to use higher-performance libraries and not have to go through disk storage just to work in another platform or language for a bit.
Feather is its on-disk complement.
And Arrow+Feather is ridiculously fast...
Did some noodling on this over the weekend. +1 to ridiculously fast... not just "I don't want to wait while the data saves to disk" fast, but bordering on "I don't need to load skims into RAM to use them" fast.
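(A minimal sketch of why it can feel that way: Feather files can be opened memory-mapped, so opening is near-instant and only touched data is read. This assumes pyarrow and a file written with compression="uncompressed", since mapping a compressed file still means decompressing into RAM; the file and column names are placeholders:)

```python
import pyarrow.feather as feather

# Opening a memory-mapped Feather file is near-instant regardless of file
# size: no matrix data is read from disk until a column is actually touched.
table = feather.read_table("skims.feather", memory_map=True)

# Materializing one column pulls in roughly that one matrix, not the file.
dist = table["DIST"].to_numpy()
```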
Talk is cheap. Here instead is a straw man proposal for you all to beat around a bit. https://github.com/jpn--/arrowmatrix |
Quite impressive results and effort, @jpn--!
The development of the PyTables project (on which OMX relies) seems to be quite slow these days, and there doesn't seem to be any hurry in supporting the newly released Python 3.9.
I wouldn't worry too hard about not having wheels out on PyPI supporting 3.9 yet. The same applies to plenty of other relevant and very active projects <cough>pyarrow</cough>. Both have 3.9 support on conda-forge.
My concern is a little more with the frequency of updates to the library, @jpn--, but you are right that the 3.9 release in itself is nothing to worry about for now.
Dear Pedro and Jeffrey, thanks to @avalentino, PyTables-cp39 wheels for Linux are available on PyPI now. See also PyTables/PyTables#823 (comment). With kind regards,
Has anybody looked further into this change? PyTables still does not have wheels for Python 3.9 for either Windows or macOS, so I would say that the case for migrating to Arrow is getting even better...
@toliwaga did some further comparisons of HDF5 versus Arrow/Feather for ActivitySim, and the performance gains were not great. If I recall correctly, the gains for the use case of reading several full matrices into RAM (which is what we typically do for activity-based models, because we need random access to hundreds of millions of cells as fast as possible) were underwhelming. Maybe @toliwaga can add some more details? Nevertheless, I'm supportive of developing and releasing an updated version of OMX, say v0.3, that supports either HDF5 or Arrow/Feather, because the latter is popular, supported, and faster in some additional use cases.
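(For anyone who wants to reproduce that comparison, a sketch of the load-everything-into-RAM use case described above, assuming the openmatrix and pyarrow packages; file and matrix names are placeholders, and timings will vary by machine:)

```python
import time

import openmatrix as omx
import pyarrow.feather as feather

# Read every matrix in an OMX (HDF5) file fully into RAM.
t0 = time.perf_counter()
f = omx.open_file("skims.omx", "r")
hdf5_mats = {name: f[name][:] for name in f.list_matrices()}
f.close()
print(f"HDF5/OMX full read: {time.perf_counter() - t0:.1f}s")

# Read the same matrices from a Feather file, one flattened column each.
t0 = time.perf_counter()
table = feather.read_table("skims.feather")
arrow_mats = {name: table[name].to_numpy() for name in table.column_names}
print(f"Feather full read:  {time.perf_counter() - t0:.1f}s")
```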
It would be great to see the results of those comparisons here, if @toliwaga is willing to share them. Otherwise someone will probably ask for them again :-)
My concern, besides the fact that HDF5 has lost a lot of momentum in favor of more modern formats such as Arrow and Feather, is that the use case of just loading all arrays from disk once is a rather narrow one, @billyc
Fully agree. Even within the scope of a travel model, there are lots of uses for the matrices used/created in travel models beyond "running the actual model". I'm surprised that there wasn't a significant amount of time saved. Based on some of what I've read, there should be time saved on the read/write side in addition to I/O, as well as significant RAM improvements. The RAM improvements alone should be something to consider, as they could reduce the need for specialized "modeling machines". Another thing to consider is whether Arrow/Feather is the right "storage" mechanism beyond intra-run use, or whether Parquet (which is considered "archival") is. Ideally OMX would deal with either.
Beyond all the above reasons, HDF5 doesn't have any bindings for JavaScript (and likely never will) -- so it's literally impossible to access OMX skims from front-end browser code without relying on a node server to broker the requests. It sounds like we have more than enough justification to at least keep exploring this.
The work by @toliwaga on this was in the context of ActivitySim. Overall, the time spent loading HDF5 OMX data in an ActivitySim model is tiny compared to the runtime of the whole model -- cutting the plain load time from, say, 50 seconds to 10 seconds (not @toliwaga's results, just some approximate numbers from what I've played with) doesn't matter much when running everything else takes hours, and that makes it not worth a ton of development effort on the part of the ActivitySim consortium. But as we all agree, that's just one use case.

So I'd like to invite all of you who are interested to look at the straw man proposal I put forth a few months ago, and particularly the implementation details. Post here some thoughts about what's good and what's bad in there. From some more concrete thoughts, perhaps we can move past "yes, we should talk more about this" to actually outlining a new set of principles we want to pursue in the next version of the standard.
Sorry to be so slow in responding - I took a very long (and wonderful) summer vacation and am only just sorting through all the stuff that happened while I was away.

I agree with @jpn-- that the ActivitySim use case is not representative, and so my observations may have little bearing on this question. ActivitySim is a long-running program with many models that do repeated lookups of various skims. The ordinary use case is that ActivitySim loads all of the skims into memory once at the start of the run and stores them in a large 3-dimensional numpy array (which is stored in shared memory when multiprocessing). The various models access individual skims or skim sets (e.g. drive time for different time periods) via wrappers designed for convenience and legibility in expression files. The initial load time is not very important - what is important is that subsequent skim references are fast and are stored in a way that can be shared across processes.

@jpn-- presented a straw man proposal that, in addition to other possible advantages, suggested that it might be possible to avoid the runtime and memory overhead of preloading the skims and instead read them just-in-time for skim lookup. The example showed both good performance and a promising near-zero memory footprint. I played around with that approach to see whether Feather files could serve as an alternative to in-memory skims.

The first problem I ran into was that accessing all the skims would eventually bring all the skim data into memory. As the numpy documentation says, "Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory." This didn't show up in the example @jpn-- provided, because it accessed the same skim repeatedly, so the gradual increase in memory usage wasn't apparent. I couldn't find any way to free the memory short of opening and closing the file at every access - which slowed the process down.

However, the rapidity of Feather file opening suggested a different, analogous approach, which I then explored. I implemented a numpy memmapped skim_dict class as an alternative to the existing ActivitySim in-memory array version. By opening and closing the memmap file just-in-time to perform skim or skim_stack lookups, the memmap implementation avoided the 'leakage' associated with Jeff's approach - at the expense of redundant (albeit rapid) loads of skim data. This resulted in a zero-overhead skim implementation with runtime performance 'only' 60% slower than in-memory skims. (A runtime handicap that could possibly be compensated for by the reduced memory requirements in certain implementations.) This is worth exploring - I should think it might be of interest to MPOs with truly gigantic skims, especially if they are more constrained on the memory side than on the processor side.
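(A stripped-down sketch of the just-in-time memmap pattern described above; this is not the actual ActivitySim MemMapSkimFactory code, and the cache layout, dtype, and names here are hypothetical:)

```python
import numpy as np

N_SKIMS, ZONES = 300, 3000          # hypothetical sizes
CACHE = "skim_cache.mmap"           # hypothetical cache file name

def build_cache(skims):
    """Write an (n_skims, zones, zones) float32 cache file once, up front."""
    mm = np.memmap(CACHE, dtype=np.float32, mode="w+",
                   shape=(N_SKIMS, ZONES, ZONES))
    mm[:] = skims
    mm.flush()
    del mm                           # closing the memmap releases its pages

def lookup(skim_idx, orig, dest):
    """Open the memmap just-in-time, copy out only the requested cells,
    and close it again so no skim data stays resident between lookups."""
    mm = np.memmap(CACHE, dtype=np.float32, mode="r",
                   shape=(N_SKIMS, ZONES, ZONES))
    values = mm[skim_idx, orig, dest]  # fancy indexing returns a copy
    del mm
    return values
```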
Disabling the tap-tap utility calculation (rebuild_tvpb_cache: False) shows that the memory requirements for a 32-processor tour_mode_choice model run are strikingly low: with MemMapSkimFactory, the total memory requirement for the 32-processor tour_mode_choice model step is 145GB, or under 5GB per process.

This is all - last I checked - easily turned on and off by simply changing the skim_dict_factory setting in network_los.yaml from NumpyArraySkimFactory (the default) to MemMapSkimFactory, as in the snippet below. This will cause ActivitySim to create a numpy memmap cache file (if it does not already exist) which it opens and closes just-in-time for each skim access. This should work in either single- or multi-process mode. This was never really exhaustively tested, because it was just a little side project I did on my own time - not something that was part of the funded development effort.
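(Per the description above, the toggle in network_los.yaml would look roughly like this; a sketch of the one setting, not the full file:)

```yaml
# Default: preload all skims into an in-memory numpy array.
# skim_dict_factory: NumpyArraySkimFactory

# Alternative: just-in-time memmap lookups, near-zero resident skim memory.
skim_dict_factory: MemMapSkimFactory
```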
Anyone eager to get something going on this topic? I've been too busy to move this along. Thanks.