
Write performance with millions of rows #492

Open
jonr667 opened this issue May 25, 2020 · 2 comments

jonr667 commented May 25, 2020

Hi folks,
First thank you for all the hard work that has gone into Uproot, it's pretty amazing!

I'm currently experiencing some performance issues with writing out TTrees and I'm wondering if I'm just doing something wrong. My event detector combines multiple instrument hits into variable-size events. In memory I'm just storing these as a list of dicts, such as

[{'pulse_height': [343, 43], 'chan': [2, 1], 'timestamp': 12345678, 'hit_count': 2},
 {'pulse_height': [1234], 'chan': [1], 'timestamp': 12345679, 'hit_count': 1}]

Currently I'm writing out around 1.4 million entries from the list. I use a buffer, settable at run time, that controls how many of these are flushed to the ROOT file at a time. I first run one command to get a file handle:

file_handle = uproot.recreate(output_filename)
# "hit_count" serves as the counter branch giving each event's length
file_handle["EVENT_NTUPLE"] = uproot.newtree(
    {"pulse_height": uproot.newbranch(numpy.dtype(">i8"), size="hit_count"),
     "chan": uproot.newbranch(numpy.dtype(">i8"), size="hit_count"),
     "timestamp": uproot.newbranch(numpy.dtype(">i8"))},
    compression=None)

and write each buffer via:

# convert the buffered list of dicts to a jagged awkward array (awkward0)
a = awkward.fromiter(particle_events)
file_handle["EVENT_NTUPLE"].extend({"pulse_height": a.contents["pulse_height"],
                                    "chan": a.contents["chan"],
                                    "timestamp": a.contents["timestamp"],
                                    "hit_count": a.contents["hit_count"]})

I currently set the buffer size to 100,000 events, so the write code above is called once per block of 100,000 entries. Ideally I'd like to raise this buffer to several million for some multiprocessing performance gains in earlier parts of the code. However, this currently runs quite slowly, and the larger I make the buffer beyond ~10,000 events, the slower it gets. Writing out 1.4 million events in 20k chunks takes more than 4 minutes, even with PyPy3 (a sketch of the full loop follows the version output below).

(pyuproot36) vagrant@decayspec-analysis:/vagrant/decayspec_midas2root$ pypy3 --version
Python 3.6.9 (?, Apr 10 2020, 19:47:05)
[PyPy 7.3.1 with GCC 7.3.0]
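
For concreteness, the buffered write loop is essentially the following sketch (BUFFER_SIZE is a stand-in name for the run-time setting):

BUFFER_SIZE = 100_000  # run-time buffer setting (stand-in name)

# flush the accumulated events to the ROOT file one buffer at a time
for start in range(0, len(particle_events), BUFFER_SIZE):
    a = awkward.fromiter(particle_events[start:start + BUFFER_SIZE])
    file_handle["EVENT_NTUPLE"].extend({"pulse_height": a.contents["pulse_height"],
                                        "chan": a.contents["chan"],
                                        "timestamp": a.contents["timestamp"],
                                        "hit_count": a.contents["hit_count"]})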

It seems to do the same with normal Python 3.6 from Anaconda. I've also tried using the basket method directly:

file_handle["EVENT_NTUPLE"]["pulse_height"].newbasket(a.contents["pulse_height"])
file_handle["EVENT_NTUPLE"]["chan"].newbasket(a.contents["chan"])
file_handle["EVENT_NTUPLE"]["timestamp"].newbasket(a.contents["timestamp"])
file_handle["EVENT_NTUPLE"]["hit_count"].newbasket(a.contents["hit_count"])

with no discernible gain. Doing the same with the normal ROOT Python interface takes ~40 seconds; converting to a pandas DataFrame and using either to_hdf or to_csv also takes ~40 seconds.
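
For reference, the pandas baseline was along these lines (a sketch; the exact flattening is approximate, with the jagged columns kept as object columns of Python lists):

import numpy
import pandas

# approximate pandas baseline for the same 1.4 million events
df = pandas.DataFrame({"pulse_height": [list(x) for x in a.contents["pulse_height"]],
                       "chan": [list(x) for x in a.contents["chan"]],
                       "timestamp": numpy.asarray(a.contents["timestamp"]),
                       "hit_count": numpy.asarray(a.contents["hit_count"])})
df.to_csv("events.csv")  # ~40 seconds; to_hdf is in the same ballpark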

Maybe I'm just doing something silly, or is there still quite a bit of work needed on the TTree-writing component?

Thank you!
-jon

@jpivarski
Member

Without having a chance to look specifically into this, I should point out that performance was a lower priority for writing than it was for reading (because writing is a much more constrained problem). As written data grow beyond initially prescribed boundaries, objects need to be rewritten and all pointers to them need to be updated to keep the file consistent.

That said, I'll be taking a look at the writing code soonish, integrating it into Uproot4. I'm not expecting to find performance bugs (mistakes that should be fixed, "premature optimization" aside, like Shlemiel the painter’s algorithm). However, if there's something fundamental to fix or even just small tweaks, they'll be implemented in the new code, taking the original code as a correctness baseline.

So to answer your question, you shouldn't be thinking of the writing component as a performance-first thing. It exists for compatibility, though I'll be giving it an end-to-end review soon, along with everything else.
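
For later readers: the rewritten TTree-writing path mentioned above eventually landed in Uproot 4. A minimal sketch of that API, assuming Uproot >= 4.1 and Awkward Array 1 (not code from this thread):

import awkward as ak
import uproot

with uproot.recreate("events.root") as f:
    # jagged branches are declared with Awkward type strings;
    # the counter branch is created automatically
    f.mktree("EVENT_NTUPLE", {"pulse_height": "var * int64",
                              "chan": "var * int64",
                              "timestamp": "int64"})
    f["EVENT_NTUPLE"].extend({"pulse_height": ak.Array([[343, 43], [1234]]),
                              "chan": ak.Array([[2, 1], [1]]),
                              "timestamp": [12345678, 12345679]})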

@jonr667
Author

jonr667 commented May 26, 2020

No worries, thank you for your response. Figured I'd at least check to make sure I wasn't doing anything terrible.
