
Write performance with millions of rows #492

Open
jonr667 opened this issue May 25, 2020 · 2 comments

jonr667 commented May 25, 2020

Hi folks,
First thank you for all the hard work that has gone into Uproot, it's pretty amazing!

I'm currently experiencing some performance issues with writing out TTrees and I'm wondering if I'm just doing something wrong. My event detector combines multiple instrument hits into variable-size events. In memory I'm just storing these as a list of dicts, such as

[{'pulse_height': [343, 43], 'chan': [2, 1], 'timestamp': 12345678, 'hit_count': 2},
 {'pulse_height': [1234], 'chan': [1], 'timestamp': 12345679, 'hit_count': 1}]

Currently I'm writing out around 1.4 million entries from the list. I use a buffer, settable at run time, that controls how many of these are flushed to the ROOT file at a time. I first run one command to get a file handle:

file_handle = uproot.recreate(output_filename)
# "hit_count" serves as the counter branch giving each event's length
file_handle["EVENT_NTUPLE"] = uproot.newtree(
    {"pulse_height": uproot.newbranch(numpy.dtype(">i8"), size="hit_count"),
     "chan": uproot.newbranch(numpy.dtype(">i8"), size="hit_count"),
     "timestamp": uproot.newbranch(numpy.dtype(">i8"))},
    compression=None)

and write each buffer via:

# convert the buffered list of dicts to a jagged awkward array (awkward0)
a = awkward.fromiter(particle_events)
file_handle["EVENT_NTUPLE"].extend({"pulse_height": a.contents["pulse_height"],
                                    "chan": a.contents["chan"],
                                    "timestamp": a.contents["timestamp"],
                                    "hit_count": a.contents["hit_count"]})

I currently set the buffer size to 100,000 events, so the write code above is called once per block of 100,000 entries. Ideally I'd like to raise this buffer to several million for some multiprocessing performance gains in earlier parts of the code. However, this currently runs quite slowly, and the larger I make the buffer beyond ~10,000 events, the slower it gets. Writing out 1.4 million events in 20k chunks takes more than 4 minutes, even with PyPy3 (a sketch of the full loop follows the version output below).

(pyuproot36) vagrant@decayspec-analysis:/vagrant/decayspec_midas2root$ pypy3 --version
Python 3.6.9 (?, Apr 10 2020, 19:47:05)
[PyPy 7.3.1 with GCC 7.3.0]
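
For concreteness, the buffered write loop is essentially the following sketch (BUFFER_SIZE is a stand-in name for the run-time setting):

BUFFER_SIZE = 100_000  # run-time buffer setting (stand-in name)

# flush the accumulated events to the ROOT file one buffer at a time
for start in range(0, len(particle_events), BUFFER_SIZE):
    a = awkward.fromiter(particle_events[start:start + BUFFER_SIZE])
    file_handle["EVENT_NTUPLE"].extend({"pulse_height": a.contents["pulse_height"],
                                        "chan": a.contents["chan"],
                                        "timestamp": a.contents["timestamp"],
                                        "hit_count": a.contents["hit_count"]})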

It seems to do the same with normal Python 3.6 from Anaconda. I've also tried using the basket method directly:

file_handle["EVENT_NTUPLE"]["pulse_height"].newbasket(a.contents["pulse_height"])
file_handle["EVENT_NTUPLE"]["chan"].newbasket(a.contents["chan"])
file_handle["EVENT_NTUPLE"]["timestamp"].newbasket(a.contents["timestamp"])
file_handle["EVENT_NTUPLE"]["hit_count"].newbasket(a.contents["hit_count"])

with no discernible gain. Doing the same with the normal ROOT Python interface takes ~40 seconds; converting to a pandas DataFrame and using either to_hdf or to_csv also takes ~40 seconds.
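
For reference, the pandas baseline was along these lines (a sketch; the exact flattening is approximate, with the jagged columns kept as object columns of Python lists):

import numpy
import pandas

# approximate pandas baseline for the same 1.4 million events
df = pandas.DataFrame({"pulse_height": [list(x) for x in a.contents["pulse_height"]],
                       "chan": [list(x) for x in a.contents["chan"]],
                       "timestamp": numpy.asarray(a.contents["timestamp"]),
                       "hit_count": numpy.asarray(a.contents["hit_count"])})
df.to_csv("events.csv")  # ~40 seconds; to_hdf is in the same ballpark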

Maybe I'm just doing something silly, or is there still quite a bit of work needed on the TTree-writing component?

Thank you!
-jon

@jpivarski
Member

Without having a chance to look specifically into this, I should point out that performance was a lower priority for writing than it was for reading (because writing is a much more constrained problem). As written data grow beyond initially prescribed boundaries, objects need to be rewritten and all pointers to them need to be updated to keep the file consistent.

That said, I'll be taking a look at the writing code soonish, integrating it into Uproot4. I'm not expecting to find performance bugs (mistakes that should be fixed, "premature optimization" aside, like Shlemiel the painter’s algorithm). However, if there's something fundamental to fix or even just small tweaks, they'll be implemented in the new code, taking the original code as a correctness baseline.

So to answer your question, you shouldn't be thinking of the writing component as a performance-first thing. It exists for compatibility, though I'll be giving it an end-to-end review soon, along with everything else.
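
For later readers: the rewritten TTree-writing path mentioned above eventually landed in Uproot 4. A minimal sketch of that API, assuming Uproot >= 4.1 and Awkward Array 1 (not code from this thread):

import awkward as ak
import uproot

with uproot.recreate("events.root") as f:
    # jagged branches are declared with Awkward type strings;
    # the counter branch is created automatically
    f.mktree("EVENT_NTUPLE", {"pulse_height": "var * int64",
                              "chan": "var * int64",
                              "timestamp": "int64"})
    f["EVENT_NTUPLE"].extend({"pulse_height": ak.Array([[343, 43], [1234]]),
                              "chan": ak.Array([[2, 1], [1]]),
                              "timestamp": [12345678, 12345679]})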

@jonr667
Author

jonr667 commented May 26, 2020

No worries, thank you for your response. Figured I'd at least check to make sure I wasn't doing anything terrible.
