This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Processing multiple root files #539

Open
ico1036 opened this issue Jul 9, 2021 · 2 comments
Comments

@ico1036

ico1036 commented Jul 9, 2021

What is the most efficient way to deal with multiple ROOT files (~100 GB in total) in uproot3 and uproot4?
I cannot find a tutorial about this.

I tried the lazy array but it takes a lot of time.

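# Imports (assumed: the Uproot 3 API under the name `uproot`, Awkward Array as `ak`)
import glob
import uproot3 as uproot
import awkward as ak

# PATH
dir_path = "/x4/cms/dylee/Delphes/data/root/signal/*/*.root"
file_list = glob.glob(dir_path)

# IO
cache = uproot.ArrayCache("2 GB")
events = uproot.lazyarrays(file_list, "Delphes", ["Electron*", "Muon*", "Photon*", "MissingET*"], cache=cache)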

# Define Particle arrays
Electron = ak.zip(
    {
        "PT": events["Electron.PT"],
        "Eta": events["Electron.Eta"],
        "Phi": events["Electron.Phi"],
        "T": events["Electron.T"],
        "Charge": events["Electron.Charge"],
    }
)

Also, I tried the iterator but I'm not sure this loop-based method is efficient (https://github.com/JW-corp/J.W_Analysis/blob/main/Uproot/test/big_data.py)

Thanks.

ico1036 changed the title from "Multiple files" to "Processing multiple root files" on Jul 9, 2021
@jpivarski
Member

Lazy arrays are good for interactive exploration, but the most efficient way to process multiple files with Uproot only is uproot.iterate (because it ensures that only a manageable amount of data is in memory at once).

I say "using Uproot only" because if you have a very large number of files, you'll want to distribute the job and run it in parallel. Uproot doesn't do that (as it's strictly an I/O library). Coffea Processors are a convenient way to do that in HEP.
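
For illustration, here is a minimal sketch of the uproot.iterate approach, assuming Uproot 4 (where `uproot.iterate` takes `filter_name` and `step_size`). The helper names `iterate_batches` and `summarize`, the branch list, and the "100 MB" step size are hypothetical choices for this example, not something prescribed in this thread.

```python
def iterate_batches(files, branches, step_size="100 MB"):
    """Yield one chunk of events at a time from many ROOT files.

    Assumes Uproot 4 or later. `files` may be a glob with the tree name
    appended after a colon, e.g. "some/dir/*.root:Delphes".
    """
    # Imported here so that `summarize` below can be used without Uproot.
    import uproot

    # Only the branches named in `branches` are read, and only about
    # `step_size` worth of data is held in memory per iteration.
    yield from uproot.iterate(files, filter_name=branches, step_size=step_size)


def summarize(batches):
    """Count batches and events; stands in for a real per-batch analysis."""
    n_batches = 0
    n_events = 0
    for batch in batches:
        n_batches += 1
        n_events += len(batch)  # number of events in this chunk
    return n_batches, n_events


# Hypothetical usage with the paths from the question:
# files = "/x4/cms/dylee/Delphes/data/root/signal/*/*.root:Delphes"
# branches = ["Electron.PT", "Electron.Eta", "Electron.Phi"]
# print(summarize(iterate_batches(files, branches)))
```

Each batch is processed and then dropped before the next one is read, so memory use is bounded by the step size rather than by the total size of the dataset.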

@ico1036
Author

ico1036 commented Jul 12, 2021

Thank you very much!
I tested this script and got the following results:

  • 47 files, 470,000 events
  • uproot3 with lazy: 153s
  • uproot3 with iterate: 54s
  • uproot4 with iterate: 22s
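
To put these timings in perspective, a small sketch that converts them into throughput and speedup relative to the lazy-array run. It uses only the numbers listed above; the helper names `throughput` and `speedup` are made up for this example.

```python
N_EVENTS = 470_000

# Wall-clock times in seconds, as reported above.
timings_s = {
    "uproot3 with lazy": 153,
    "uproot3 with iterate": 54,
    "uproot4 with iterate": 22,
}


def throughput(n_events, seconds):
    """Events processed per second."""
    return n_events / seconds


def speedup(baseline_s, seconds):
    """How many times faster a run is than the baseline run."""
    return baseline_s / seconds


baseline = timings_s["uproot3 with lazy"]
for name, seconds in timings_s.items():
    rate = throughput(N_EVENTS, seconds)
    gain = speedup(baseline, seconds)
    print(f"{name}: {rate:,.0f} events/s, {gain:.1f}x vs. lazy")
```

On these numbers, iterating is roughly 2.8 times faster than lazy arrays in uproot3, and uproot4 with iterate is roughly 7 times faster.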
