This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Extend uproot.newtree for use with pandas dataframes? #416

Open
adamdddave opened this issue Dec 4, 2019 · 6 comments


adamdddave commented Dec 4, 2019

Is it possible to extend the uproot.newtree functionality to take a pandas DataFrame as input? It seems possible, since the type to be written can be inferred from the dtype of each column. The only catch I see is columns with the object dtype ('O'), which could be ignored.

I guess one implementation could be

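import uproot

new_branches = list(df)  # column names of the DataFrame
new_branch_types = df.dtypes  # one NumPy dtype per column
# convert to uproot known types
new_tree = uproot.newtree(dict(zip(new_branches, new_branch_types)))
# fill each branch with the values of the matching column
new_tree.extend({k: df[k].values for k in new_branches})
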
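A minimal sketch of the "ignore object columns" step described above, assuming the column dtypes are already available as plain strings (for a real DataFrame, something like `df.dtypes.astype(str).to_dict()` should produce such a mapping). The helper name `writable_branch_types` is hypothetical and not part of uproot.

```python
def writable_branch_types(dtype_names):
    """Return {column: dtype name} for columns that can become simple branches.

    `dtype_names` maps column names to NumPy-style dtype names such as
    "int32" or "float64". Columns whose dtype is "object" are skipped,
    as proposed in the issue.
    """
    return {name: dtype for name, dtype in dtype_names.items() if dtype != "object"}


# Hypothetical column layout, standing in for a DataFrame's dtypes.
columns = {"x": "float64", "n": "int32", "label": "object"}
print(writable_branch_types(columns))
```

The resulting dictionary has the shape that the snippet above passes to uproot.newtree; whether every remaining dtype is accepted there is not checked by this sketch.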
@jpivarski
Member

In principle, TFile.__setitem__ could be made to detect when the right-hand side is a Pandas DataFrame and do the whole newtree/newbranches/extend cycle, the same way that it currently recognizes NumPy histograms.

That is, just as we can now

tfile["some_hist"] = (np.array(...), np.array(...))

we'd be able to

tfile["some_tree"] = pd.DataFrame(...)

That's not a bad idea. I'm leaving this as an open issue in case someone wants to take it up.

@adamdddave
Author

I can give it a try, though if someone beats me to it, all the better :)

@jpivarski
Member

That would be great, if you have a chance.

For reference, the histogram types are translated in a systematic way in uproot-methods: https://github.com/scikit-hep/uproot-methods/blob/9e98414d5c155fa902d13cf40d1c66dd0a1461d4/uproot_methods/convert.py#L14-L54

However, you probably can't just add this case there, because that mechanism converts each histogram type into an object with the fields uproot is looking for, with nothing dynamic. This conversion is a little different because the TTree interface is not just "__setitem__ and we're done" but "__setitem__ to initialize the structure, then extend to progressively fill it" (because TTrees can be larger than memory). That second step, mutating the object after it has been inserted with __setitem__, is beyond the uproot-methods mechanism.
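To make the two-step pattern concrete, here is a toy sketch in plain Python. `ToyFile` and `ToyTree` are hypothetical stand-ins, not uproot classes; they only illustrate "assignment declares the structure, extend fills it chunk by chunk".

```python
class ToyTree:
    """Hypothetical stand-in for a tree: fixed branches, growable contents."""

    def __init__(self, branch_types):
        self.branch_types = dict(branch_types)
        self.data = {name: [] for name in branch_types}

    def extend(self, chunk):
        # Every call must supply exactly the declared branches...
        if set(chunk) != set(self.data):
            raise ValueError("chunk must contain exactly the declared branches")
        # ...and all branches in one chunk must have the same number of entries.
        if len({len(values) for values in chunk.values()}) > 1:
            raise ValueError("all branches in a chunk must have the same length")
        for name, values in chunk.items():
            self.data[name].extend(values)


class ToyFile:
    """Hypothetical stand-in for a writable file."""

    def __init__(self):
        self._trees = {}

    def __setitem__(self, name, branch_types):
        # Step 1: assignment only declares the structure; no data yet.
        self._trees[name] = ToyTree(branch_types)

    def __getitem__(self, name):
        return self._trees[name]


f = ToyFile()
f["t"] = {"x": "float64", "n": "int32"}          # initialize the structure
f["t"].extend({"x": [1.0, 2.0], "n": [1, 2]})    # step 2: first chunk
f["t"].extend({"x": [3.0], "n": [3]})            # step 2 again: second chunk
```

Because the object stored by the assignment is mutated afterwards, a one-shot conversion (as used for histograms) would not cover the second step.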

TTrees are special, so they can be handled with special code in uproot. (I put all the histogram-handlers in uproot-methods because data analysis types get beyond uproot's mission of being only I/O.) I don't think it would be a bad separation of concerns to put this special-case check directly in TFile.__setitem__: https://github.com/scikit-hep/uproot/blob/163bf0ab0a5b9d16e7aee61b8ab19e0b0412a83d/uproot/write/TFile.py#L77-L101

Just be sure that you detect the DataFrame without forcing Pandas to be loaded. The user might not even have Pandas, and they wouldn't want it to be "accidentally" imported just to find out whether the thing on the right-hand side might be a DataFrame. (It might not.) You can use something like this to check the object's type non-invasively: https://github.com/scikit-hep/uproot/blob/163bf0ab0a5b9d16e7aee61b8ab19e0b0412a83d/uproot/tree.py#L119

It might sound like a bad idea to check an object's type by string, but in Python, that's essentially what any type check is. (When you import a module, that's a particular name in a global namespace.) It would be a problem if Pandas moves the internal location of DataFrame from "pandas.core.frame" to somewhere else, but we can deal with rare changes like this.
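A minimal sketch of such a string-based check, assuming DataFrame lives in "pandas.core.frame" as stated above. The function name `looks_like_dataframe` is hypothetical, and this is not necessarily how the linked uproot line does it; the point is only that nothing here imports Pandas.

```python
def looks_like_dataframe(obj):
    """Return True if obj is (a subclass of) pandas.core.frame.DataFrame.

    Only class names and module names are compared, so Pandas is never
    imported by this check.
    """
    return any(
        cls.__module__ == "pandas.core.frame" and cls.__name__ == "DataFrame"
        for cls in type(obj).__mro__
    )


# A fake class with the same module and name, so the check can be tried
# out even where Pandas is not installed.
FakeDataFrame = type("DataFrame", (), {"__module__": "pandas.core.frame"})

print(looks_like_dataframe(FakeDataFrame()))
print(looks_like_dataframe([1, 2, 3]))
```

Walking the MRO means user-defined subclasses of DataFrame are recognized as well; if Pandas ever moved the class, only the module string would need updating.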

Thanks!

@adamdddave
Author

Can I still commit here or should I push to uproot4 instead?

@jpivarski
Member

I'll still see it here. I've been moving other issues to uproot4 because that's where the new development is happening, but as for this one, file-writing hasn't started in uproot4 yet, and it will likely have a different interface (to try to learn from issues faced with this one), so comments on that are probably more relevant here than there.

@adamdddave
Author

Great. Will try to have the PR here by the end of the week
