-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
podio::DataSource #309
podio::DataSource #309
Conversation
This will be very nice to have, thanks! I'll take a closer look this week. For the naming, I'd suggest "DataSource" instead of "Source". This should also eventually move to the edm4hep repository, right? |
Thanks for the suggestion I renamed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. I have started to have a look and I have made some more or less detailed comments below. I also have a few questions about how this is supposed to work in the end for RDataFrame, and about some of the concepts there.
First a conceptual one: Does a slot
in RDataFrame correspond to a thread? Following up on that, IIUC the basic idea here is to have one reader per slot / thread and then also one Frame per slot / thread? I am not entirely sure how much overhead this is memory wise, but it might be possible to only have one reader but multiple Frames. However, I am not entirely sure how easy it would be to make this work with the RDF infrastructure.
Then one from the usage side. In the end this would be used as free functions that get "proper" edm4hep collections as arguments and return new (subset) collections? e.g.
df.Define("reco_pt", "getPt(ReconstructedParticles)")
or would the idea be to have them as sort of "member functions" in the end? (Is that even possible?)
Then finally one that already goes beyond this PR, but sort of fits anyhow: Can the Snapshot mechanism be made to produce EDM4hep output again from this?
EDM4hepDataSource(const std::vector<std::string>& filePathList, | ||
int nEvents = -1); | ||
|
||
void SetNSlots(unsigned int nSlots); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the virtual functions here be marked with either final
or override
, just so that the compiler has a chance to detect mismatches here?
std::vector<std::unique_ptr<podio::ROOTFrameReader>> m_podioReaders; | ||
/// Podio frames | ||
std::vector<std::unique_ptr<podio::Frame>> m_frames; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how RDataFrame handles this internally and which parts need to be shared over several threads, and which are managed by RDataFrame and we basically can assume that it will only be accessed from only one thread.
From a podio point of view:
ROOTFrameReader
/ROOTLegacyReader
: no internal thread safety at all, i.e. either make sure that each thread has it's own reader or put a lock / mutex around it forreadEntry
/readNextEntry
.Frame
: fully thread safe. Access from as many thread as you want, including putting new things into it. With the caveat that this uses a lock / mutex internally, so there could be some lock congestion if used from very many threads
e4hsource/src/EDM4hepDataSource.cxx
Outdated
if (std::find(m_columnNames.begin(), | ||
m_columnNames.end(), | ||
columnName) != m_columnNames.end()) { | ||
return true; | ||
} | ||
|
||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (std::find(m_columnNames.begin(), | |
m_columnNames.end(), | |
columnName) != m_columnNames.end()) { | |
return true; | |
} | |
return false; | |
return std::find(m_columnNames.begin(), | |
m_columnNames.end(), | |
columnName) != m_columnNames.end(); |
Could ditch the if
here and simply return the result directly.
e4hsource/src/EDM4hepDataSource.cxx
Outdated
<< m_columnTypes.at(i) << std::endl; | ||
|
||
return m_columnTypes.at(i) + "Collection"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<< m_columnTypes.at(i) << std::endl; | |
return m_columnTypes.at(i) + "Collection"; | |
<< m_columnTypes[i] << std::endl; | |
return m_columnTypes[i] + "Collection"; |
No need to do another range-check here, since we have already done that just above.
ROOT::VecOps::RVec<float> | ||
getPt(const edm4hep::ReconstructedParticleCollection& inParticles); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is most likely beyond the scope if this pull request, but there is probably quite a bit of (kinematic) functionality that could be templated on the collection type and then simply forwarded to other utility functionality, e.g. the one that already comes with EDM4hep
struct selPDG { | ||
selPDG(const int pdgID, const bool chargeConjugateAllowed = false); | ||
const int m_pdg; | ||
const bool m_chargeConjugateAllowed; | ||
edm4hep::ReconstructedParticleCollection operator() ( | ||
const edm4hep::MCRecoParticleAssociationCollection& inAssocColl); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this is most likely beyond the scope if this PR, but this also looks general enough to be templated (but I a likely missing a detail here).
Can we split the DataSource itself off of this and get it into edm4hep directly? I tested it a bit and it looked ready to use? |
If putting the datasource into EDM4hep would be (more or less) easily possible, I would also be in favor of having it there. That should make maintenance a bit easier and we would see earlier if we break it when we do changes in EDM4hep. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only had a quick look at some of the files and made a few smaller comments.
Overall, what is the (longer term) plan for FCCAnalysis for switching to the podio datasource? This PR suggests that at least for some time the "old way" and the "new way" will coexist for some time. Since EDM4hep has changed enough such that newer files cannot be easily read with older versions of FCCAnalysis, I am wondering whether it makes sense to have the two things in parallel, or whether it would be easier to simply make a clean cut and move everything to the data source way.
Some of the functionality might also be general enough for it to be moved to EDM4hep? E.g. getting the pt (or four momentum) for any datatype that has a momentum (and at least a mass or energy).
CMakeLists.txt
Outdated
set(WITH_SOURCE ON CACHE STRING "Enable PODIO Data Source") | ||
set_property(CACHE WITH_SOURCE PROPERTY STRINGS ON OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set(WITH_SOURCE ON CACHE STRING "Enable PODIO Data Source") | |
set_property(CACHE WITH_SOURCE PROPERTY STRINGS ON OFF) | |
set(WITH_PODIO_DATASOURCE ON CACHE STRING "Enable PODIO Data Source") | |
set_property(CACHE WITH_PODIO_DATASOURCE PROPERTY STRINGS ON OFF) |
Just to give this a slightly more discernible name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, used WITH_PODIO_DATASOURCE
template <typename T> | ||
T sel_PDG::operator() (const T& inAssocColl) { | ||
T result; | ||
result.setSubsetCollection(); | ||
|
||
for (const auto& assoc: inAssocColl) { | ||
const auto& particle = assoc.getSim(); | ||
if (particle.getPDG() == m_pdg) { | ||
result.push_back(assoc); | ||
} | ||
} | ||
|
||
return result; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definition needs to go to the header file, otherwise we will most likely run into linker errors. Unless, there are explicit template instantiations that I am missing somewhere. Similar for others cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved also with other ones.
* with the MC particle with the desired PDG ID. | ||
*/ | ||
edm4hep::ReconstructedParticleCollection operator() ( | ||
const edm4hep::MCRecoParticleAssociationCollection& inAssocColl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const edm4hep::MCRecoParticleAssociationCollection& inAssocColl); | |
const edm4hep::RecoMCParticleLinkCollection& inAssocColl); |
The Association
types are deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything was converted into Links.
lVec.SetXYZM(particle.getMomentum().x, | ||
particle.getMomentum().y, | ||
particle.getMomentum().z, | ||
particle.getMass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be done via edm4hep::utils::p4(particle, edm4hep::utils::UseMass)
from the edm4hep/utils/kinematics.h
header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the hint, used...
return std::sqrt(std::pow(p.getMomentum().x, 2) + | ||
std::pow(p.getMomentum().y, 2)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar here edm4hep::utils::pt
can also do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Thanks, this was just work in progress commit...
I would like to ditch "old way" at some point, but I'm not sure we can make clean cut quickly. There is a lot of analysis code which uses the "old way" and also the performance of the datasource way needs to be better understood.
Upstreaming some of the functionalities to edm4hep is the good point. How would you handle generating of those functions? With C++ templating or something like Podio(jinja2)? |
I think unfortunately the performance will always be worse. Podio gives us an AoS style access in memory while when we access the root file directly we have SoA. I.e. using the podio datasource will potentially always result in reading also unneeded data. Unless I am confusing something... Maybe @m-fila wants to comment? :) |
Good point. Didn't think of that.
There are a few things to consider. The major difference is that the podio data source currently has no way to only read selective branches / collections only. It will always read the full event. On top of that reconstructing the relations takes a bit of time. At some point that might be an acceptable overhead, but how large the analysis has to be for that is unclear to me. There are also almost certainly some optimizations that can be (or are already) done if the podio middleman is cut out of the whole thing. The memory layout is another thing, but that might even be something we can change in podio, because the interface should give us enough abstraction to switch the implementation below from AoS to SoA, but this has to be investigated further. |
DataSource providing EDM4hep collections. This allows writing analyzers using EDM4hep collections directly.
Example:
Two implementations are provided, frame and legacy.
To run the analysis using these analyzers
--use-source
or--use-legacy-source
.Implementation: The class keeps three vectors
m_Collections
,m_podioReaders
andm_frames
.Thread safety: questionable