Reduce file duplication in MadGraph #389

tomeichlersmith opened this issue May 25, 2023 · 2 comments

@tomeichlersmith
Collaborator

tomeichlersmith commented May 25, 2023

Just reporting some statistics here: there is a lot of duplicated code with the current method for storing MadGraph files for later running. MG5 is bad, but MG4 is even worse, with some files being copied over 40 times. The tarball below has full listings of all the files and their md5sum, as well as a sorted list of unique md5sums, each with a corresponding example file. This does not catch duplicated code within files, but it is a start.

mg-unique-file-listing.tar.gz

How

calculate md5sum of each file [1]

cd generators/madgraphN
fd -tf -x md5sum | sort > md5sum.list

get uniq files sorted by number of copies

uniq -c -w 32 md5sum.list | sort -nr > uniq.list
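
As a rough companion to the two commands above, the same grouping can be sketched in Python, which also makes it easy to estimate how many bytes the extra copies occupy. This is only an illustration and not part of the original analysis: the function names `md5_of`, `group_by_digest`, and `duplication_report` are made up here, and symlinks are skipped on the assumption that only regular files matter.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(root):
    """Map each MD5 digest to the sorted regular files under root with that content."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: only regular files count, symlinks are ignored.
            if os.path.isfile(path) and not os.path.islink(path):
                groups[md5_of(path)].append(path)
    return {digest: sorted(paths) for digest, paths in groups.items()}


def duplication_report(groups):
    """Return (copies, redundant_bytes, example_path) rows, most-copied first."""
    rows = []
    for paths in groups.values():
        size = os.path.getsize(paths[0])
        # Every copy beyond the first is counted as redundant.
        rows.append((len(paths), size * (len(paths) - 1), paths[0]))
    return sorted(rows, reverse=True)
```

Calling `duplication_report(group_by_digest("generators"))` from the repository root would give one row per unique file content, comparable to `uniq.list` but with a byte count attached.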

Footnotes

  1. using fd instead of find here since it's faster. The find equivalent is find . -type f -exec md5sum {} ';' | sort > md5sum.list

@JeremyMcCormick
Member

JeremyMcCormick commented Jun 21, 2023

@tomeichlersmith After looking at this a bit, what is your conclusion about how feasible it would be for us to change how we are currently doing things?

From what I remember, hps-mc copies some portion of the entire source tree into the run directory with the MG components. My main concern would be that we are possibly copying in a lot of extra files that we don't need, but I don't know whether this is the case or not.

What is the difference between how MG4 and MG5 handle all this? Is MG5 better in some way with less file duplication?

@tomeichlersmith
Collaborator Author

I will answer back-to-front.

difference between MG4 and MG5

MG5 is indeed better than MG4 in terms of avoiding file copying - that was one of the major updates that led to the major version increase.

possibly copying in a lot of extra files that we don't need

We are almost certainly doing that. One of the issues is that some MG source files are used in one model and not used in another, so we'd need to check all of the different models we wish to support when attempting to delete any files.
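
The bookkeeping this implies can be sketched as a set operation: a file is only a deletion candidate if no supported model needs it. The snippet below is a minimal illustration under that assumption. The function name `removable_files` and the shape of its inputs are hypothetical, and how the per-model "used" sets get collected is left open.

```python
def removable_files(all_files, used_by_model):
    """Return files that no supported model uses, sorted for stable output.

    all_files: iterable of paths in the stored MadGraph tree.
    used_by_model: dict mapping a model name to the collection of paths it needs.
    """
    needed = set()
    for used in used_by_model.values():
        # A file needed by any one model must be kept for all of them.
        needed |= set(used)
    return sorted(set(all_files) - needed)
```

For example, with hypothetical file names, `removable_files(["a.f", "b.f", "c.f"], {"model1": {"a.f"}, "model2": {"a.f", "b.f"}})` leaves only `c.f` as a candidate, since `b.f` is still needed by one model.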

feasibility

It's definitely feasible. There are several avenues of improvement but the big issue is time. Does anyone have time to do these things? Probably not...

  1. One thing I've done in the past is run the program and then check which files were accessed/read. This at least eliminates files that aren't even opened by MG/ME during running.
  2. Another avenue of improvement would be to abandon MG4 in favor of MG5. MG5 has better support for what I call "MadEvent workspaces", i.e. once you define a process you want to study, you can dump that process into a "MadEvent workspace" which can be run on its own. We would then only need to store the set of these ME workspaces, which would isolate all the models into their own subdirectories (this is already what idm and simp do).
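
For the first item, one possible way to check which files were read is to run the job under `strace` (for example `strace -f -e trace=open,openat -o trace.log <command>`) and parse the log afterwards. The thread does not say which tool was actually used, so treat this as an assumption. The sketch below uses a hypothetical helper `opened_files` that collects successfully opened paths under a given root. It resolves relative paths against a single working directory and ignores calls that strace splits into "unfinished" and "resumed" lines, so it is an approximation at best.

```python
import os
import re

# Matches open("path", ...) = N and openat(DIRFD, "path", ...) = N in strace output.
OPEN_CALL = re.compile(
    r'\bopen(?:at)?\((?:[^,"]+,\s*)?"((?:[^"\\]|\\.)*)".*\)\s*=\s*(-?\d+)'
)


def opened_files(trace_lines, root, cwd=None):
    """Return root-relative paths of files under root that were opened successfully.

    trace_lines: iterable of lines from a strace log.
    root: top of the stored MadGraph tree.
    cwd: directory the traced program ran in (assumed fixed; defaults to root).
    """
    root = os.path.abspath(root)
    cwd = os.path.abspath(cwd or root)
    opened = set()
    for line in trace_lines:
        match = OPEN_CALL.search(line)
        # Skip unrelated lines and failed opens (negative return value).
        if match is None or int(match.group(2)) < 0:
            continue
        path = os.path.normpath(os.path.join(cwd, match.group(1)))
        if os.path.commonpath([root, path]) == root:
            opened.add(os.path.relpath(path, root))
    return opened
```

Subtracting the result from the full file listing gives the files that were never opened in that run. Repeating the run for every supported model before deleting anything would still be necessary.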
