Reduce file duplication in MadGraph #389

tomeichlersmith · 2023-05-25T18:03:40Z

Just reporting some statistics here - there is a lot of file duplication with the current method for storing MadGraph files for later running. MG5 is bad but MG4 is even worse, with some files being copied over 40 times. The tar-ball below has full listings of all the files and their md5sums, as well as a sorted list of unique md5sums, each with a corresponding example file. This does not handle duplicate code within files, but it is a start.

mg-unique-file-listing.tar.gz

How

calculate md5sum of each file¹

cd generators/madgraphN
fd -tf -x md5sum | sort > md5sum.list

get uniq files sorted by number of copies

uniq -c -w 32 md5sum.list | sort -nr > uniq.list
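A rough follow-up sketch for the listing above, assuming md5sum.list has the usual "<md5>  <path>" lines that md5sum prints. The helper name count_redundant_copies is made up for illustration and is not part of hps-mc or MadGraph. It counts how many files could in principle be dropped if every group of identical files were reduced to a single copy.

```shell
# Hypothetical helper: count files that are redundant copies of another file.
# $1: an md5sum listing such as md5sum.list (one "<md5>  <path>" per line).
count_redundant_copies() {
    awk '
        { seen[$1]++ }                 # tally how often each checksum appears
        END {
            extra = 0
            for (h in seen) extra += seen[h] - 1   # every copy beyond the first
            print extra
        }
    ' "$1"
}
```

Running count_redundant_copies md5sum.list inside generators/madgraphN would print a single number. It only says how many files are byte-identical duplicates, not whether they can safely be removed.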

¹ Using fd instead of find here since it's faster. The find equivalent is find -type f -exec md5sum {} ';' | sort > md5sum.list


JeremyMcCormick · 2023-06-21T21:58:19Z

@tomeichlersmith After looking at this a bit, what is your conclusion about how feasible it is for us to change how we are already doing things?

From what I remember, hps-mc copies some portion of the entire source tree into the run directory with the MG components. My main concern would be that we are possibly copying in a lot of extra files that we don't need, but I don't know whether this is the case or not.

What is the difference between how MG4 and MG5 handle all this? Is MG5 better in some way with less file duplication?

tomeichlersmith · 2023-06-22T16:36:26Z

I will answer back-to-front.

difference between MG4 and MG5

MG5 is indeed better than MG4 in terms of avoiding file copying - that was one of the major updates that led to the major version increase.

possibly copying in a lot of extra files that we don't need

We are almost certainly doing that. One of the issues is that some MG source files are used in one model and not used in another, so we'd need to check all of the different models we wish to support when attempting to delete any files.

feasibility

It's definitely feasible. There are several avenues of improvement but the big issue is time. Does anyone have time to do these things? Probably not...

One thing I've done in the past is run the program and then check which files were accessed/read. This at least eliminates files that aren't even opened by MG/ME during running.
Another avenue of improvement would be to abandon MG4 in favor of MG5. MG5 has better support for what I call "MadEvent workspaces", i.e. once you define a process you want to study, you can dump that process into a "MadEvent workspace" which can be run on its own. We would then only need to store the set of these ME workspaces, which would isolate all the models into their own subdirectories (this is already what idm and simp do).
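For the first avenue (checking which files are opened during a run), one possible approach on Linux is to record the run with strace and then pull the opened paths out of the log. This is only a sketch under assumptions: strace is one option among several and is not necessarily the method used before, opened_files is a made-up helper name, and the log is assumed to come from something like strace -f -e trace=open,openat -o trace.log followed by whatever command runs MG/ME.

```shell
# Hypothetical helper: print the unique paths opened successfully in a strace log.
# $1: log written by e.g.
#       strace -f -e trace=open,openat -o trace.log <command that runs MG/ME>
opened_files() {
    # keep only open()/openat() calls (with -f each line starts with a PID)
    grep -E '(^|[[:space:]])(open|openat)\(' "$1" \
      | grep -v -- '= -1 ' \
      | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p' \
      | sort -u
    # grep -v drops failed calls (strace reports them as "= -1 ERRNO ...");
    # sed extracts the first quoted argument, which is the path
}
```

The output could then be compared against a full file listing to find candidates that are never opened. Paths appear exactly as the program passed them (relative or absolute), so they may need normalizing first, and the check would have to be repeated for every model we want to support.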

JeremyMcCormick assigned tomeichlersmith and JeremyMcCormick May 29, 2023

JeremyMcCormick added the cleanup label May 29, 2023

JeremyMcCormick added this to the 2.1.0 milestone May 29, 2023
