
reduce simulation memory usage for long plans #1337

Draft: wants to merge 6 commits into develop

Conversation


martyvona commented Feb 17, 2024

This PR contains several adjustments to reduce memory usage for long plans. These were developed by analyzing memory dumps acquired while simulating a 5-year Clipper plan with about 7.5k activities. This plan was running out of memory in the default simulation configuration with the standard 32 GB Docker memory limit.

With the changes in this PR, combined with changes in a corresponding clipper-aerie PR, that plan now simulates successfully within a 32 GB Docker memory limit, I believe with at least a few GB to spare.

The changes in these PRs address several memory bottlenecks. In this specific Aerie PR those are:

  • The "trampolining" approach used for runtime performance in Clipper's ExtendedModelActions.spawn(RepeatingTask) appears to be a form of tail recursion, but it without tall call optimization, this was using a lot of memory. There are several possible ways to address this; here I'm proposing a relatively minimal way that I feel is practical though perhaps not as elegant as some other more complex possibilities.
  • The current implementation of simulation results builds up the entire history of all resource profiles in memory. While there is some intention that this will eventually be replaced by streaming the data incrementally in some form, for now this is a memory bottleneck. This PR adds some new SerializedValue implementations intended to help avoid using a large number of BigDecimal instances in this codepath. It also add an opt-in feature that allows resource profiles to use a form of run-length compression.
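
For illustration, here is a minimal, self-contained sketch of the first point. This is not the Clipper ExtendedModelActions code; the toy spawn() helper below just stands in for the engine's per-task bookkeeping, but it shows why re-spawning on every repetition accumulates one task record per repetition while looping inside a single task does not.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

public final class RepeatingTaskSketch {
  // Toy stand-in for the simulation engine's per-task bookkeeping; in a real
  // engine each spawned task carries a frame/history rather than a String.
  static final Deque<String> taskRecords = new ArrayDeque<>();

  static void spawn(String name, Runnable body) {
    taskRecords.push(name);  // retained for the lifetime of the simulation
    body.run();
  }

  // (a) "Trampolined" tail recursion: each repetition re-spawns itself, so
  // without tail call optimization every repetition leaves another record behind.
  static void repeatRecursive(int remaining, Duration period) {
    spawn("rep-" + remaining, () -> {
      if (remaining > 0) repeatRecursive(remaining - 1, period);
    });
  }

  // (b) Iterative form: one task record covers the whole repetition history.
  static void repeatIterative(int repetitions, Duration period) {
    spawn("repeater", () -> {
      for (int i = 0; i < repetitions; i++) { /* do one repetition */ }
    });
  }

  public static void main(String[] args) {
    repeatRecursive(1000, Duration.ofMinutes(5));
    System.out.println("recursive form: " + taskRecords.size() + " task records");
    taskRecords.clear();
    repeatIterative(1000, Duration.ofMinutes(5));
    System.out.println("iterative form: " + taskRecords.size() + " task record");
  }
}
```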
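And here is a minimal sketch of the run-length compression idea from the second point, assuming a profile is an ordered list of (duration, value) segments. The Segment record and field names are hypothetical, not the actual profile/SerializedValue API; the point is just that adjacent segments with equal values collapse into a single run, and that a primitive double avoids a boxed BigDecimal per sample.

```java
import java.util.ArrayList;
import java.util.List;

public final class ProfileRleSketch {
  // Hypothetical segment: a primitive double instead of a boxed BigDecimal.
  record Segment(long durationMicros, double value) {}

  // Merge adjacent segments carrying the same value into a single run.
  static List<Segment> compress(List<Segment> segments) {
    List<Segment> out = new ArrayList<>();
    for (Segment s : segments) {
      int last = out.size() - 1;
      if (last >= 0 && out.get(last).value() == s.value()) {
        out.set(last, new Segment(out.get(last).durationMicros() + s.durationMicros(), s.value()));
      } else {
        out.add(s);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Segment> raw = List.of(
        new Segment(60_000_000L, 0.0),
        new Segment(60_000_000L, 0.0),   // same value as previous: merged
        new Segment(30_000_000L, 1.5));
    System.out.println(compress(raw));   // 2 segments instead of 3
  }
}
```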

This is still a draft PR for the following reasons:

  1. I would still like to run some additional tests with Docker configured with memory limits that more closely mimic actual deployments.
  2. DONE: I didn't know about ./gradlew e2eTest; it looks like there are some regressions I will need to fix there.
  3. I still need to run @DavidLegg's compare_resources.py to make sure that there are no regressions in the simulation results. I suppose I will do this by running only some prefix of the test plan to stay within memory limits in the baseline configuration.


martyvona commented Mar 20, 2024

When I first worked on this about 6-8 weeks ago, I got it to simulate the full 5y plan in something like 24 GB peak heap. However, more recent tests are failing again; I was looking into that at the point where I was asked to stop working on it. One of the things that changed recently is that the eurc model is now doing a better job of validating and running GNC turn activities, whereas in previous versions it might have neither errored out nor actually simulated them because of things like exceeding acceleration limits. So the same 5y plan we have been working with now requires some tweaks to simulate at all, and once we make those, it uses a lot more memory than before because it's actually simulating more of the turns than it was before.

My latest result with that work was that the 5y plan simulates about halfway through and then the simulation fails: not because it ran out of heap, but because it hits some combination of activities at that point that causes a concurrent modification error. That said, I believe heap usage was already in the mid-20 GB range by that point, so even if we fix that concurrent modification error, I would be surprised if it could get much further without running out of memory.

Simulation failed. Response:
{'status': 'failed', 'reason': {'data': {'elapsedTime': '21307:04:36.000000', 'utcTimeDoy': '2027-122T07:04:36'}, 'message': '', 'timestamp': '2024-03-13T07:57:44.498576429Z.498576', 'trace': 'java.lang.UnsupportedOperationException: Conflicting concurrent effects on the same cell. Please disambiguate model to remove conflicting operations: SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(OFF)]] and SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(ON)]]
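
For context, here is a hypothetical, self-contained sketch of the failure mode the trace points at; it is not the eurc model code. Two effects that set the same discrete cell to different values land in the same causal step, and a plain "set" cell has no rule for merging them, so the engine has to give up.

```java
public final class ConflictingEffectsSketch {
  enum State { ON, OFF }

  // Toy discrete cell: applying two different concurrent "set" effects has no
  // well-defined result, mirroring the UnsupportedOperationException above.
  static final class DiscreteCell {
    private State value = State.OFF;

    void applyConcurrent(State a, State b) {
      if (a != b) {
        throw new UnsupportedOperationException(
            "Conflicting concurrent effects on the same cell: " + a + " vs " + b);
      }
      value = a;
    }
  }

  public static void main(String[] args) {
    // e.g. one activity commanding a device OFF while a concurrent activity
    // commands it ON at the same simulated instant, with no delay between them.
    new DiscreteCell().applyConcurrent(State.OFF, State.ON);
  }
}
```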

I was still working to understand what is using so much memory, but it looked like roughly equal parts (a) resource timelines (even a single segment of a scalar-valued numeric resource timeline consumes something like 70 bytes, iirc) and (b) simulation engine task frame histories.
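
To put that ~70-byte figure in perspective, here is a hedged back-of-envelope; only the 5-year duration and the ~70 bytes per segment come from the discussion above, and the once-per-minute change rate is a made-up assumption.

```java
public final class SegmentMemoryEstimate {
  public static void main(String[] args) {
    long segments = 5L * 365 * 24 * 60;   // a resource changing once per minute
                                          // for 5 years: ~2.6 million segments
    long bytes = segments * 70;           // ~175 MB of heap for one such resource
    System.out.printf("%d segments, ~%d MB%n", segments, bytes / (1024 * 1024));
  }
}
```

A few dozen resources with that kind of segment count would account for several GB on their own, which is consistent with resource timelines being roughly half of the problem.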
