
reduce simulation memory usage for long plans #1337

Draft: wants to merge 6 commits into develop

Conversation


martyvona commented Feb 17, 2024

This PR contains several adjustments to reduce memory usage for long plans. These were developed by analyzing memory dumps acquired while simulating a 5-year Clipper plan with about 7.5k activities. This plan was running out of memory in the default simulation configuration with the standard 32 GB Docker memory limit.

With the changes in this PR, combined with changes in a corresponding clipper-aerie PR, that plan now simulates successfully within a 32 GB Docker memory limit, I believe with at least a few GB to spare.

The changes in these PRs address several memory bottlenecks. In this specific Aerie PR those are:

  • The "trampolining" approach used for runtime performance in Clipper's ExtendedModelActions.spawn(RepeatingTask) appears to be a form of tail recursion, but it without tall call optimization, this was using a lot of memory. There are several possible ways to address this; here I'm proposing a relatively minimal way that I feel is practical though perhaps not as elegant as some other more complex possibilities.
  • The current implementation of simulation results builds up the entire history of all resource profiles in memory. While there is some intention that this will eventually be replaced by streaming the data incrementally in some form, for now this is a memory bottleneck. This PR adds some new SerializedValue implementations intended to help avoid using a large number of BigDecimal instances in this codepath. It also add an opt-in feature that allows resource profiles to use a form of run-length compression.
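
For illustration, here is a minimal, self-contained sketch of the first point. This is not the Clipper ExtendedModelActions code; the toy spawn() helper below just stands in for the engine's per-task bookkeeping, but it shows why re-spawning on every repetition accumulates one task record per repetition while looping inside a single task does not.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

public final class RepeatingTaskSketch {
  // Toy stand-in for the simulation engine's per-task bookkeeping; in a real
  // engine each spawned task carries a frame/history rather than a String.
  static final Deque<String> taskRecords = new ArrayDeque<>();

  static void spawn(String name, Runnable body) {
    taskRecords.push(name);  // retained for the lifetime of the simulation
    body.run();
  }

  // (a) "Trampolined" tail recursion: each repetition re-spawns itself, so
  // without tail call optimization every repetition leaves another record behind.
  static void repeatRecursive(int remaining, Duration period) {
    spawn("rep-" + remaining, () -> {
      if (remaining > 0) repeatRecursive(remaining - 1, period);
    });
  }

  // (b) Iterative form: one task record covers the whole repetition history.
  static void repeatIterative(int repetitions, Duration period) {
    spawn("repeater", () -> {
      for (int i = 0; i < repetitions; i++) { /* do one repetition */ }
    });
  }

  public static void main(String[] args) {
    repeatRecursive(1000, Duration.ofMinutes(5));
    System.out.println("recursive form: " + taskRecords.size() + " task records");
    taskRecords.clear();
    repeatIterative(1000, Duration.ofMinutes(5));
    System.out.println("iterative form: " + taskRecords.size() + " task record");
  }
}
```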
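And here is a minimal sketch of the run-length compression idea from the second point, assuming a profile is an ordered list of (duration, value) segments. The Segment record and field names are hypothetical, not the actual profile/SerializedValue API; the point is just that adjacent segments with equal values collapse into a single run, and that a primitive double avoids a boxed BigDecimal per sample.

```java
import java.util.ArrayList;
import java.util.List;

public final class ProfileRleSketch {
  // Hypothetical segment: a primitive double instead of a boxed BigDecimal.
  record Segment(long durationMicros, double value) {}

  // Merge adjacent segments carrying the same value into a single run.
  static List<Segment> compress(List<Segment> segments) {
    List<Segment> out = new ArrayList<>();
    for (Segment s : segments) {
      int last = out.size() - 1;
      if (last >= 0 && out.get(last).value() == s.value()) {
        out.set(last, new Segment(out.get(last).durationMicros() + s.durationMicros(), s.value()));
      } else {
        out.add(s);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Segment> raw = List.of(
        new Segment(60_000_000L, 0.0),
        new Segment(60_000_000L, 0.0),   // same value as previous: merged
        new Segment(30_000_000L, 1.5));
    System.out.println(compress(raw));   // 2 segments instead of 3
  }
}
```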

This is still a draft PR for the following reasons:

  1. I would still like to run some additional tests with Docker configured with memory limits that more closely mimic actual deployments.
  2. DONE: I didn't know about ./gradlew e2eTest; it looks like there are some regressions I will need to fix there.
  3. I still need to run @DavidLegg's compare_resources.py to make sure that there are no regressions in the simulation results. I suppose I will do this by running only some prefix of the test plan to stay within memory limits in the baseline configuration.


martyvona commented Mar 20, 2024

When I first worked on this about 6-8 weeks ago, I got it to simulate the full 5y plan in something like 24 GB peak heap. However, more recent tests are failing again; I was looking into that at the point where I was asked to stop working on it. One of the things that changed recently is that the eurc model is now doing a better job of validating and running GNC turn activities, whereas in previous versions it might have neither errored out nor actually simulated them because of things like exceeding acceleration limits. So the same 5y plan we have been working with now requires some tweaks to simulate at all, and once we make those, it uses a lot more memory than before because it's actually simulating more of the turns than it was before.

My latest result with that work was that the 5y plan simulates about halfway through and then the simulation fails: not because it ran out of heap, but because it hits some combination of activities at that point that causes a concurrent modification error. That said, I believe heap usage was already in the mid-20 GB range by that point, so even if we fix that concurrent modification error, I would be surprised if it could get much further without running out of memory.

Simulation failed. Response:
{'status': 'failed', 'reason': {'data': {'elapsedTime': '21307:04:36.000000', 'utcTimeDoy': '2027-122T07:04:36'}, 'message': '', 'timestamp': '2024-03-13T07:57:44.498576429Z.498576', 'trace': 'java.lang.UnsupportedOperationException: Conflicting concurrent effects on the same cell. Please disambiguate model to remove conflicting operations: SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(OFF)]] and SingletonEffect[effect=SetEffect[newDynamics=DiscreteDynamics(ON)]]
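
For context, here is a hypothetical, self-contained sketch of the failure mode the trace points at; it is not the eurc model code. Two effects that set the same discrete cell to different values land in the same causal step, and a plain "set" cell has no rule for merging them, so the engine has to give up.

```java
public final class ConflictingEffectsSketch {
  enum State { ON, OFF }

  // Toy discrete cell: applying two different concurrent "set" effects has no
  // well-defined result, mirroring the UnsupportedOperationException above.
  static final class DiscreteCell {
    private State value = State.OFF;

    void applyConcurrent(State a, State b) {
      if (a != b) {
        throw new UnsupportedOperationException(
            "Conflicting concurrent effects on the same cell: " + a + " vs " + b);
      }
      value = a;
    }
  }

  public static void main(String[] args) {
    // e.g. one activity commanding a device OFF while a concurrent activity
    // commands it ON at the same simulated instant, with no delay between them.
    new DiscreteCell().applyConcurrent(State.OFF, State.ON);
  }
}
```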

I was still working to understand what is using so much memory, but it looked like roughly equal parts (a) resource timelines (even a single segment of a scalar-valued numeric resource timeline consumes something like 70 bytes, iirc) and (b) simulation engine task frame histories.
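
To put that ~70-byte figure in perspective, here is a hedged back-of-envelope; only the 5-year duration and the ~70 bytes per segment come from the discussion above, and the once-per-minute change rate is a made-up assumption.

```java
public final class SegmentMemoryEstimate {
  public static void main(String[] args) {
    long segments = 5L * 365 * 24 * 60;   // a resource changing once per minute
                                          // for 5 years: ~2.6 million segments
    long bytes = segments * 70;           // ~175 MB of heap for one such resource
    System.out.printf("%d segments, ~%d MB%n", segments, bytes / (1024 * 1024));
  }
}
```

A few dozen resources with that kind of segment count would account for several GB on their own, which is consistent with resource timelines being roughly half of the problem.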
