
Improve engine performance on ancestry traversal #1174

Merged

Conversation

Twisol
Copy link
Contributor

@Twisol Twisol commented Oct 2, 2023

  • Tickets addressed: N/A
  • Review: By commit
  • Merge strategy: Merge (no squash)

Description

This PR attempts to address the needs of #1062 by improving the efficiency of existing capabilities rather than introducing a new capability. In particular, the "spawn-based trampolining" approach to reducing the replay overhead of long-lived replaying tasks was not viable because it produced an ever-growing ancestry lineage that blew up memory usage (and would have blown up results computation time if the simulation got that far). This PR enables the engine to prune any ancestry information that is neither relevant to (1) computing task span information nor (2) determining when a parent task should resume after a called child task completes.

In the course of making this change, the span (duration) information for each task has been factored out into its own concept; moreover, child tasks will by default contribute to their parent's span rather than constructing a new span. This makes tasks anonymous and purely internal by default. Instead, a task can now enter a new nested "scope of work", which is associated with a new span. The activity mappers generated by the Merlin framework now automatically enter and exit from these scopes, ensuring that every activity gets its own span, since the intended semantics of activities is that their scopes of work are meaningfully distinguished from any activities they were delegated from.
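The default behavior described above — anonymous children contributing to their parent's span unless they enter a new scope of work — can be sketched with a toy model. (Span, spawnAnonymous, and enterScope are illustrative names for this sketch, not Aerie's actual API.)

```java
import java.util.ArrayList;
import java.util.List;

public class SpanSketch {
    static final class Span {
        final String label;
        final List<Span> children = new ArrayList<>();
        Span(String label) { this.label = label; }
    }

    // An anonymous child task contributes to the parent's span: no new span.
    static Span spawnAnonymous(Span current) { return current; }

    // Entering a "scope of work" allocates a nested span under the current one.
    static Span enterScope(Span current, String label) {
        var child = new Span(label);
        current.children.add(child);
        return child;
    }

    public static void main(String[] args) {
        var root = new Span("activity:Observe");
        var same = spawnAnonymous(root);        // no new span created
        var setup = enterScope(root, "setup");  // new nested span
        assert same == root;
        assert root.children.size() == 1 && root.children.get(0) == setup;
    }
}
```

Under this model, the activity mappers' automatic enter/exit corresponds to wrapping each activity's task in an enterScope/exit pair, which is what guarantees every activity its own span.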

Verification

  • This PR has not yet been verified. I would like to at least run a reasonably-complex plan (on a reasonably-complex model) against both this branch and develop to smoke out any differences. Other ideas for verification would be welcome -- the engine is not especially friendly to unit testing at this time...

Documentation

Other than documenting new capabilities, no existing capabilities have changed, so long as models built prior to this change are recompiled for the new version of the framework. (The responsibility for entering new spans has effectively shifted from the engine to the framework.)

  • The new ability for a model to enter new spans needs to be documented.

Future work

  • Currently, the results computation filters out any spans that are not associated to activities. This seems overly restrictive: if a model wants to announce a stretch of work that is meaningfully distinguished from the surrounding work (e.g. an activity that is organized into setup, real work, and cleanup), that hierarchy should be communicated to the planner, regardless of whether it's implemented as a directive or as a mere sub-scope. A directive becomes simply a modeling pattern yielding a new span together with some special data (input and output attributes) associated to the span.

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

@Twisol , I talked to either @mattdailis , @DavidLegg , or both about how this would affect incremental simulation. The incremental sim implementation needs to be able to re-run tasks, and the appropriate parent needs to be re-run to make sure the appropriate context for the child is not lost. So, for example, if resource R has changed, and child task C from the previous simulation reads R, we need to re-run C in the appropriate context. The current code looks up the parent activity or daemon task P that spawned C and re-executes P. So, the question is: given these code changes, can incremental sim either record (in cell read history) or look up (in a past simulation engine) an appropriate activity/daemon to re-execute in the right context?

The re-execution of P could be expensive in incremental simulation if P has many task children. An optimization on re-execution could be saving away lambda functions for some or all of the tasks. Doing this for every cell read would probably use more memory than that saved by the changes here. Thoughts?

@Twisol
Copy link
Contributor Author

Twisol commented Oct 3, 2023

Hi Brad!

the appropriate parent needs to be re-run to make sure the appropriate context for the child is not lost.

I'm not entirely certain this is true. When a child is spawned, we are given a TaskFactory that lets us, in principle, create that same child task as many times as we want. We already make the assumption that any state held by the parent task that is observed by the child task is constant, and likewise that the child cannot mutate any state held by the parent; this follows from our general requirements on mutable state in a simulation. So we should be able to make the assumption that, if we need to re-run a child task, we only need to restart it from the TaskFactory that was provided for it. Any state a child lambda may reference in the parent scope is already closed over by the Java lambda -- it continues to exist as long as the lambda does.
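The claim above — that a factory plus Java's closure capture is enough to restart a child from scratch — can be sketched with a toy stand-in. (TaskFactory and childFactory here are illustrative, not Aerie's actual types.)

```java
import java.util.function.Supplier;

public class FactoryReplaySketch {
    // Minimal stand-in for the idea of a task factory: it can create
    // the "same" child task as many times as we want.
    interface TaskFactory<T> { Supplier<T> create(); }

    static TaskFactory<Integer> childFactory(final int parentState) {
        // parentState is closed over by the lambda; it lives as long as the
        // factory does, independent of the parent task that created it.
        return () -> () -> parentState * 2;
    }

    public static void main(String[] args) {
        var factory = childFactory(21);
        // "Re-running" the child is just invoking the factory again; both
        // runs observe the same captured parent state.
        int first = factory.create().get();
        int rerun = factory.create().get();
        assert first == rerun;
        System.out.println(first + " == " + rerun);
    }
}
```

The immutability assumptions mentioned above are what make this safe: since the child cannot mutate parent state and the parent state it observes is constant, the captured values are all the context a restart needs.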

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

Are there TaskFactorys for non-activity tasks, too? I thought that was only for activity tasks.

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

Are there TaskFactorys for non-activity tasks, too? I thought that was only for activity tasks.

So, I see this:

  public static <T> TaskFactory<T> threaded(final Supplier<T> task) {
    return executor -> new ThreadedTask<>(executor, ModelActions.context, task);
  }

But, can I get from a TaskId (or SpanId?) to a TaskFactory like I can with an ActivityId?

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

Saving the TaskFactory is basically the same as saving the lambda. I fear these could be very large in comparison to ids for Clipper, for example.

@Twisol
Copy link
Contributor Author

Twisol commented Oct 4, 2023

Are there TaskFactorys for non-activity tasks, too? I thought that was only for activity tasks.

Yes, every task is given by a task factory. Both the spawn method used by tasks to spawn children, as well as the TaskStatus#CallingTask status used by tasks to call (block on) children, accept a TaskFactory. This design was put in place very early, specifically to allow us to restart tasks for incremental resimulation.

So, I see this: [...] But, can I get from a TaskId (or SpanId?) to a TaskFactory like I can with an ActivityId?

I'm not confident I understand how the code you've quoted relates to your question. At the moment, the engine throws away task factories after they're used once; holding on to them for longer (and associating them with task IDs) would definitely be necessary for incremental resimulation. The current engine is not incremental, so it doesn't keep factories around longer than it needs them.

The alternative is, as you mentioned, re-running tasks up to the point where the desired task was spawned. As you (also) mentioned, that approach itself comes with tradeoffs -- mostly trading space for time, I think. The major issue I see is that, for daemon tasks, we might need to step the task forward many, many times to get back to the point the daemon was at, whether it's using a while loop or is trampolined by spawn. At least with spawn-based trampolining, the model can cache a replaying() TaskFactory once and use it for every subsequent spawn; we can then detect that and avoid creating and storing a new task factory for every iteration.
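The caching point above can be made concrete with a toy illustration: if the model reuses one cached factory instance for every trampolined spawn, a hypothetical engine that tracks factories by identity stores only one entry regardless of iteration count. (This is a sketch of the idea, not Aerie's engine.)

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.function.Supplier;

public class TrampolineSketch {
    interface TaskFactory<T> { Supplier<T> create(); }

    public static void main(String[] args) {
        // The model caches one replaying factory up front...
        TaskFactory<String> daemon = () -> () -> "tick";

        // ...and an engine tracking factories by identity stores only one
        // entry, no matter how many iterations trampoline through it.
        Set<TaskFactory<?>> stored =
            Collections.newSetFromMap(new IdentityHashMap<>());
        for (int i = 0; i < 1000; i++) {
            stored.add(daemon); // each spawn passes the same cached instance
        }
        assert stored.size() == 1;
        System.out.println(stored.size() + " factory stored for 1000 spawns");
    }
}
```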

Saving the TaskFactory is basically the same as saving the lambda. I fear these could be very large in comparison to ids for Clipper, for example.

That depends on how much data the lambda references from outside its scope. Java (as with most languages) performs closure conversion on lambdas, meaning that the code of the lambda itself is only defined once ever. When a new instance of that lambda is created, its constructor accepts any data the lambda needs to reference from the caller's scope. So, the amount of data the lambda pins in memory long-term is going to depend precisely on how many things in the caller's scope the lambda depends on.

This is exactly the same situation as with activities: an activity contains a number of fields, and the activity's run method reads those fields. As long as the activity instance lives, so too do its fields. So you can think of closure conversion as a kind of "promotion" to anonymous activity types.
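The analogy above can be made concrete: a captured lambda and an explicit activity-like object pin exactly the same data. (GrowActivity is a made-up example type for this sketch.)

```java
import java.util.function.Supplier;

public class ClosureConversionSketch {
    // An activity-style type: its inputs are held explicitly as fields.
    record GrowActivity(double rate, double hours) {
        double run() { return rate * hours; }
    }

    public static void main(String[] args) {
        double rate = 2.5, hours = 8.0;
        // Closure conversion: javac generates a hidden class whose
        // constructor takes (rate, hours) -- structurally the same as the
        // record above, just anonymous.
        Supplier<Double> anonymous = () -> rate * hours;
        assert anonymous.get() == new GrowActivity(rate, hours).run();
    }
}
```

So the memory cost of saving a lambda is the cost of its captured fields, exactly as saving an activity instance costs its fields.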

@mattdailis mattdailis added the simulation Anything related to the simulation domain label Oct 6, 2023
@DavidLegg DavidLegg mentioned this pull request Oct 24, 2023
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 14d5fab to 86cf706 Compare February 1, 2024 20:45
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 86cf706 to 01c470d Compare March 1, 2024 14:32
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 01c470d to 3fc628d Compare March 1, 2024 15:30
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 3fc628d to 8edb2fd Compare March 19, 2024 18:19
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 8edb2fd to 230edf8 Compare March 19, 2024 19:45
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 230edf8 to d201218 Compare March 21, 2024 19:27
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from d201218 to 536a764 Compare March 27, 2024 03:19
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 536a764 to c3244b8 Compare March 27, 2024 03:21
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from c3244b8 to 65494ad Compare March 27, 2024 03:33
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 99bae03 to e5ae0f5 Compare March 28, 2024 19:33
@mattdailis mattdailis merged commit 53643b3 into NASA-AMMOS:develop Mar 28, 2024
5 of 6 checks passed
@Twisol Twisol deleted the feature/castello--engine-efficiency branch March 28, 2024 22:57