
Improve engine performance on ancestry traversal #1174

Merged

Conversation

Twisol
Copy link
Contributor

@Twisol Twisol commented Oct 2, 2023

  • Tickets addressed: N/A
  • Review: By commit
  • Merge strategy: Merge (no squash)

Description

This PR attempts to address the needs of #1062 by improving the efficiency of existing capabilities rather than introducing a new capability. In particular, the "spawn-based trampolining" approach to reducing the replay overhead of long-lived replaying tasks was not viable because it produced an ever-growing ancestry lineage that blew up memory usage (and would have blown up results computation time if the simulation got that far). This PR enables the engine to prune any ancestry information that is neither relevant to (1) computing task span information nor (2) determining when a parent task should resume after a called child task completes.

In the course of making this change, the span (duration) information for each task has been factored out into its own concept; moreover, child tasks will by default contribute to their parent's span rather than constructing a new span. This makes tasks anonymous and purely internal by default. Instead, a task can now enter a new nested "scope of work", which is associated with a new span. The activity mappers generated by the Merlin framework now automatically enter and exit from these scopes, ensuring that every activity gets its own span, since the intended semantics of activities is that their scopes of work are meaningfully distinguished from any activities they were delegated from.
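The default behavior described above — anonymous children contributing to their parent's span unless they enter a new scope of work — can be sketched with a toy model. (Span, spawnAnonymous, and enterScope are illustrative names for this sketch, not Aerie's actual API.)

```java
import java.util.ArrayList;
import java.util.List;

public class SpanSketch {
    static final class Span {
        final String label;
        final List<Span> children = new ArrayList<>();
        Span(String label) { this.label = label; }
    }

    // An anonymous child task contributes to the parent's span: no new span.
    static Span spawnAnonymous(Span current) { return current; }

    // Entering a "scope of work" allocates a nested span under the current one.
    static Span enterScope(Span current, String label) {
        var child = new Span(label);
        current.children.add(child);
        return child;
    }

    public static void main(String[] args) {
        var root = new Span("activity:Observe");
        var same = spawnAnonymous(root);        // no new span created
        var setup = enterScope(root, "setup");  // new nested span
        assert same == root;
        assert root.children.size() == 1 && root.children.get(0) == setup;
    }
}
```

Under this model, the activity mappers' automatic enter/exit corresponds to wrapping each activity's task in an enterScope/exit pair, which is what guarantees every activity its own span.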

Verification

  • This PR has not yet been verified. I would like to at least run a reasonably-complex plan (on a reasonably-complex model) against both this branch and develop to smoke out any differences. Other ideas for verification would be welcome -- the engine is not especially friendly to unit testing at this time...

Documentation

Other than documenting new capabilities, no existing capabilities have changed, so long as models built prior to this change are recompiled for the new version of the framework. (The responsibility for entering new spans has effectively shifted from the engine to the framework.)

  • The new ability for a model to enter new spans needs to be documented.

Future work

  • Currently, the results computation filters out any spans that are not associated to activities. This seems overly restrictive: if a model wants to announce a stretch of work that is meaningfully distinguished from the surrounding work (e.g. an activity that is organized into setup, real work, and cleanup), that hierarchy should be communicated to the planner, regardless of whether it's implemented as a directive or as a mere sub-scope. A directive becomes simply a modeling pattern yielding a new span together with some special data (input and output attributes) associated to the span.

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

@Twisol , I talked to either @mattdailis , @DavidLegg , or both about how this would affect incremental simulation. The incremental sim implementation needs to be able to re-run tasks, and the appropriate parent needs to be re-run to make sure the appropriate context for the child is not lost. So, for example, if resource R has changed, and child task C from the previous simulation reads R, we need to re-run C in the appropriate context. The current code looks up the parent activity or daemon task P that spawned C and re-executes P. So, the question is: given these code changes, can incremental sim either record (in cell read history) or look up (in a past simulation engine) an appropriate activity/daemon to re-execute in the right context?

The re-execution of P could be expensive in incremental simulation if P has many task children. An optimization on re-execution could be saving away lambda functions for some or all of the tasks. Doing this for every cell read would probably use more memory than that saved by the changes here. Thoughts?

@Twisol
Copy link
Contributor Author

Twisol commented Oct 3, 2023

Hi Brad!

the appropriate parent needs to be re-run to make sure the appropriate context for the child is not lost.

I'm not entirely certain this is true. When a child is spawned, we are given a TaskFactory that lets us, in principle, create that same child task as many times as we want. We already make the assumption that any state held by the parent task that is observed by the child task is constant, and likewise that the child cannot mutate any state held by the parent; this follows from our general requirements on mutable state in a simulation. So we should be able to make the assumption that, if we need to re-run a child task, we only need to restart it from the TaskFactory that was provided for it. Any state a child lambda may reference in the parent scope is already closed over by the Java lambda -- it continues to exist as long as the lambda does.
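The claim above — that a factory plus Java's closure capture is enough to restart a child from scratch — can be sketched with a toy stand-in. (TaskFactory and childFactory here are illustrative, not Aerie's actual types.)

```java
import java.util.function.Supplier;

public class FactoryReplaySketch {
    // Minimal stand-in for the idea of a task factory: it can create
    // the "same" child task as many times as we want.
    interface TaskFactory<T> { Supplier<T> create(); }

    static TaskFactory<Integer> childFactory(final int parentState) {
        // parentState is closed over by the lambda; it lives as long as the
        // factory does, independent of the parent task that created it.
        return () -> () -> parentState * 2;
    }

    public static void main(String[] args) {
        var factory = childFactory(21);
        // "Re-running" the child is just invoking the factory again; both
        // runs observe the same captured parent state.
        int first = factory.create().get();
        int rerun = factory.create().get();
        assert first == rerun;
        System.out.println(first + " == " + rerun);
    }
}
```

The immutability assumptions mentioned above are what make this safe: since the child cannot mutate parent state and the parent state it observes is constant, the captured values are all the context a restart needs.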

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

Are there TaskFactorys for non-activity tasks, too? I thought that was only for activity tasks.

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

Are there TaskFactorys for non-activity tasks, too? I thought that was only for activity tasks.

So, I see this:

  public static <T> TaskFactory<T> threaded(final Supplier<T> task) {
    return executor -> new ThreadedTask<>(executor, ModelActions.context, task);
  }

But, can I get from a TaskId (or SpanId?) to a TaskFactory like I can with an ActivityId?

@bradNASA
Copy link
Contributor

bradNASA commented Oct 3, 2023

Saving the TaskFactory is basically the same as saving the lambda. I fear these could be very large in comparison to ids for Clipper, for example.

@Twisol
Copy link
Contributor Author

Twisol commented Oct 4, 2023

Are there TaskFactorys for non-activity tasks, too? I thought that was only for activity tasks.

Yes, every task is given by a task factory. Both the spawn method used by tasks to spawn children, as well as the TaskStatus#CallingTask status used by tasks to call (block on) children, accept a TaskFactory. This design was put in place very early, specifically to allow us to restart tasks for incremental resimulation.

So, I see this: [...] But, can I get from a TaskId (or SpanId?) to a TaskFactory like I can with an ActivityId?

I'm not confident I understand how the code you've quoted relates to your question. At the moment, the engine throws away task factories after they're used once; holding on to them for longer (and associating them with task IDs) would definitely be necessary for incremental resimulation. The current engine is not incremental, so it doesn't keep factories around longer than it needs them.

The alternative is, as you mentioned, re-running tasks up to the point where the desired task was spawned. As you (also) mentioned, that approach itself comes with tradeoffs -- mostly trading space for time, I think. The major issue I see is that, for daemon tasks, we might need to step the task forward many, many times to get back to the point the daemon was at, whether it's using a while loop or is trampolined by spawn. At least with spawn-based trampolining, the model can cache a replaying() TaskFactory once and use it for every subsequent spawn; we can then detect that and avoid creating and storing a new task factory for every iteration.
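The caching point above can be made concrete with a toy illustration: if the model reuses one cached factory instance for every trampolined spawn, a hypothetical engine that tracks factories by identity stores only one entry regardless of iteration count. (This is a sketch of the idea, not Aerie's engine.)

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;
import java.util.function.Supplier;

public class TrampolineSketch {
    interface TaskFactory<T> { Supplier<T> create(); }

    public static void main(String[] args) {
        // The model caches one replaying factory up front...
        TaskFactory<String> daemon = () -> () -> "tick";

        // ...and an engine tracking factories by identity stores only one
        // entry, no matter how many iterations trampoline through it.
        Set<TaskFactory<?>> stored =
            Collections.newSetFromMap(new IdentityHashMap<>());
        for (int i = 0; i < 1000; i++) {
            stored.add(daemon); // each spawn passes the same cached instance
        }
        assert stored.size() == 1;
        System.out.println(stored.size() + " factory stored for 1000 spawns");
    }
}
```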

Saving the TaskFactory is basically the same as saving the lambda. I fear these could be very large in comparison to ids for Clipper, for example.

That depends on how much data the lambda references from outside its scope. Java (as with most languages) performs closure conversion on lambdas, meaning that the code of the lambda itself is only defined once ever. When a new instance of that lambda is created, its constructor accepts any data the lambda needs to reference from the caller's scope. So, the amount of data the lambda pins in memory long-term is going to depend precisely on how many things in the caller's scope the lambda depends on.

This is exactly the same situation as with activities: an activity contains a number of fields, and the activity's run method reads those fields. As long as the activity instance lives, so too do its fields. So you can think of closure conversion as a kind of "promotion" to anonymous activity types.
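The analogy above can be made concrete: a captured lambda and an explicit activity-like object pin exactly the same data. (GrowActivity is a made-up example type for this sketch.)

```java
import java.util.function.Supplier;

public class ClosureConversionSketch {
    // An activity-style type: its inputs are held explicitly as fields.
    record GrowActivity(double rate, double hours) {
        double run() { return rate * hours; }
    }

    public static void main(String[] args) {
        double rate = 2.5, hours = 8.0;
        // Closure conversion: javac generates a hidden class whose
        // constructor takes (rate, hours) -- structurally the same as the
        // record above, just anonymous.
        Supplier<Double> anonymous = () -> rate * hours;
        assert anonymous.get() == new GrowActivity(rate, hours).run();
    }
}
```

So the memory cost of saving a lambda is the cost of its captured fields, exactly as saving an activity instance costs its fields.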

@mattdailis mattdailis added the simulation Anything related to the simulation domain label Oct 6, 2023
@DavidLegg DavidLegg mentioned this pull request Oct 24, 2023
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 14d5fab to 86cf706 Compare February 1, 2024 20:45
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 86cf706 to 01c470d Compare March 1, 2024 14:32
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 01c470d to 3fc628d Compare March 1, 2024 15:30
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 3fc628d to 8edb2fd Compare March 19, 2024 18:19
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 8edb2fd to 230edf8 Compare March 19, 2024 19:45
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 230edf8 to d201218 Compare March 21, 2024 19:27
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from d201218 to 536a764 Compare March 27, 2024 03:19
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 536a764 to c3244b8 Compare March 27, 2024 03:21
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from c3244b8 to 65494ad Compare March 27, 2024 03:33
@mattdailis mattdailis force-pushed the feature/castello--engine-efficiency branch from 99bae03 to e5ae0f5 Compare March 28, 2024 19:33
@mattdailis mattdailis merged commit 53643b3 into NASA-AMMOS:develop Mar 28, 2024
5 of 6 checks passed
@Twisol Twisol deleted the feature/castello--engine-efficiency branch March 28, 2024 22:57