Step Function Retry State Not Persisted Across Application Restarts #683

Open
serejke opened this issue Dec 18, 2024 · 9 comments

Comments

@serejke

serejke commented Dec 18, 2024

From the `callStepFunction` method in `dbos-executor.ts`:

while (result === dbosNull && numAttempts++ < ctxt.maxAttempts) {

Description

Currently, step function retry logic is maintained only in memory, which leads to a durability issue when the application restarts. This means that if a step function is configured with retries (e.g., max 5 retries with 1-hour delays) and the application crashes or restarts, the retry count and state are lost, and the step function will start fresh with 0 retries when the application comes back up.
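
For illustration, a step configured roughly like this (a sketch only: the decorator and option names follow the DBOS Transact TS docs as I understand them and may differ by SDK version; older releases use `@Communicator`/`CommunicatorContext`):

```typescript
import { Step, StepContext } from "@dbos-inc/dbos-sdk";

class NotificationSteps {
  // Retry up to 5 times, waiting 1 hour between attempts.
  // Today the attempt counter lives only in process memory, so a crash or
  // redeploy during those hours resets it to 0.
  @Step({ retriesAllowed: true, maxAttempts: 5, intervalSeconds: 3600 })
  static async sendNotification(ctxt: StepContext, userId: string): Promise<void> {
    // ... call an external notification service here ...
  }
}
```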

Current Behavior

  1. Step function fails and starts retry mechanism
  2. Application crashes/restarts during retry process
  3. When application restarts, the step function's retry count resets to 0
  4. All previous retry attempts are lost, effectively allowing more retries than configured

Expected Behavior

  1. Step function's retry state should be persisted in the system database
  2. After application restart, the step function should resume with its previous retry count
  3. Total number of retries should respect the configured maximum across application restarts

Technical Details

The issue is in the `callStepFunction` method in `dbos-executor.ts` (the loop quoted at the top of this issue): the attempt counter and the retry delay are maintained purely in memory, so neither survives a process restart.
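
In simplified form (a paraphrase of the shape of the loop, not the actual source), the counter is just a local variable:

```typescript
// Paraphrase of the in-memory retry loop (all names simplified; dbosNull
// stands in for the executor's "no result yet" sentinel).
const dbosNull = Symbol("dbosNull");
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function runStepWithRetries<T>(
  stepFn: () => Promise<T>,
  maxAttempts: number,
  intervalSeconds: number,
): Promise<T> {
  let numAttempts = 0; // lives only in process memory
  let result: T | typeof dbosNull = dbosNull;
  while (result === dbosNull && numAttempts++ < maxAttempts) {
    try {
      result = await stepFn();
    } catch (err) {
      if (numAttempts >= maxAttempts) throw err;
      await sleep(intervalSeconds * 1000); // in-memory delay, not durable
    }
  }
  return result as T;
}
```

A crash anywhere in this loop discards `numAttempts`, so on recovery the step starts over from attempt 0.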

Impact

This issue affects the reliability and consistency of step function retry policies, particularly for long-running retry scenarios or in production environments where application restarts are common.

Suggested Solution

  1. Add a new table in the system database to track step function retry state
  2. Store attempt count and last retry timestamp for each step function execution (a rough sketch of the shape is below)
  3. On application restart, recover the retry state from the database
  4. Use the persisted state to continue retries from where they left off
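
Roughly, the persisted state could look like this (hypothetical names, just to illustrate the shape):

```typescript
// Hypothetical shape of a persisted retry-state record (illustrative only).
interface StepRetryState {
  workflowUuid: string; // workflow execution the step belongs to
  functionId: number;   // position of the step within that workflow
  attempts: number;     // attempts made so far, surviving restarts
  lastAttemptAt: Date;  // lets the executor honor the retry delay after a restart
}
```

On restart, the executor would load this record before re-entering the retry loop and continue from `attempts` instead of 0.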

Labels

  • bug
  • reliability
  • durability
@kraftp
Member

kraftp commented Dec 19, 2024

Thanks for filing this issue!

This omission is intentional—it isn't clear to us that it would be useful to make the number of retries durable.

That said, we'd be happy to support it if there was a clear use-case for it (and like you say, it wouldn't be very hard).

Would love to learn more about what you're building! Under what circumstances does the number of retries need to be durable?

@serejke
Author

serejke commented Dec 20, 2024

I believe the effective number of retries must not depend on the liveness of the application. The retry policy is a business logic's decision chosen carefully for each step of a complex workflow. It should be straightforward to reason about the durable business logic rules carried out by DBOS, without having to think through corner cases like restarts.

Example: if a workflow step requires a retry delay of 1 hour (e.g. to avoid flooding users with notifications) and a maximum of 5 retry attempts, the total "lifetime" of this step is 5 hours. A lot can happen during those 5 hours: the app can crash multiple times for unrelated reasons, or it may keep restarting every 4 hours because of redeployments. In that case the retry count is reset every time, so the effective number of retries is potentially unbounded.

@kraftp
Member

kraftp commented Dec 20, 2024

Got it, then this is something we'll take a look at, although likely not until after the release of TSv2.

@chuck-dbos
Collaborator

Agree with @kraftp ... I thought this was left out intentionally. The retry logic in steps is very primitive, and the configuration quite limited.

Once you get beyond that, you'd really be better off making your retry processing into a workflow. Then you have durability on each attempt at performing the step, and durability on the sleeps, and full customizability on the logic of how long each sleep should be, whether different errors are handled differently, etc. As you said, "the retry policy is a business logic's decision chosen carefully for each step of a complex workflow", and that is what workflows are good at, so child workflows come to mind first.
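
Roughly along these lines (a sketch only; `callExpensiveApi` is a placeholder, and the exact decorator and context names depend on your SDK version, with older releases using `@Communicator`):

```typescript
import { Step, StepContext, Workflow, WorkflowContext } from "@dbos-inc/dbos-sdk";

class PaymentOps {
  // The step itself does no retries; every attempt is durably recorded.
  @Step({ retriesAllowed: false })
  static async callExpensiveApi(ctxt: StepContext, requestId: string): Promise<string> {
    // ... call the external API here ...
    return `result for ${requestId}`;
  }

  // The retry policy lives in the workflow, so the attempt count and the
  // sleeps are durable and survive restarts.
  @Workflow()
  static async callWithDurableRetries(ctxt: WorkflowContext, requestId: string): Promise<string> {
    const maxAttempts = 5;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return await ctxt.invoke(PaymentOps).callExpensiveApi(requestId);
      } catch (err) {
        if (attempt === maxAttempts) throw err;
        await ctxt.sleep(3600); // durable 1-hour sleep; resumes correctly after a crash
      }
    }
    throw new Error("unreachable");
  }
}
```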

If we implement this, I would not suggest adding new tables and so on, but rather automatically wrapping step calls in retry workflows. Then they would get the benefit of existing durability, debugging, tracing, etc. We already have the concept of "temp" workflows that wrap steps when no workflow is in progress; this would be an extension of that mechanism.

@paradoxloop

@kraftp A reason to make retries durable, or at least to make the in-memory behavior explicit, is that people might rely on retries to cap costs, such as calling a $10 endpoint only 3 times before failing permanently. Keeping the count only in memory removes that cap, allowing effectively unlimited retries across restarts. Retries should be durable because the retry count reflects a business decision.

The workaround of storing "attempts" separately from the DBOS retry count, which resets on crashes, feels odd.

@chuck-dbos
Collaborator

Disagree. Retries are lightweight. In your hypothetical example, if you are only going to try 3 times at $10 each, I would think you'd want to differentiate between transient and non-transient errors that might indicate whether you'd be charged for the failure or not (maybe a network error that prevented any contact at all). That's a workflow decision, not a communicator retry. Then the retry state will persist, and any information retrieved that would help with diagnostics will also persist. Structured that way, it becomes (up to) 3 separate calls to the step rather than one call with a retry loop inside.
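
The classification itself is just ordinary application code in that workflow's catch block; a sketch (`isTransient` is a made-up helper, not part of DBOS):

```typescript
// Made-up helper: treat connection-level failures (where the provider was
// likely never reached) as transient and therefore safe to retry.
function isTransient(err: unknown): boolean {
  const code = (err as { code?: string } | null)?.code;
  return code === "ECONNREFUSED" || code === "ETIMEDOUT";
}
```

In the workflow sketched above, you would rethrow immediately on non-transient errors and only sleep and retry on transient ones.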

@chuck-dbos
Collaborator

That's another thing: I hope that if you really have a $10/call API, it accepts a request key, so you can retrieve the result of a request that already occurred rather than paying for a new one.

@paradoxloop

I’m making two distinct points that I confusingly merged into one paragraph—my mistake:

  1. Explicitness of Current Behavior
    The documentation states "it retries step functions a set number of times", which seems like a strong assertion about the retry count.

    However, if retry counts are stored in memory, they’re more like a minimumNumberOfRetries rather than a numberOfRetries, since the actual count could be infinite if crashes occur at the wrong time (e.g., during fetch/axios return data parsing). This should be clarified in the documentation.

  2. Desire for Durable Retry Count
    If making retry counts durable is "cheap," it would make the DBOS system easier to reason about and allow for more explicit configuration of system behavior.


Regarding your aside on API behavior:

That's another thing: I hope that if you really have a $10/call API, it accepts a request key, so you can retrieve the result of a request that already occurred rather than paying for a new one.

I agree, but unfortunately, some APIs we work with charge for requests that return unknown errors and lack list endpoints to confirm if a request succeeded.

@chuck-dbos
Collaborator

We can adjust the doc for clarity. Note that because we perform the step and only then record its result, the configured retry count is always going to be a lower bound on the number of attempts that may actually be made.

I'm not saying you shouldn't be able to get more precise behavior; I'm saying that workflows are the way to do it, not automatic step retries.
