Step Function Retry State Not Persisted Across Application Restarts #683

Open
serejke opened this issue Dec 18, 2024 · 9 comments

Comments

@serejke

serejke commented Dec 18, 2024

From the `callStepFunction` method in `dbos-executor.ts`:

while (result === dbosNull && numAttempts++ < ctxt.maxAttempts) {

Description

Currently, step function retry logic is maintained only in memory, which leads to a durability issue when the application restarts. This means that if a step function is configured with retries (e.g., max 5 retries with 1-hour delays) and the application crashes or restarts, the retry count and state are lost, and the step function will start fresh with 0 retries when the application comes back up.
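
For illustration, a step configured roughly like this (a sketch only: the decorator and option names follow the DBOS Transact TS docs as I understand them and may differ by SDK version; older releases use `@Communicator`/`CommunicatorContext`):

```typescript
import { Step, StepContext } from "@dbos-inc/dbos-sdk";

class NotificationSteps {
  // Retry up to 5 times, waiting 1 hour between attempts.
  // Today the attempt counter lives only in process memory, so a crash or
  // redeploy during those hours resets it to 0.
  @Step({ retriesAllowed: true, maxAttempts: 5, intervalSeconds: 3600 })
  static async sendNotification(ctxt: StepContext, userId: string): Promise<void> {
    // ... call an external notification service here ...
  }
}
```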

Current Behavior

  1. Step function fails and starts retry mechanism
  2. Application crashes/restarts during retry process
  3. When application restarts, the step function's retry count resets to 0
  4. All previous retry attempts are lost, effectively allowing more retries than configured

Expected Behavior

  1. Step function's retry state should be persisted in the system database
  2. After application restart, the step function should resume with its previous retry count
  3. Total number of retries should respect the configured maximum across application restarts

Technical Details

The issue is in the `callStepFunction` method in `dbos-executor.ts` (the loop quoted at the top of this issue): the attempt counter and the retry delay are maintained purely in memory, so neither survives a process restart.
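
In simplified form (a paraphrase of the shape of the loop, not the actual source), the counter is just a local variable:

```typescript
// Paraphrase of the in-memory retry loop (all names simplified; dbosNull
// stands in for the executor's "no result yet" sentinel).
const dbosNull = Symbol("dbosNull");
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function runStepWithRetries<T>(
  stepFn: () => Promise<T>,
  maxAttempts: number,
  intervalSeconds: number,
): Promise<T> {
  let numAttempts = 0; // lives only in process memory
  let result: T | typeof dbosNull = dbosNull;
  while (result === dbosNull && numAttempts++ < maxAttempts) {
    try {
      result = await stepFn();
    } catch (err) {
      if (numAttempts >= maxAttempts) throw err;
      await sleep(intervalSeconds * 1000); // in-memory delay, not durable
    }
  }
  return result as T;
}
```

A crash anywhere in this loop discards `numAttempts`, so on recovery the step starts over from attempt 0.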

Impact

This issue affects the reliability and consistency of step function retry policies, particularly for long-running retry scenarios or in production environments where application restarts are common.

Suggested Solution

  1. Add a new table in the system database to track step function retry state
  2. Store attempt count and last retry timestamp for each step function execution (a rough sketch of the shape is below)
  3. On application restart, recover the retry state from the database
  4. Use the persisted state to continue retries from where they left off
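
Roughly, the persisted state could look like this (hypothetical names, just to illustrate the shape):

```typescript
// Hypothetical shape of a persisted retry-state record (illustrative only).
interface StepRetryState {
  workflowUuid: string; // workflow execution the step belongs to
  functionId: number;   // position of the step within that workflow
  attempts: number;     // attempts made so far, surviving restarts
  lastAttemptAt: Date;  // lets the executor honor the retry delay after a restart
}
```

On restart, the executor would load this record before re-entering the retry loop and continue from `attempts` instead of 0.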

Labels

  • bug
  • reliability
  • durability
@kraftp
Member

kraftp commented Dec 19, 2024

Thanks for filing this issue!

This omission is intentional—it isn't clear to us that it would be useful to make the number of retries durable.

That said, we'd be happy to support it if there was a clear use-case for it (and like you say, it wouldn't be very hard).

Would love to learn more about what you're building! Under what circumstances does the number of retries need to be durable?

@serejke
Author

serejke commented Dec 20, 2024

I believe the effective number of retries must not depend on the liveness of the application. The retry policy is a business logic's decision chosen carefully for each step of a complex workflow. It should be straightforward to reason about the durable business logic rules carried out by DBOS, without having to think through corner cases like restarts.

Example: if a workflow step requires a retry delay of 1 hour (e.g. to avoid flooding users with notifications) and a maximum of 5 retry attempts, the total "lifetime" of this step is 5 hours. A lot can happen during those 5 hours: the app can crash multiple times for unrelated reasons, or it may keep restarting every 4 hours because of redeployments. In that case the retry count is reset every time, so the effective number of retries is potentially unbounded.

@kraftp
Member

kraftp commented Dec 20, 2024

Got it, then this is something we'll take a look at, although likely not until after the release of TSv2.

@chuck-dbos
Collaborator

Agree with @kraftp ... I thought this was left out intentionally. The retry logic in steps is very primitive, and the configuration quite limited.

Once you get beyond that, you'd really be better off making your retry processing into a workflow. Then you have durability on each attempt at performing the step, and durability on the sleeps, and full customizability on the logic of how long each sleep should be, whether different errors are handled differently, etc. As you said, "the retry policy is a business logic's decision chosen carefully for each step of a complex workflow", and that is what workflows are good at, so child workflows come to mind first.
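
Roughly along these lines (a sketch only; `callExpensiveApi` is a placeholder, and the exact decorator and context names depend on your SDK version, with older releases using `@Communicator`):

```typescript
import { Step, StepContext, Workflow, WorkflowContext } from "@dbos-inc/dbos-sdk";

class PaymentOps {
  // The step itself does no retries; every attempt is durably recorded.
  @Step({ retriesAllowed: false })
  static async callExpensiveApi(ctxt: StepContext, requestId: string): Promise<string> {
    // ... call the external API here ...
    return `result for ${requestId}`;
  }

  // The retry policy lives in the workflow, so the attempt count and the
  // sleeps are durable and survive restarts.
  @Workflow()
  static async callWithDurableRetries(ctxt: WorkflowContext, requestId: string): Promise<string> {
    const maxAttempts = 5;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return await ctxt.invoke(PaymentOps).callExpensiveApi(requestId);
      } catch (err) {
        if (attempt === maxAttempts) throw err;
        await ctxt.sleep(3600); // durable 1-hour sleep; resumes correctly after a crash
      }
    }
    throw new Error("unreachable");
  }
}
```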

If we implement this, I would not suggest adding new tables and so on, but rather automatically wrapping step calls in retry workflows. Then they would get the benefit of existing durability, debugging, tracing, etc. We already have the concept of "temp" workflows that wrap steps when no workflow is in progress; this would be an extension of that mechanism.

@paradoxloop

@kraftp A reason to make retries durable, or at least to make the in-memory behavior explicit, is that people might rely on retries to cap costs, such as calling a $10 endpoint only 3 times before failing permanently. Keeping the count only in memory removes that cap, allowing effectively unlimited retries across restarts. Retries should be durable because the retry count reflects a business decision.

The workaround of storing "attempts" separately from the DBOS retry count, which resets on crashes, feels odd.

@chuck-dbos
Collaborator

Disagree. Retries are lightweight. In your hypothetical example, if you are only going to try 3 times at $10 each, I would think you'd want to differentiate between transient and non-transient errors that might indicate whether you'd be charged for the failure or not (maybe a network error that prevented any contact at all). That's a workflow decision, not a communicator retry. Then the retry state will persist, and any information retrieved that would help with diagnostics will also persist. Structured that way, it becomes (up to) 3 separate calls to the step rather than one call with a retry loop inside.
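
The classification itself is just ordinary application code in that workflow's catch block; a sketch (`isTransient` is a made-up helper, not part of DBOS):

```typescript
// Made-up helper: treat connection-level failures (where the provider was
// likely never reached) as transient and therefore safe to retry.
function isTransient(err: unknown): boolean {
  const code = (err as { code?: string } | null)?.code;
  return code === "ECONNREFUSED" || code === "ETIMEDOUT";
}
```

In the workflow sketched above, you would rethrow immediately on non-transient errors and only sleep and retry on transient ones.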

@chuck-dbos
Collaborator

That's another thing: I hope that if you really have a $10/call API, it accepts a request key, so you can retrieve the result of a request that already occurred rather than paying for a new one.

@paradoxloop

I’m making two distinct points that I confusingly merged into one paragraph—my mistake:

  1. Explicitness of Current Behavior
    The documentation states "it retries step functions a set number of times", which seems like a strong assertion about the retry count.

    However, if retry counts are stored in memory, they’re more like a minimumNumberOfRetries rather than a numberOfRetries, since the actual count could be infinite if crashes occur at the wrong time (e.g., during fetch/axios return data parsing). This should be clarified in the documentation.

  2. Desire for Durable Retry Count
    If making retry counts durable is "cheap," it would make the DBOS system easier to reason about and allow for more explicit configuration of system behavior.


Regarding your aside on API behavior:

That's another thing: I hope that if you really have a $10/call API, it accepts a request key, so you can retrieve the result of a request that already occurred rather than paying for a new one.

I agree, but unfortunately, some APIs we work with charge for requests that return unknown errors and lack list endpoints to confirm if a request succeeded.

@chuck-dbos
Collaborator

We can adjust the doc for clarity. Note that because we perform the step and only then record its result, the configured retry count is always going to be a lower bound on the number of attempts that may actually be made.

I'm not saying you shouldn't be able to get more precise behavior; I'm saying that workflows are the way to do it, not automatic step retries.
