-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Step Function Retry State Not Persisted Across Application Restarts #683
Comments
Thanks for filing this issue! This omission is intentional—it isn't clear to us that it would be useful to make the number of retries durable. That said, we'd be happy to support it if there was a clear use-case for it (and like you say, it wouldn't be very hard). Would love to learn more about what you're building! Under what circumstances does the number of retries need to be durable? |
I believe the effective number of retries must not depend on the liveness of the application. The retry policy is a business logic's decision chosen carefully for each step of a complex workflow. It should be straightforward to reason about the durable business logic rules carried out by DBOS than to think of corner cases. Example: If a workflow step requires a retry delay of 1 hour (e.g. to avoid flooding users with notifications), and max 5 retry attempts, the total "lifetime" of this step is 5 hours. During the 5 hours of the application a lot can happen: the app can crash multiple times, for unrelated reasons. Maybe the app keeps restarting every 4 hours because of redeployment. In this case the retries count is reset every time, potentially being infinite. |
Got it, then this is something we'll take a look at, although likely not until after the release of TSv2. |
Agree with @kraftp ... I thought this was left out intentionally. The retry logic in steps is very primitive, and the configuration quite limited. Once you get beyond that, you'd really be better off making your retry processing into a workflow. Then you have durability on each attempt at performing the step, and durability on the sleeps, and full customizability on the logic of how long each sleep should be, whether different errors are handled differently, etc. As you said, "the retry policy is a business logic's decision chosen carefully for each step of a complex workflow", and that is what workflows are good at, so child workflows come to mind first. If we implement this, I would not suggest adding new tables and so on, but automatically wrapping the step communications in retry workflows. Then they would get the benefit of existing durability, debug, tracing, etc. We already have the concept of "temp" workflows that wrap steps, in case no workflow is in progress, this would be an extension of that mechanism. |
@kraftp A reason to make retries durable or the in-memory choice explicit is that people might rely on retries to cap costs, such as calling a $10 endpoint only 3 times before failing permanently. Storing this in memory removes that limit, allowing unlimited retries. Retries should be durable because the retry count reflects a business decision. The workaround of storing "attempts" separately from the DBO retry count, which resets on crashes, feels odd. |
Disagree. Retries are lightweight. In your hypothetical example, if you are only going to try 3 times at $10 each I would think you'd want to differentiate between transient and non-transient errors that might indicate whether you'd be charged for the failure or not (maybe a network error that prevented any contact at all). That's a workflow decision, not a communicator retry. Then, the retry state will persist. And any information retrieved that would help with diagnostics will also persist. As it is (up to) 3 calls to the step, not one with a loop inside. |
That's another thing, I hope if you really have $10/call API, it accepts a request key, so you can retrieve the request result if it already occurred, rather than paying for a new one. |
I’m making two distinct points that I confusingly merged into one paragraph—my mistake:
Regarding your aside on API behavior:
I agree, but unfortunately, some APIs we work with charge for requests that return unknown errors and lack list endpoints to confirm if a request succeeded. |
We can adjust the doc for clarity. Note that because we perform the step, then write down the result, the retry count is always going to be a lower bound on the number of attempts that will be made. I'm not saying you shouldn't be able to get a more precise behavior, I'm saying that workflows is the way, not automatic step retries. |
dbos-transact-ts/src/dbos-executor.ts
Line 1103 in 21e8a0e
Description
Currently, step function retry logic is maintained only in memory, which leads to a durability issue when the application restarts. This means that if a step function is configured with retries (e.g., max 5 retries with 1-hour delays) and the application crashes or restarts, the retry count and state are lost, and the step function will start fresh with 0 retries when the application comes back up.
Current Behavior
Expected Behavior
Technical Details
The issue exists in
dbos-executor.ts
in thecallStepFunction
method. The retry logic is maintained purely in memoryImpact
This issue affects the reliability and consistency of step function retry policies, particularly for long-running retry scenarios or in production environments where application restarts are common.
Suggested Solution
Labels
The text was updated successfully, but these errors were encountered: