You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
The scheduler.base, scheduler.cap and Retry_Limit discussion in the docs looks like it's talking about exponential backoff but it's actually a very difficult to reason random variable.
In exponential backoff, you can trivially compute the expected wait time at each retry (plus or minus the random jitter), as well as the cumulative wait time up to and including that retry.
In fluent-bit, the cumulative wait time at each retry is the expected value of a sum of independent uniformly distributed random variables, with expanding upper bound on the range.
The formula ends up being this:
Analytical Formula:
Let’s define $s$ as the minimum of $n$ and the largest integer $k$ such that:
If we take base 7, cap 3600, we get the following $E[T]$ curve, compared to normal exponential backoff:
Note that it's sub-exponential.
Further, as we increase in retries, the distribution for the cumulative wait time becomes more and more uniform (which is bad and not at all what we want for retries):
It's not visible in the graphs but it's also possible to have purely linear retries under the current algorithm.
Expected behavior
Fluent-bit should use normal exponential backoff with either a fixed range for the jitter, or a ratio (e.g. 10% jitter). The current approach is an operational nightmare with horrible long tail behavior.
This should probably be behind a flag that toggles between the two computations.
The text was updated successfully, but these errors were encountered:
Bug Report
Describe the bug
The scheduler.base, scheduler.cap and Retry_Limit discussion in the docs looks like it's talking about exponential backoff but it's actually a very difficult to reason random variable.
In exponential backoff, you can trivially compute the expected wait time at each retry (plus or minus the random jitter), as well as the cumulative wait time up to and including that retry.
In fluent-bit, the cumulative wait time at each retry is the expected value of a sum of independent uniformly distributed random variables, with expanding upper bound on the range.
The formula ends up being this:
Analytical Formula:
Let’s define$s$ as the minimum of $n$ and the largest integer $k$ such that:
or equivalently:
The total expected wait time becomes:
(I can share the derivation if you care about it)
If we take base 7, cap 3600, we get the following$E[T]$ curve, compared to normal exponential backoff:
Note that it's sub-exponential.
Further, as we increase in retries, the distribution for the cumulative wait time becomes more and more uniform (which is bad and not at all what we want for retries):
It's not visible in the graphs but it's also possible to have purely linear retries under the current algorithm.
Expected behavior
Fluent-bit should use normal exponential backoff with either a fixed range for the jitter, or a ratio (e.g. 10% jitter). The current approach is an operational nightmare with horrible long tail behavior.
This should probably be behind a flag that toggles between the two computations.
The text was updated successfully, but these errors were encountered: