Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fluent bit retry behavior is sub-exponential and very hard to reason about. #9634

Open
dkratunov opened this issue Nov 22, 2024 · 0 comments

Comments

@dkratunov
Copy link

dkratunov commented Nov 22, 2024

Bug Report

Describe the bug
The scheduler.base, scheduler.cap and Retry_Limit discussion in the docs looks like it's talking about exponential backoff but it's actually a very difficult to reason random variable.

In exponential backoff, you can trivially compute the expected wait time at each retry (plus or minus the random jitter), as well as the cumulative wait time up to and including that retry.

In fluent-bit, the cumulative wait time at each retry is the expected value of a sum of independent uniformly distributed random variables, with expanding upper bound on the range.

The formula ends up being this:

Analytical Formula:

Let’s define $s$ as the minimum of $n$ and the largest integer $k$ such that:

$$\text{base} \times 2^k \leq \text{cap}$$

or equivalently:

$$s = \min\left(n, \left\lfloor \log_2\left(\frac{\text{cap}}{\text{base}}\right) \right\rfloor \right)$$

The total expected wait time becomes:

$$E[T] = \frac{1}{2} \left[ \text{base} \times (n + 1) + \text{base} \times \left(2^{s+1} - 1\right) + (\text{cap} \times (n - s)) \right]$$

(I can share the derivation if you care about it)

If we take base 7, cap 3600, we get the following $E[T]$ curve, compared to normal exponential backoff:

Note that it's sub-exponential.

Further, as we increase in retries, the distribution for the cumulative wait time becomes more and more uniform (which is bad and not at all what we want for retries):

It's not visible in the graphs but it's also possible to have purely linear retries under the current algorithm.

Expected behavior
Fluent-bit should use normal exponential backoff with either a fixed range for the jitter, or a ratio (e.g. 10% jitter). The current approach is an operational nightmare with horrible long tail behavior.

This should probably be behind a flag that toggles between the two computations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant