Fluent bit retry behavior is sub-exponential and very hard to reason about. #9634

dkratunov · 2024-11-22T18:44:53Z

Bug Report

Describe the bug
The scheduler.base, scheduler.cap and Retry_Limit discussion in the docs looks like it's talking about exponential backoff but it's actually a very difficult to reason random variable.

In exponential backoff, you can trivially compute the expected wait time at each retry (plus or minus the random jitter), as well as the cumulative wait time up to and including that retry.

In fluent-bit, the cumulative wait time at each retry is the expected value of a sum of independent uniformly distributed random variables, with expanding upper bound on the range.

The formula ends up being this:

Analytical Formula:

Let’s define $s$ as the minimum of $n$ and the largest integer $k$ such that:

$$\text{base} \times 2^k \leq \text{cap}$$

or equivalently:

$$s = \min\left(n, \left\lfloor \log_2\left(\frac{\text{cap}}{\text{base}}\right) \right\rfloor \right)$$

The total expected wait time becomes:

$$E[T] = \frac{1}{2} \left[ \text{base} \times (n + 1) + \text{base} \times \left(2^{s+1} - 1\right) + (\text{cap} \times (n - s)) \right]$$

(I can share the derivation if you care about it)

If we take base 7, cap 3600, we get the following $E[T]$ curve, compared to normal exponential backoff:

Note that it's sub-exponential.

Further, as we increase in retries, the distribution for the cumulative wait time becomes more and more uniform (which is bad and not at all what we want for retries):

It's not visible in the graphs but it's also possible to have purely linear retries under the current algorithm.

Expected behavior
Fluent-bit should use normal exponential backoff with either a fixed range for the jitter, or a ratio (e.g. 10% jitter). The current approach is an operational nightmare with horrible long tail behavior.

This should probably be behind a flag that toggles between the two computations.

dkratunov added the status: waiting-for-triage label Nov 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fluent bit retry behavior is sub-exponential and very hard to reason about. #9634

Fluent bit retry behavior is sub-exponential and very hard to reason about. #9634

dkratunov commented Nov 22, 2024 •

edited

Loading

Fluent bit retry behavior is sub-exponential and very hard to reason about. #9634

Fluent bit retry behavior is sub-exponential and very hard to reason about. #9634

Comments

dkratunov commented Nov 22, 2024 • edited Loading

Bug Report

Analytical Formula:

dkratunov commented Nov 22, 2024 •

edited

Loading