From 457fe6c8c6e29d7211f57ca839140ac733369df5 Mon Sep 17 00:00:00 2001
From: Bryan Boreham
Date: Fri, 20 Dec 2024 17:08:02 +0000
Subject: [PATCH] Runbook: clarify MimirIngesterReachingSeriesLimit errors and retries (#9410)

* Runbook: clarify MimirIngesterReachingSeriesLimit errors and retries

Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
---
 docs/sources/mimir/manage/mimir-runbooks/_index.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/docs/sources/mimir/manage/mimir-runbooks/_index.md b/docs/sources/mimir/manage/mimir-runbooks/_index.md
index 5cce349f120..ef4bb7884e5 100644
--- a/docs/sources/mimir/manage/mimir-runbooks/_index.md
+++ b/docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -41,7 +41,15 @@ If nothing obvious from the above, check for increased load:
 
 ### MimirIngesterReachingSeriesLimit
 
-This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new series, while appending samples to existing ones will continue to succeed.
+This alert fires when the `max_series` per ingester instance limit is enabled and the actual number of in-memory series in an ingester is close to reaching the limit.
+The threshold is set at 80% to give the chance to react before the limit is reached.
+After the limit is reached, write requests to the ingester fail for new series. Appending samples to existing ones continue to succeed.
+
+Note that the error responses sent back to the sender are classified as "server errors" (5xx), which should result in a retry by the sender.
+While this situation continues, these retries stall the flow of data, and newer data queues up on the sender.
+If the condition is cleared in a short time, service can be restored with no data loss.
+
+This is different to what happens when the `max_global_series_per_user` limit is exceeded, which is considered a "client error" (4xx). In this case, excess data is discarded.
 
 In case of **emergency**:
 
@@ -123,7 +131,7 @@ How to **fix** it:
 
 ### MimirIngesterReachingTenantsLimit
 
-This alert fires when the `max_tenants` per ingester instance limit is enabled and the actual number of tenants in an ingester is reaching the limit. Once the limit is reached, writes to the ingester will fail (5xx) for new tenants, while they will continue to succeed for previously existing ones.
+This alert fires when the `max_tenants` per ingester instance limit is enabled and the actual number of tenants in an ingester is reaching the limit. Once the limit is reached, write requests to the ingester will fail (5xx) for new tenants, while they will continue to succeed for previously existing ones.
 
 The per-tenant memory utilisation in ingesters includes the overhead of allocations for TSDB stripes and chunk writer buffers. If the tenant number is high, this may contribute significantly to the total ingester memory utilization. The size of these allocations is controlled by `-blocks-storage.tsdb.stripe-size` (default 16KiB) and `-blocks-storage.tsdb.head-chunks-write-buffer-size-bytes` (default 4MiB), respectively.
 