
RAFT leadership transfers and health check failures #6079

Open · slice-arpitkhatri opened this issue Nov 5, 2024 · 12 comments

Labels: defect Suspected defect such as a bug or regression

@slice-arpitkhatri commented Nov 5, 2024

Observed behavior

We've observed frequent RAFT leadership transfers of the $MQTT_PUBREL consumers and health check failures, even in a steady state. Occasionally, these issues escalate, causing sharp spikes in leadership transfers and health check failures, which lead to cluster downtime.

During these intense spikes, metrics from NATS Surveyor show an enormous surge in system messages, with counts reaching billions of messages per minute (metric name: nats_core_account_msgs_recv).

System details

  1. Peak load of 5k MQTT clients, each with 2 QoS 2 subscriptions, totaling 10k subscriptions across 10k MQTT topics.
  2. Messages produced at ~10 RPS
  3. A single NATS queue group subscription is used to consume MQTT-published messages on one topic.

Additional details

  1. Cluster of 3 nodes
  2. max_outstanding_catchup set to 128MB (see the config sketch below)
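
For context, a minimal sketch of how these settings would sit in the server config (the cluster name, route addresses, and store directory below are placeholders, not our actual values):

```
jetstream {
    store_dir: /data/jetstream          # placeholder path
    max_outstanding_catchup: 128MB      # the raised catch-up window noted above
}

cluster {
    name: nats                          # placeholder cluster name
    listen: 0.0.0.0:6222
    routes: [
        nats://nats-0.nats:6222
        nats://nats-1.nats:6222
        nats://nats-2.nats:6222
    ]
}
```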

Associated logs:

  • RAFT [cnrtt3eg - C-R3F-yMOeq7kb] Stepping down due to leadership transfer
  • Falling behind in health check, commit 3202757 != applied 3202742
  • Healthcheck failed: "JetStream is not current with the meta leader"
[Screenshot, 2024-11-05 22:29: leadership transfer]

NATS traffic in steady state (taken minutes after starting the pods):

[Screenshot, 2024-11-05 22:21]

nats-traffic-of-sys-account.txt

Expected behavior

No leadership transfers of consumers & no health check failures in steady state.

Server and client version

NATS Server version 2.10.22

Host environment

Kubernetes v1.25

Steps to reproduce

Set up a 3-node NATS cluster, start 5k MQTT connections with 10k QoS 2 subscriptions (2 per client), and publish QoS 2 messages at 10 RPS.
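
A rough reproduction sketch using the Eclipse Paho Go client is below; the broker URL, topic names, and client count are placeholders (scaled down), not the exact harness used here:

```go
// Rough reproduction sketch (not the original harness): open many MQTT
// connections, give each two QoS 2 subscriptions, and publish QoS 2
// messages at a fixed rate. Broker URL and client count are placeholders.
package main

import (
	"fmt"
	"log"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

const (
	broker     = "tcp://localhost:1883" // placeholder; point at the NATS MQTT listener
	numClients = 100                    // scale up towards 5000 to match the report
)

func main() {
	clients := make([]mqtt.Client, 0, numClients)

	for i := 0; i < numClients; i++ {
		opts := mqtt.NewClientOptions().
			AddBroker(broker).
			SetClientID(fmt.Sprintf("repro-client-%d", i)).
			SetCleanSession(false) // QoS 2 session state must survive reconnects
		c := mqtt.NewClient(opts)
		if tok := c.Connect(); tok.Wait() && tok.Error() != nil {
			log.Fatalf("connect %d: %v", i, tok.Error())
		}
		// Two QoS 2 subscriptions per client, each on its own topic.
		for s := 0; s < 2; s++ {
			topic := fmt.Sprintf("repro/%d/%d", i, s)
			if tok := c.Subscribe(topic, 2, func(_ mqtt.Client, _ mqtt.Message) {}); tok.Wait() && tok.Error() != nil {
				log.Fatalf("subscribe %s: %v", topic, tok.Error())
			}
		}
		clients = append(clients, c)
	}

	// Publish QoS 2 messages at ~10 RPS, rotating across the subscribed topics.
	pub := clients[0]
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; ; i++ {
		<-ticker.C
		topic := fmt.Sprintf("repro/%d/%d", i%numClients, i%2)
		if tok := pub.Publish(topic, 2, false, []byte("payload")); tok.Wait() && tok.Error() != nil {
			log.Printf("publish %s: %v", topic, tok.Error())
		}
	}
}
```

Scaling numClients towards 5k while keeping the 100ms publish ticker approximates the 10 RPS, 10k-subscription shape described above.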

@slice-arpitkhatri added the defect (Suspected defect such as a bug or regression) label on Nov 5, 2024
@neilalexander (Member)

Can you please provide more complete logs from around the times of the problem, as well as server configs?

Do you have account limits and/or max_file/max_mem set?

Normally, the only things that should cause leader transfers on streams in normal operation are a) an explicit step-down request, or b) hitting the configured JetStream system limits.

@slice-arpitkhatri (Author)

@neilalexander We do not have any account-level limits. max_file_store is 50GB and max_memory_store is 10GB.
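
For reference, a minimal sketch of where those limits sit in the server config (assuming they are set server-wide under the jetstream block rather than as per-account limits):

```
jetstream {
    max_memory_store: 10GB
    max_file_store: 50GB
}
```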

I've shared the config file and complete logs over email. Let me know if you want any additional details.

@neilalexander (Member)

I've taken a look at the logs you sent through, but it appears the system was already unstable at the start of the logs. Was there a network-level event leading up to this, or any nodes that restarted unexpectedly?

@slice-arpitkhatri (Author)

@neilalexander We didn't observe any network-level events. The nodes did restart due to health check failures. I've sent you another email containing additional logs from an hour before the instability occurred. Let me know if that helps or if you have any additional queries.

@levb (Contributor) commented Nov 6, 2024

I am going to try reproducing this from the MQTT side. The QoS2-on-JetStream implementation is quite resource-intensive (per sub and per message); this kind of volume might have introduced failures, ultimately blocking the I/O (readloop) while waiting for JS responses before acknowledging back to the MQTT clients, as required by the protocol.

@slice-arpitkhatri (Author)

@levb I've shared the config file with Neil. Let me know if you need any additional inputs for reproducing this. I can jump on a call as well if required.

@slice-arpitkhatri (Author) commented Nov 8, 2024

@levb @neilalexander My hunch is that the huge amount of RAFT sync required for R3 consumers might be causing the instability in the system. Even in a steady-state scenario we see ~2 million system messages per minute. Let me know your thoughts on this.

@derekcollison Do we have any plans to support R3 file streams with R1 memory consumers?

@derekcollison (Member)

That is supported today. Under the mqtt config section you have the following options to control consumers.

[screenshot: MQTT config section options for controlling consumers]

Config blocks just convert to snake_case, e.g. consumer_replicas = 1.
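
For example, a sketch of the mqtt block with those consumer overrides (assuming the options in the screenshot are consumer_replicas, consumer_memory_storage, and consumer_inactive_threshold; values below are illustrative):

```
mqtt {
    listen: 0.0.0.0:1883
    # Consumer overrides; whether consumer_replicas is honored for
    # interest/work-queue backed streams is what the rest of this thread discusses.
    consumer_replicas: 1
    consumer_memory_storage: true
    consumer_inactive_threshold: "5m"
}
```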

@slice-arpitkhatri (Author) commented Nov 8, 2024

@derekcollison
I believe the consumer_replicas setting under the MQTT config is currently not in use (the server ignores this config, see this), and that the consumer replicas are instead aligned with the parent stream's replicas for interest or work-queue streams (source).

Additionally, we have already set consumer_replicas to 1 in our production cluster, and I can see that the consumers still have a RAFT leader, which wouldn't be the case if this consumer replica override were functional.

Do we have plans to re-introduce this consumer replica override capability?

@derekcollison (Member)

It will work, but yes, if there are retention-based streams backing the MQTT stuff, the system will override it and force the peer sets to be the same.

Is this QoS 2?

@levb (Contributor) commented Nov 8, 2024

@derekcollison this ticket is, but @slice-arpitkhatri said they got into this state with QoS 1 as well:

[screenshot of the referenced report]

@slice-arpitkhatri (Author)

Yes, we have faced the issue with both QoS 1 and QoS 2.
