-
This discussion may be informative - #1537 (reply in thread) cc @mkuratczyk
-
Indeed, I can reproduce this with the default configuration and just 1 publisher, 1 quorum queue and 1 consumer, so I agree that we should change something here. The options are:
Personally, I think it's time we changed this default.

As for the alarms - they block all publishers, since there is no coupling between publishers and queues. With quorum queues, the queue is usually on all nodes anyway, but even with a classic queue - if the queue is on node-0 and that node runs out of memory, we have to block all publishers, because a publisher connected to node-1 may publish a message to a queue on node-0.

I think the whole alarm mechanism is something that will need to be reconsidered, to be honest. Again, the assumptions were correct years ago, but things change. When classic queues were the only game in town, things were relatively simple: block publishers, allow consumers to consume some of the messages, that releases the memory (or disk) and we can continue. However, a lot has changed since then.
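To make the cluster-wide blocking concrete, here is a rough way to observe it on an Operator-managed cluster. This is only a sketch: it assumes `kubectl` access and pods named `rabbitmq-server-0/1` (the Operator's usual `<name>-server-N` naming); the values are placeholders.

```shell
# Artificially lower the memory watermark on one node to force a memory alarm
kubectl exec rabbitmq-server-0 -- rabbitmqctl set_vm_memory_high_watermark 0.001

# List the resource alarms now in effect
kubectl exec rabbitmq-server-0 -- rabbitmq-diagnostics alarms

# Connections publishing via a *different* node should show up as blocked/blocking
kubectl exec rabbitmq-server-1 -- rabbitmqctl list_connections name state

# Restore the default relative watermark
kubectl exec rabbitmq-server-0 -- rabbitmqctl set_vm_memory_high_watermark 0.4
```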
So I think the ultimate solution is a change to the default alarm threshold in RabbitMQ. We could introduce some other mitigations sooner in the Operator, but I'm not 100% sold on whether that's worth it. For example - you mentioned you have a 3-node cluster; that's already a difference compared to the default single-node deployment. The same document that says 3 nodes should be used in production also talks about memory considerations: https://rabbitmq.com/production-checklist.html#resource-limits-ram

I'm open to counter-arguments, but I'm not sure about the value of introducing changes that will be relevant for just a few months and only for a small subset of users. And if we feel like we need a solution/workaround soon, I'd go with a slightly higher pod memory limit that we'll revert back to 2GB later on. Having the default configuration subtly different between Operator deployments and other deployments is confusing both for users (who find the generic docs) and for us, when people report issues.
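For anyone who wants the higher-memory workaround today without waiting for a default change, here is a minimal sketch of what that looks like on the RabbitmqCluster resource. The name and the 4Gi figure are placeholders I picked for illustration, not a recommendation:

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-cluster            # placeholder name
spec:
  replicas: 3
  resources:
    requests:
      memory: 4Gi             # illustrative value, up from the 2Gi default discussed above
    limits:
      memory: 4Gi             # keeping request == limit avoids surprises in memory accounting
```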
-
Describe the bug
We occasionally find in our GKE RabbitMQ cluster (3 nodes) that memory alarms get triggered, causing issues with publishes. Digging through the docs, I've learned that:
It seems like for a cluster using these defaults, a memory alarm will certainly be triggered at some point, even with just one quorum queue? There are reports of some folks seeing this issue unless they lower the default WAL size limit.
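For context on why this bites, here is my back-of-the-envelope reading of the defaults (these are my assumptions from the docs, not something I have confirmed): if the alarm threshold is computed against the Operator's 2GB container limit with the default relative watermark of 0.4, the alarm fires at roughly 0.4 × 2 GiB ≈ 820 MiB, while the quorum queue WAL alone is allowed to reach its 512 MiB default per node before rolling over, on top of the runtime's baseline usage. The mitigation those reports describe is shrinking the WAL, e.g. something like the sketch below in rabbitmq.conf (or via the Operator's `spec.rabbitmq.additionalConfig`); the 64 MiB value is only an illustration:

```ini
# rabbitmq.conf (sketch): cap the per-node quorum queue WAL at 64 MiB
# instead of the 512 MiB default, so a full WAL stays well under the
# memory alarm threshold. Illustrative value, not a recommendation.
raft.wal_max_size_bytes = 67108864
```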
I was also unclear on what happens when these alarms are set. The docs say that publishes are blocked, but is that for the offending node only or for the whole cluster? (There is a small sketch after my questions below for checking the per-node alarm flags.) I did find this in the AWS MQ docs:
So a couple questions:
Regardless, it seems to me that more sensible defaults could be configured here.
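As an aside, the way I've been checking which node actually has an alarm raised is the per-node alarm flags in the management API. This assumes the management plugin is reachable (e.g. via a port-forward); the host and credentials below are placeholders:

```shell
# Per-node alarm flags from the management API; each entry shows whether
# that node has a memory or disk alarm raised.
# (e.g. after `kubectl port-forward svc/<cluster-name> 15672` with your own credentials)
curl -s -u guest:guest http://localhost:15672/api/nodes \
  | jq '.[] | {name, mem_alarm, disk_free_alarm}'
```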
To Reproduce
Steps to reproduce the behavior:
Expected behavior
By default, I expect the quorum queue WAL size threshold and the cluster operator's memory requests to work with each other, so memory alarms aren't triggered by normal usage of quorum queues.
Screenshots
Version and environment information