active shards 1 crate #384
Conversation
chicco785 commented on Oct 21, 2020
- By enabling "write.wait_for_active_shards = 1", writes are also possible when table health is not 100% (see the sketch below).
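For illustration, this is roughly what the setting boils down to at the SQL level. A minimal sketch using the Python crate client; the connection URL, table name and columns are placeholders, not QuantumLeap's actual schema:

```python
# Sketch only: applying "write.wait_for_active_shards" through the Python
# `crate` client. Connection URL, table name and columns are hypothetical.
from crate import client

conn = client.connect("http://localhost:4200")
cursor = conn.cursor()

# Lower the write barrier on an existing table: a write only needs the
# primary shard to be active; replicas may lag or be unassigned.
cursor.execute(
    '''ALTER TABLE doc.etentity SET ("write.wait_for_active_shards" = 1)'''
)

# Or bake the setting in at table-creation time.
cursor.execute(
    '''
    CREATE TABLE IF NOT EXISTS doc.etentity (
        entity_id   TEXT,
        time_index  TIMESTAMP WITH TIME ZONE,
        temperature DOUBLE PRECISION
    ) WITH ("write.wait_for_active_shards" = 1)
    '''
)
```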
As we discussed this morning, this could have the potential downside of data loss, which isn't a big deal in my opinion since losing a few data points in a series of thousands isn't a tragedy. So we're trading resiliency for better throughput: writes only wait for the primary shard, not for the replicas. Makes sense to me. But going over the docs again, my understanding is that if the node holding a primary shard then goes down, a replica that has fallen out of sync won't be promoted automatically, so the table would effectively become read-only for that shard.
The only way out of the impasse would be to manually force replica promotion. I can't be 100% sure of this since the docs are ambiguous, to say the least. But perhaps we should test to see if this is a likely scenario; better safe than sorry kind of thing.
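To make the "manually force replica promotion" part concrete, here is a sketch of what I think that would involve, assuming CrateDB's ALTER TABLE ... REROUTE PROMOTE REPLICA statement; the table, shard id and node name below are made up:

```python
# Sketch only: inspecting shard placement and promoting a surviving replica.
# Table, shard id and node name are hypothetical placeholders.
from crate import client

conn = client.connect("http://localhost:4200")
cursor = conn.cursor()

# Where do the copies of this table's shards currently live?
cursor.execute(
    """
    SELECT id, "primary", routing_state, node['name'] AS node_name
    FROM sys.shards
    WHERE schema_name = 'doc' AND table_name = 'etentity'
    ORDER BY id
    """
)
for shard_id, is_primary, routing_state, node_name in cursor.fetchall():
    print(shard_id, is_primary, routing_state, node_name)

# Promote the surviving (possibly stale) replica of shard 0 to primary,
# explicitly accepting that writes the lost primary never replicated are gone.
cursor.execute(
    """
    ALTER TABLE doc.etentity
    REROUTE PROMOTE REPLICA SHARD 0 ON 'crate-node-2'
    WITH ("accept_data_loss" = TRUE)
    """
)
```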
CLA Assistant Lite bot: All contributors have signed the CLA ✍️
I have read the CLA Document and I hereby sign the CLA
recheckcla
Dear @chicco785 and @c0c0n3, thanks for exposing the wait_for_active_shards setting. Thanks in advance! With kind regards,
good catch!
@amotl, while you're at it, do you reckon the scenario detailed in the above comment is possible/likely? Or is it just me not understanding how Crate actually works under the bonnet? Thanks!! :-)
Dear @c0c0n3, thanks for asking, we appreciate it. However, I have to admit I am personally not that much into the specific details about replication settings yet and want to humbly ask @seut or @mfussenegger to review this in order to clarify any ambiguities. In general, I believe @chicco785 is right: using "write.wait_for_active_shards = 1" is a reasonable trade-off here.

Please also note that in order to improve the out-of-the-box experience by allowing a subset of nodes to become unavailable without blocking write operations, CrateDB nowadays uses "write.wait_for_active_shards = 1" as the default [1]. The problem we and our customers faced is that with the former default of "ALL", write operations would already be rejected when a single shard copy was unavailable.

The procedure to restart a cluster, for example in upgrade scenarios, also resonates with the graceful stop procedure: apparently not many users are aware of it and will just restart the nodes without going through a graceful stop. This problem also manifests more within scenarios where CrateDB is running on Kubernetes clusters, where users might just kill pods when performing a rolling upgrade.

So, this topic is overall related to the balancing act of trading performance vs. safety vs. high availability. Saying that, the premium option of a graceful node shutdown is available to mitigate this. I hope @seut or @mfussenegger can elaborate on this topic a bit more and fill in some gaps I might have skipped or correct any false claims.

With kind regards,

[1] https://crate.io/docs/crate/reference/en/4.3/sql/statements/create-table.html#write-wait-for-active-shards
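For reference, a sketch of how those graceful stop settings and a node decommission can be driven through SQL, using the Python crate client purely for illustration; the node name is a placeholder and the settings are applied transiently:

```python
# Sketch only: graceful-stop cluster settings plus a node decommission.
# The node name is a hypothetical placeholder.
from crate import client

conn = client.connect("http://localhost:4200")
cursor = conn.cursor()

# Require at least all primaries ('primaries') or all shard copies ('full')
# to be relocated off the node before it is allowed to leave the cluster.
cursor.execute(
    "SET GLOBAL TRANSIENT cluster.graceful_stop.min_availability = 'primaries'"
)
# Give the relocation up to two hours, and don't force-stop if it times out.
cursor.execute("SET GLOBAL TRANSIENT cluster.graceful_stop.timeout = '2h'")
cursor.execute("SET GLOBAL TRANSIENT cluster.graceful_stop.force = false")

# Gracefully take the node out of the cluster before stopping its process.
cursor.execute("ALTER CLUSTER DECOMMISSION 'crate-node-2'")
```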
@amotl, wow, thanks so much for the detailed answer!

wait_for_active_shards: resiliency vs throughput
Yep, that's pretty much what we experienced in our prod clusters. For what it's worth, we also think it's a reasonable trade-off; see comments above. @seut, @mfussenegger, one thing we'd like to understand though is whether the scenario I described earlier can actually happen:

"The only way out of the impasse would be to manually force replica promotion."

Is this scenario possible, or is it just me not understanding how Crate actually works under the bonnet? If the scenario is possible, then is "write.wait_for_active_shards = 1" still a safe choice? The reason why I'm asking is that we'd like to keep "write.wait_for_active_shards = 1" in our deployment.

Graceful shutdown
Count me in, I wasn't aware of that either. Thank you sooo much for bringing it up. I'd imagine this could also be a good mitigation procedure for minimising the likelihood of a table transitioning to read-only, assuming the scenario I talked about earlier is actually possible. Either way, we'll definitely play around with this setting to make sure nodes shut down gracefully as much as feasible within the constraints of our deployment.
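Concretely, this is the kind of health check we have in mind around node restarts; a sketch only, assuming CrateDB's sys.health table, with a placeholder connection URL:

```python
# Sketch only: report table health before/after restarting a node, to catch
# tables that dropped below 100% (YELLOW) or lost primaries entirely (RED).
from crate import client

conn = client.connect("http://localhost:4200")
cursor = conn.cursor()

cursor.execute(
    """
    SELECT table_schema, table_name, health,
           underreplicated_shards, missing_shards
    FROM sys.health
    ORDER BY severity DESC
    """
)
for schema, table, health, underreplicated, missing in cursor.fetchall():
    # GREEN: all shard copies active. YELLOW: replicas missing, but writes
    # still succeed with wait_for_active_shards = 1. RED: primaries missing.
    print(f"{schema}.{table}: {health} "
          f"(underreplicated={underreplicated}, missing={missing})")
```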
Guilty as charged! This is exactly what we do. Your documentation is actually pretty clear about how to upgrade; it's just that we haven't come up with an easy way to automate that in a K8s deployment yet.
I wish that came for free :-) Just joking, I appreciate you guys have to make a living too :-)
Dear @c0c0n3,
Sorry about my wording here. I meant it to be called "the most safe option". This feature is well provided by the community edition as well, AFAIK. Thanks for giving me the chance to clarify. With kind regards,
That's great news, thanks for clarifying, we'll definitely look into that!