active shards 1 crate #384

Merged: 4 commits merged into master from active-shards-1-crate on Dec 1, 2020

Conversation

chicco785
Contributor

  • By enabling "write.wait_for_active_shards = 1", writes are also possible when table health is not 100%. A rough sketch of what this looks like at the table level is below.
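
Just to illustrate the setting itself (this is not the actual DDL the code generates; table and column names below are made up):

```sql
-- Illustration only: table/column names are examples, not QuantumLeap's real schema.
CREATE TABLE IF NOT EXISTS doc.etentity (
    entity_id  TEXT,
    time_index TIMESTAMP WITH TIME ZONE
) CLUSTERED INTO 4 SHARDS
WITH (
    number_of_replicas = 1,
    -- proceed with a write as soon as the primary shard copy is active,
    -- instead of also waiting for the replica copies
    "write.wait_for_active_shards" = 1
);
```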

@c0c0n3
Member

c0c0n3 commented Oct 27, 2020

As we discussed this morning, this could have the potential downside of data loss---which isn't a big deal in my opinion since losing a few data points in a series of thousands isn't a tragedy. So we're trading resiliency for better throughput---writes only wait for the primary shard, not for the replicas. Makes sense to me.

But going over the docs again, my understanding is that wait_for_active_shards = 1 could make this scenario more likely:

  1. A node N1 holds a primary shard S with records r[1] to r[m + n].
  2. Another node N2 holds S's replica shard, R, with records r[1] to r[m], i.e. n records haven't been replicated yet.
  3. N1 goes down.
  4. Crate won't promote N2 as primary since it knows R is stale w/r/t S.

The only way out of the impasse would be to manually force replica promotion.

I can't be 100% sure of this since the docs are ambiguous, to say the least. But perhaps we should test to see whether this is a likely scenario---better safe than sorry kind of thing...
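
While testing, one thing we could do is look at the shard copies directly to see how Crate reacts when a node drops out. Something along these lines, I guess (the table name is just an example):

```sql
-- Sketch: inspect the primary and replica copies of a table's shards.
-- 'etentity' is an example table name.
SELECT schema_name, table_name, id, "primary", state, routing_state
FROM sys.shards
WHERE table_name = 'etentity'
ORDER BY id, "primary" DESC;
```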

@chicco785 chicco785 force-pushed the active-shards-1-crate branch from bb92432 to 91cba09 on November 12, 2020 11:09
@chicco785 chicco785 force-pushed the active-shards-1-crate branch from 91cba09 to be10579 on November 19, 2020 17:03
@chicco785 chicco785 changed the title from "WIP: active shards 1 crate" to "active shards 1 crate" on Nov 19, 2020
@chicco785 chicco785 force-pushed the active-shards-1-crate branch from be10579 to f433278 on November 20, 2020 09:31
@github-actions
Contributor

github-actions bot commented Nov 20, 2020

CLA Assistant Lite bot: All contributors have signed the CLA ✍️

@chicco785
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@chicco785
Contributor Author

recheckcla

…tive_shards = 1" writes are possible also when table health is not 100%

* use variable to set values for wait_for_active_shards
@chicco785 chicco785 force-pushed the active-shards-1-crate branch from 666428f to 1197ec9 on November 20, 2020 10:23
@amotl

amotl commented Nov 20, 2020

Dear @chicco785 and @c0c0n3,

thanks for exposing the wait_for_active_shards setting to users of ngsi-timeseries-api. May I humbly suggest updating the wording from CRATE_WAIT_ACTIV_SHARDS to CRATE_WAIT_ACTIVE_SHARDS? There's an "e" missing.

Thanks in advance!

With kind regards,
Andreas.

@chicco785
Contributor Author

thanks for exposing the wait_for_active_shards setting to users of ngsi-timeseries-api. May I humbly suggest updating the wording from CRATE_WAIT_ACTIV_SHARDS to CRATE_WAIT_ACTIVE_SHARDS? There's an "e" missing.

good catch!

@c0c0n3
Member

c0c0n3 commented Nov 20, 2020

@amotl while you're at it, do you reckon the scenario detailed in the above comment is possible/likely? Or is it just me not understanding how Crate actually works under the bonnet? Thanks!! :-)

@chicco785 chicco785 requested a review from c0c0n3 November 23, 2020 08:23
@amotl

amotl commented Nov 24, 2020

Dear @c0c0n3,

thanks for asking, we appreciate it. However, I have to admit I am personally not yet that deep into the specific details of the replication settings, so I would humbly ask @seut or @mfussenegger to review this and clarify any ambiguities.

In general, I believe @chicco785 is right: using wait_for_active_shards = 1 instead of wait_for_active_shards = all will require only one shard copy to be active for write operations to proceed [1]. Otherwise, the [write] operation gets blocked and has to wait and retry for up to 30s before eventually timing out. CrateDB introduced that option with version 2.0.1 [2].

Please also note that, in order to improve the out-of-the-box experience by allowing a subset of nodes to become unavailable without blocking write operations, CrateDB uses wait_for_active_shards = 1 as the default setting beginning with CrateDB 4.1.0 [3]. This decreases write resiliency in favor of write availability and was changed through crate/crate#8876.
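
For older CrateDB versions, or to pin the behaviour explicitly per table, the option can also be changed at runtime, roughly like this (the table name is just an example):

```sql
-- Example table name; adjust to your schema.
ALTER TABLE doc.etentity SET ("write.wait_for_active_shards" = 1);

-- Back to the stricter behaviour, or back to the version default:
ALTER TABLE doc.etentity SET ("write.wait_for_active_shards" = 'all');
ALTER TABLE doc.etentity RESET ("write.wait_for_active_shards");
```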

The problem we and our customers faced is that with the former default of wait_for_active_shards = all, restarting a single node basically blocks writes until the whole cluster has recovered again. We believed the extra safety of all is a bad trade-off here: with inserts blocking until the cluster has recovered (and eventually being dropped if that takes too long), you are more likely to lose data.

The procedure for restarting a cluster, for example in upgrade scenarios, also ties in with the cluster.graceful_stop.min_availability setting, which defaults to primaries [4]. While we state in our documentation that

By default, when the CrateDB process stops it simply shuts down, possibly making some shards unavailable which leads to a red cluster state and lets some queries fail that required the now unavailable shards. In order to safely shutdown a CrateDB node, the graceful stop procedure can be used.

apparently not many users are aware of that and will just restart the nodes without using the graceful stop procedure. This problem manifests even more in scenarios where CrateDB runs on Kubernetes clusters and users might just kill pods when performing a rolling upgrade.

So, this topic is overall related to the balancing act of trading performance vs. safety vs. high availability.

That said, the premium option of cluster.graceful_stop.min_availability = full will always keep the database cluster in a green state, while increasing the probability of replica resynchronization and, in turn, inter-node traffic. Also, when using the graceful stop procedure, a rolling upgrade will take longer because all active operations will be fulfilled before shutting down the node instance. This impact obviously increases with the amount of data that has to be shuffled around. However, this is the designated "paranoia" option and, as said, will always keep the whole cluster in a green state.
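
For completeness, these are runtime cluster settings, so one can experiment with them without restarting anything; for example (the values here are just illustrations):

```sql
-- The "paranoia" option: keep all shard copies available during a graceful stop.
SET GLOBAL PERSISTENT "cluster.graceful_stop.min_availability" = 'full';

-- Related knobs: how long to wait for shards to be moved off the node,
-- and whether to force the shutdown once that timeout is hit (example values).
SET GLOBAL PERSISTENT "cluster.graceful_stop.timeout" = '2h';
SET GLOBAL PERSISTENT "cluster.graceful_stop.force" = false;
```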

I hope @seut or @mfussenegger can elaborate on this topic a bit more and fill in some gaps I might have skipped or correct any false claims.

With kind regards,
Andreas.

[1] https://crate.io/docs/crate/reference/en/4.3/sql/statements/create-table.html#write-wait-for-active-shards
[2] https://crate.io/docs/crate/reference/en/4.3/appendices/release-notes/2.0.1.html#changes
[3] https://crate.io/docs/crate/reference/en/4.3/appendices/release-notes/4.1.0.html#others
[4] https://crate.io/docs/crate/reference/en/4.3/config/cluster.html#graceful-stop

@c0c0n3
Member

c0c0n3 commented Nov 25, 2020

@amotl, wow, thanks so much for the detailed answer!

wait_for_active_shards: resiliency vs throughput

wait_for_active_shards ... trade-off ... with inserts blocking until the cluster has recovered (and eventually being dropped if that takes too long), you are more likely to lose data.

Yep, that's pretty much what we experienced in our prod clusters. For what it's worth, we also think it's a reasonable trade-off---see comments above.

@seut, @mfussenegger one thing we'd like to understand though is if wait_for_active_shards = 1 could result in a situation where Crate won't accept any further writes to a table and the only way out of that is to manually alter the table settings. Here's the scenario:

  1. A node N1 holds a primary shard S with records r[1] to r[m + n].
  2. Another node N2 holds S's replica shard, R, with records r[1] to r[m], i.e. n records haven't been replicated yet.
  3. N1 goes down.
  4. Crate won't promote N2 as primary since it knows R is stale w/r/t S.

The only way out of the impasse would be to manually force replica promotion.

Is this scenario possible, or is it just me not understanding how Crate actually works under the bonnet? If the scenario is possible, is wait_for_active_shards = 1 going to make it more likely if you have lots of nodes, some of which crash regularly, e.g. an AWS node going down? Also, in that case, is there any easy way to get notified of a table stuck in read-only mode?

The reason I'm asking is that we'd like to keep wait_for_active_shards = 1 in prod, but if a table gets stuck we need a way to find out about it as it happens, since some of our tables are pretty busy and even a few hours of downtime could result in losing several thousand data points coming in from our client's devices.
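
What we have in mind on the monitoring side is roughly a periodic check like the one below, wired into our alerting (just a sketch; we'd still have to verify it actually catches the stuck-table case):

```sql
-- Sketch: list tables whose health is degraded, worst first.
SELECT table_schema, table_name, health, severity,
       missing_shards, underreplicated_shards
FROM sys.health
WHERE health <> 'GREEN'
ORDER BY severity DESC;
```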

Graceful shutdown

cluster.graceful_stop.min_availability ... apparently not many users are aware of that and will just restart the nodes without using the graceful stop procedure.

Count me in, I wasn't aware of that either. Thank you sooo much for bringing it up. I'd imagine this could also be a good mitigation procedure for minimising the likelihood of a table transitioning to read-only---assuming the scenario I talked about earlier is actually possible. Either way, we'll definitely play around with this setting to make sure nodes shut down gracefully as much as feasible within the constraints of our deployment.

This problem manifests even more in scenarios where CrateDB runs on Kubernetes clusters and users might just kill pods when performing a rolling upgrade.

Guilty as charged! This is exactly what we do. Your documentation is actually pretty clear about how to upgrade; it's just that we haven't come up with an easy way to automate that in a K8s deployment yet.

premium option of cluster.graceful_stop.min_availability = full will always keep the database cluster in a green state

I wish that came for free :-) Just joking, I appreciate you guys have to make a living too :-)

@amotl

amotl commented Nov 25, 2020

Dear @c0c0n3,

premium option of cluster.graceful_stop.min_availability = full will always keep the database cluster in a green state

I wish that came for free :-) Just joking, I appreciate you guys have to make a living too :-)

Sorry about my wording here. I meant to call it "the safest option". This feature is provided by the community edition as well, AFAIK. Thanks for giving me the chance to clarify.

With kind regards,
Andreas.

@c0c0n3
Member

c0c0n3 commented Nov 25, 2020

@amotl

This feature is provided by the community edition as well, AFAIK

That's great news, thanks for clarifying, we'll definitely look into that!

@chicco785 chicco785 merged commit 970898e into master Dec 1, 2020
@chicco785 chicco785 deleted the active-shards-1-crate branch December 1, 2020 22:53
@github-actions github-actions bot locked and limited conversation to collaborators Dec 1, 2020