Implement Circuit Breaker Pattern to Protect PD Leader #8678
Comments
@niubell PTAL
How about TiKV or other services? Do we also need to consider them together?
@siddontang I agree. Given that a single key can be served by only one TiKV, many TiDB servers can end up hammering a single TiKV server, so this looks like a similar problem to the one with PD.
Stage 1: Introduce circuit breaker for go client
Further details can be fleshed out when we implement it.
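As a rough illustration of what Stage 1 could look like, here is a minimal sketch of a three-state breaker (closed / open / half-open) that a Go client could wrap its PD calls with. All names, thresholds, and the package layout are assumptions for illustration, not the actual client-go or PD client implementation:

```go
package circuitbreaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is open and calls are being shed.
var ErrOpen = errors.New("circuit breaker is open")

type state int

const (
	closed   state = iota // requests pass through, failures are counted
	open                  // requests fail fast until the cooldown expires
	halfOpen              // a probe request is allowed through to test recovery
)

// Breaker is a minimal three-state circuit breaker.
type Breaker struct {
	mu          sync.Mutex
	state       state
	failures    int
	maxFailures int           // consecutive failures before tripping
	cooldown    time.Duration // how long to stay open before probing
	openedAt    time.Time
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do executes fn if the breaker allows it, and records the outcome.
func (b *Breaker) Do(fn func() error) error {
	if !b.allow() {
		return ErrOpen
	}
	err := fn()
	b.record(err)
	return err
}

func (b *Breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == open {
		if time.Since(b.openedAt) < b.cooldown {
			return false
		}
		// Cooldown expired: let a probe request through.
		b.state = halfOpen
	}
	return true
}

func (b *Breaker) record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		// Success closes the breaker and resets the failure count.
		b.state = closed
		b.failures = 0
		return
	}
	b.failures++
	if b.state == halfOpen || b.failures >= b.maxFailures {
		b.state = open
		b.openedAt = time.Now()
	}
}
```

A client wrapper could route each PD RPC through `Do`, so that once the breaker opens, retries fail fast locally instead of reaching the PD leader.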
@siddontang Yes, the circuit-breaker pattern also applies to client-go (TiKV). This RFC focuses only on the high-QPS domains/APIs in PD; other components can reuse some of the circuit-breaker base libraries. cc @Tema
@zhangjinpeng87 Thanks for the reminder. We discussed the details several times online before the formal RFC. LGTM
Feature Request
Describe your feature request related problem
In large TiDB clusters with hundreds of TiDB and TiKV nodes, the PD leader can become overwhelmed during certain failure conditions, leading to a "retry storm" or other feedback-loop scenarios. Once triggered, PD transitions into a metastable state and cannot recover autonomously, leaving the cluster degraded or unavailable. Existing mechanisms, such as those discussed in Issue #4480 and PR #6834, introduce rate-limiting and backoff strategies, but they are insufficient to prevent PD from being overloaded by a high volume of traffic even before the server-side limits are reached.
Describe the feature you'd like
I propose implementing a circuit breaker pattern to protect the PD leader from overload caused by retry storms or similar feedback loops. This circuit breaker would trip proactively, before PD becomes overwhelmed, and shed client traffic until the leader has a chance to recover.
This feature is especially critical for large clusters where a high number of pods can continuously hammer a single PD instance during failures, leading to cascading effects that worsen recovery times and overall cluster health.
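To make the intended behavior concrete, here is a hedged usage sketch that builds on the breaker type sketched in the comment above. `pdRPC`, the import path, and the thresholds are hypothetical stand-ins, not an existing PD client API:

```go
package main

import (
	"context"
	"fmt"
	"time"

	cb "example.com/pd/client/circuitbreaker" // hypothetical import path for the breaker sketched above
)

// pdRPC stands in for a real PD call such as fetching region metadata.
func pdRPC(ctx context.Context) error {
	// ... issue the gRPC request to the PD leader here ...
	return fmt.Errorf("pd leader overloaded") // simulate a failing leader
}

func main() {
	ctx := context.Background()
	// Trip after 5 consecutive failures; stay open for 10s before probing again.
	breaker := cb.New(5, 10*time.Second)

	for i := 0; i < 8; i++ {
		err := breaker.Do(func() error { return pdRPC(ctx) })
		if err == cb.ErrOpen {
			// The breaker is open: fail fast locally, no traffic reaches PD.
			fmt.Println("request shed by circuit breaker")
			continue
		}
		fmt.Println("pd call result:", err)
	}
}
```

During a retry storm, the first few failures trip the breaker; subsequent retries return immediately on the client side instead of adding load to the already struggling PD leader.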
Describe alternatives you've considered
While existing solutions like rate-limiting in Issue #4480 and PR #6834 provide some protection, they are reactive and dependent on the server-side limiter thresholds being hit. These protections do not adequately account for sudden traffic spikes or complex feedback loops that can overload PD before those thresholds are reached. A proactive circuit breaker would mitigate these scenarios by preemptively tripping before PD becomes overwhelmed, ensuring a smoother recovery process.
Teachability, Documentation, Adoption, Migration Strategy
Introducing the circuit breaker pattern would likely require adjustments in the client request logic across TiDB, TiKV, TiFlash, CDC, and PD-ctl components. This feature could be made configurable, allowing users to set custom thresholds and recovery parameters to fit their specific cluster sizes and workloads.
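As one possible shape for such configuration, the sketch below lists the kinds of knobs a user might tune. Every field name is illustrative; the real configuration surface would be decided in the RFC:

```go
package circuitbreaker

import "time"

// Settings groups the tunables a user might adjust per cluster size and workload.
// All names are illustrative, not an existing PD or client-go API.
type Settings struct {
	// ErrorRateThresholdPct trips the breaker once the observed error rate
	// over ObservationWindow exceeds this percentage.
	ErrorRateThresholdPct uint32

	// MinQPSForOpen avoids tripping on low-traffic noise: the breaker only
	// opens if at least this many requests were observed in the window.
	MinQPSForOpen uint32

	// ObservationWindow is the sliding window over which errors are counted.
	ObservationWindow time.Duration

	// CoolDownInterval is how long the breaker stays open before allowing probes.
	CoolDownInterval time.Duration

	// HalfOpenSuccessCount is the number of consecutive successful probes
	// required to close the breaker again.
	HalfOpenSuccessCount uint32
}
```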
Documentation would need to include clear guidelines on configuring thresholds and recovery parameters, and on how the circuit breaker interacts with the existing rate-limiting and backoff mechanisms.
Scenarios where this feature could be helpful include retry storms after a PD leader failure, metastable overload states that PD cannot recover from on its own, and large clusters where many clients continuously hammer a single PD instance during failures.