From 176f8cbec5039cf4fb1098e1fad5c0976c85ee2e Mon Sep 17 00:00:00 2001 From: Matt Linville Date: Wed, 30 Oct 2024 10:12:42 -0700 Subject: [PATCH] [DOC-11431] Document admission control for snapshot ingestion --- src/current/v24.3/admission-control.md | 10 ++++++---- src/current/v24.3/architecture/replication-layer.md | 6 ++++-- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/src/current/v24.3/admission-control.md b/src/current/v24.3/admission-control.md index 8c19c34be60..064d666ed40 100644 --- a/src/current/v24.3/admission-control.md +++ b/src/current/v24.3/admission-control.md @@ -44,15 +44,16 @@ Almost all database operations that use CPU or perform storage IO are controlled - [General SQL queries]({% link {{ page.version.version }}/selection-queries.md %}) have their CPU usage subject to admission control, as well as storage IO for writes to [leaseholder replicas]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leases). - [Bulk data imports]({% link {{ page.version.version }}/import-into.md %}). -- [Backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}). -- [Schema changes]({% link {{ page.version.version }}/online-schema-changes.md %}), including index and column backfills (on both the [leaseholder replica]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leases) and [follower replicas]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft)). - [`COPY`]({% link {{ page.version.version }}/copy-from.md %}) statements. - [Deletes]({% link {{ page.version.version }}/delete-data.md %}) (including deletes initiated by [row-level TTL jobs]({% link {{ page.version.version }}/row-level-ttl.md %}); the [selection queries]({% link {{ page.version.version }}/selection-queries.md %}) performed by TTL jobs are also subject to CPU admission control). +- [Backups]({% link {{ page.version.version }}/backup-and-restore-overview.md %}). 
+- [Schema changes]({% link {{ page.version.version }}/online-schema-changes.md %}), including index and column backfills (on both the [leaseholder replica]({% link {{ page.version.version }}/architecture/replication-layer.md %}#leases) and [follower replicas]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft)). - [Follower replication work]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft). - [Raft log entries being written to disk]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft). - [Changefeeds]({% link {{ page.version.version }}/create-and-configure-changefeeds.md %}). - [Intent resolution]({% link {{ page.version.version }}/architecture/transaction-layer.md %}#write-intents). - +- {% include_cached new-in.html version="v24.3" %} [Snapshot transfers]({% link {{ page.version.version }}/architecture/replication-layer.md %}#snapshots) onto a node with a [provisioned rate]({% link {{ page.version.version }}/cockroach-start.md %}#store) configured for its store, based on disk bandwidth, to reduce the impact on foreground workloads on the node. Admission control for snapshot transfers is disabled by default. To learn more, refer to [Snapshots]({% link {{ page.version.version }}/architecture/replication-layer.md %}#snapshots). +- The following operations are not subject to admission control: - SQL writes are not subject to admission control on [follower replicas]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) by default, unless those writes occur in transactions that are subject to a Quality of Service (QoS) level as described in [Set quality of service level for a session](#set-quality-of-service-level-for-a-session). In order for writes on follower replicas to be subject to admission control, the setting `default_transaction_quality_of_service=background` must be used. @@ -68,6 +69,7 @@ Admission control is enabled by default. 
To enable or disable admission control, - `admission.kv.enabled` for work performed by the [KV layer]({% link {{ page.version.version }}/architecture/distribution-layer.md %}). - `admission.sql_kv_response.enabled` for work performed in the SQL layer when receiving [KV responses]({% link {{ page.version.version }}/architecture/distribution-layer.md %}). - `admission.sql_sql_response.enabled` for work performed in the SQL layer when receiving [DistSQL responses]({% link {{ page.version.version }}/architecture/sql-layer.md %}#distsql). +- {% include_cached new-in.html version="v24.3" %} `kvadmission.store.snapshot_ingest_bandwidth_control.enabled` to optionally limit the disk impact of ingesting snapshots on a node. When you enable or disable admission control settings for one layer, Cockroach Labs recommends that you enable or disable them for **all layers**. @@ -134,7 +136,7 @@ COMMIT; ## Considerations -[Client connections]({% link {{ page.version.version }}/connection-parameters.md %}) are not managed by the admission control subsystem. Too many connections per [gateway node]({% link {{ page.version.version }}/architecture/sql-layer.md %}#gateway-node) can also lead to cluster overload. +[Client connections]({% link {{ page.version.version }}/connection-parameters.md %}) are not managed by the admission control subsystem. Too many connections per [gateway node]({% link {{ page.version.version }}/architecture/sql-layer.md %}#gateway-node) can also lead to cluster overload. 
{% include {{page.version.version}}/sql/server-side-connection-limit.md %} diff --git a/src/current/v24.3/architecture/replication-layer.md b/src/current/v24.3/architecture/replication-layer.md index 137acbfd78c..bc9ad53d3c7 100644 --- a/src/current/v24.3/architecture/replication-layer.md +++ b/src/current/v24.3/architecture/replication-layer.md @@ -72,13 +72,13 @@ Non-voting replicas can be configured via [zone configurations through `num_vote ##### Overview -When individual [ranges]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. +When individual [ranges]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-range) become temporarily unavailable, requests to those ranges are refused by a per-replica "circuit breaker" mechanism instead of hanging indefinitely. From a user's perspective, this means that if a [SQL query]({% link {{ page.version.version }}/architecture/sql-layer.md %}) is going to ultimately fail due to accessing a temporarily unavailable range, a [replica]({% link {{ page.version.version }}/architecture/overview.md %}#architecture-replica) in that range will trip its circuit breaker (after 60 seconds [by default](#per-replica-circuit-breaker-timeout)) and bubble a `ReplicaUnavailableError` error back up through the system to inform the user why their query did not succeed. These (hopefully transient) errors are also signalled as events in the DB Console's [Replication Dashboard]({% link {{ page.version.version }}/ui-replication-dashboard.md %}) and as "circuit breaker errors" in its [**Problem Ranges** and **Range Status** pages]({% link {{ page.version.version }}/ui-debug-pages.md %}). Meanwhile, CockroachDB continues asynchronously probing the range's availability. 
If the replica becomes available again, the breaker is reset so that it can go back to serving requests normally. This feature is designed to increase the availability of your CockroachDB clusters by making them more robust to transient errors. -For more information about per-replica circuit breaker events happening on your cluster, see the following pages in the [DB Console]({% link {{ page.version.version }}/ui-overview.md %}): +For more information about per-replica circuit breaker events happening on your cluster, see the following pages in the [DB Console]({% link {{ page.version.version }}/ui-overview.md %}): - The [**Replication** dashboard]({% link {{ page.version.version }}/ui-replication-dashboard.md %}). - The [**Advanced Debug** page]({% link {{ page.version.version }}/ui-debug-pages.md %}). From there you can view the **Problem Ranges** page, which lists the range replicas whose circuit breakers were tripped. You can also view the **Range Status** page, which displays the circuit breaker error message for a given range. @@ -116,6 +116,8 @@ Sending data locally using delegated snapshots has the following benefits: Delegated snapshots are managed automatically by the cluster with no need for user involvement. +{% include_cached new-in.html version="v24.3" %} To limit the impact of snapshot ingestion on a node with a [provisioned rate]({% link {{ page.version.version }}/cockroach-start.md %}#store) configured for its store, you can enable [admission control]({% link {{ page.version.version }}/admission-control.md %}) for snapshot transfers, based on disk bandwidth. This allows you to limit the disk impact on foreground workloads on the node. Admission control for snapshot transfers is disabled by default; to enable it, set the [cluster setting]({% link {{ page.version.version }}/cluster-settings.md %}) `kvadmission.store.snapshot_ingest_bandwidth_control.enabled` to `true`.
The histogram [metric]({% link {{ page.version.version }}/metrics.md %}) `admission.wait_durations.snapshot_ingest` allows you to observe the wait times for snapshots that were impacted by admission control. + ### Leases A single node in the Raft group acts as the leaseholder, which is the only node that can serve reads or propose writes to the Raft group leader (both actions are received as `BatchRequests` from [`DistSender`]({% link {{ page.version.version }}/architecture/distribution-layer.md %}#distsender)).
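Reviewer note: the Snapshots paragraph added in this patch says to set `kvadmission.store.snapshot_ingest_bandwidth_control.enabled` to `true`, but does not show the statement. A minimal sketch of what that could look like follows. It assumes a SQL session with permission to change cluster settings and a store that already has a provisioned rate configured; neither assumption is verified by this patch.

```sql
-- Sketch only: enable admission control for snapshot ingestion.
-- The setting name is taken from the patch; it is disabled by default.
SET CLUSTER SETTING kvadmission.store.snapshot_ingest_bandwidth_control.enabled = true;

-- Confirm the current value of the setting.
SHOW CLUSTER SETTING kvadmission.store.snapshot_ingest_bandwidth_control.enabled;
```

If the docs adopt an example like this, it would likely belong directly under the new Snapshots paragraph rather than in the admission control settings list.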