diff --git a/src/current/_data/metrics/child-metrics.yml b/src/current/_data/metrics/child-metrics.yml
index e6393213331..afe92da0139 100644
--- a/src/current/_data/metrics/child-metrics.yml
+++ b/src/current/_data/metrics/child-metrics.yml
@@ -233,4 +233,20 @@
   feature: all
 
 - child_metric_id: rpc.connection.avg_round_trip_latency
-  feature: all
\ No newline at end of file
+  feature: all
+
+- child_metric_id: logical_replication.catchup_ranges_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.events_dlqed_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.events_ingested_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.replicated_time_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.scanning_ranges_by_label
+  feature: ldr
+
diff --git a/src/current/_data/metrics/metrics-list.csv b/src/current/_data/metrics/metrics-list.csv
index 6b195f89ac2..0792493fc70 100644
--- a/src/current/_data/metrics/metrics-list.csv
+++ b/src/current/_data/metrics/metrics-list.csv
@@ -978,6 +978,8 @@
 A likely cause of having a checkpoint is that one of the ranges in this store
 had inconsistent data among its replicas. Such checkpoint directories are
 located in auxiliary/checkpoints/rN_at_M, where N is the range ID, and M is
 the Raft applied index at which this checkpoint was taken.",Directories,GAUGE,COUNT,AVG,NONE
+STORAGE,storage.compactions.cancelled.bytes,Cumulative volume of data written to sstables during compactions that were ultimately cancelled due to a conflicting operation.,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
+STORAGE,storage.compactions.cancelled.count,Cumulative count of compactions that were cancelled before they completed due to a conflicting operation.,Compactions,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 STORAGE,storage.compactions.duration,"Cumulative sum of all compaction durations.
 The rate of this value provides the effective compaction concurrency of a store,
@@ -1308,6 +1310,7 @@ APPLICATION,distsender.batch_responses.cross_zone.bytes,"Total byte count of rep
 monitor the data transmitted.",Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batch_responses.replica_addressed.bytes,Total byte count of replica-addressed batch responses received,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batches,Number of batches processed,Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,distsender.batches.async.in_progress,Number of partial batches currently being executed asynchronously,Partial Batches,GAUGE,COUNT,AVG,NONE
 APPLICATION,distsender.batches.async.sent,Number of partial batches sent asynchronously,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batches.async.throttled,Number of partial batches not sent asynchronously due to throttling,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batches.partial,Number of partial batches processed after being divided on range boundaries,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
@@ -2274,15 +2277,17 @@ APPLICATION,kv.protectedts.reconciliation.num_runs,number of successful reconcil
 APPLICATION,kv.protectedts.reconciliation.records_processed,number of records processed without error during reconciliation on this node,Count,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,kv.protectedts.reconciliation.records_removed,number of records removed during reconciliation runs on this node,Count,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.batch_hist_nanos,Time spent flushing a batch,Nanoseconds,HISTOGRAM,NANOSECONDS,AVG,NONE
+APPLICATION,logical_replication.catchup_ranges,Source side ranges undergoing catch up scans (inaccurate with multiple LDR jobs),Ranges,GAUGE,COUNT,AVG,NONE
+APPLICATION,logical_replication.catchup_ranges_by_label,Source side ranges undergoing catch up scans,Ranges,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.checkpoint_events_ingested,Checkpoint events ingested by all replication jobs,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.commit_latency,"Event commit latency: a difference between event MVCC timestamp and the time it was flushed into disk. If we batch events, then the difference between the oldest event in the batch and flush is recorded",Nanoseconds,HISTOGRAM,NANOSECONDS,AVG,NONE
 APPLICATION,logical_replication.events_dlqed,Row update events sent to DLQ,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_dlqed_age,Row update events sent to DLQ due to reaching the maximum time allowed in the retry queue,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
-APPLICATION,logical_replication.events_dlqed_by_label,Row update events sent to DLQ by label,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,logical_replication.events_dlqed_by_label,Row update events sent to DLQ by label,Failures,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_dlqed_errtype,Row update events sent to DLQ due to an error not considered retryable,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_dlqed_space,Row update events sent to DLQ due to capacity of the retry queue,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_ingested,Events ingested by all replication jobs,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
-APPLICATION,logical_replication.events_ingested_by_label,Events ingested by all replication jobs by label,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,logical_replication.events_ingested_by_label,Events ingested by all replication jobs by label,Events,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_initial_failure,Failed attempts to apply an incoming row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_initial_success,Successful applications of an incoming row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_retry_failure,Failed re-attempts to apply a row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
@@ -2291,10 +2296,12 @@ APPLICATION,logical_replication.kv.update_too_old,Total number of updates that w
 APPLICATION,logical_replication.kv.value_refreshes,Total number of batches that refreshed the previous value,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.logical_bytes,Logical bytes (sum of keys + values) received by all replication jobs,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.replan_count,Total number of dist sql replanning events,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
-APPLICATION,logical_replication.replicated_time_by_label,Replicated time of the logical replication stream by label,Seconds,COUNTER,SECONDS,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,logical_replication.replicated_time_by_label,Replicated time of the logical replication stream by label,Seconds,GAUGE,SECONDS,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.replicated_time_seconds,The replicated time of the logical replication stream in seconds since the unix epoch.,Seconds,GAUGE,SECONDS,AVG,NONE
 APPLICATION,logical_replication.retry_queue_bytes,The replicated time of the logical replication stream in seconds since the unix epoch.,Bytes,GAUGE,BYTES,AVG,NONE
 APPLICATION,logical_replication.retry_queue_events,The replicated time of the logical replication stream in seconds since the unix epoch.,Events,GAUGE,COUNT,AVG,NONE
+APPLICATION,logical_replication.scanning_ranges,Source side ranges undergoing an initial scan (inaccurate with multiple LDR jobs),Ranges,GAUGE,COUNT,AVG,NONE
+APPLICATION,logical_replication.scanning_ranges_by_label,Source side ranges undergoing an initial scan,Ranges,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,obs.tablemetadata.update_job.duration,Time spent running the update table metadata job.,Duration,HISTOGRAM,NANOSECONDS,AVG,NONE
 APPLICATION,obs.tablemetadata.update_job.errors,The total number of errors that have been emitted from the update table metadata job.,Errors,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,obs.tablemetadata.update_job.runs,The total number of runs of the update table metadata job.,Executions,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
diff --git a/src/current/_includes/v24.3/child-metrics-table.md b/src/current/_includes/v24.3/child-metrics-table.md
index 05d57c55453..2b1a16f092a 100644
--- a/src/current/_includes/v24.3/child-metrics-table.md
+++ b/src/current/_includes/v24.3/child-metrics-table.md
@@ -7,7 +7,7 @@
 Following is a list of the metrics that have child metrics:
 
 CockroachDB Metric Name
-Description When Aggregated
+{% if feature == "ldr" %}Description{% else %}Description When Aggregated{% endif %}
 Type
 Unit
diff --git a/src/current/v24.3/child-metrics.md b/src/current/v24.3/child-metrics.md
index 3618ec7d5ad..8fee7bd1a6a 100644
--- a/src/current/v24.3/child-metrics.md
+++ b/src/current/v24.3/child-metrics.md
@@ -110,6 +110,41 @@ changefeed_error_retries{node_id="1",scope="office_dogs"} 0
 {% assign feature = "changefeed" %}
 {% include {{ page.version.version }}/child-metrics-table.md %}
 
+## Clusters with logical data replication jobs
+
+When child metrics is enabled and [logical data replication (LDR) jobs with metrics labels]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}#metrics-labels) are created on the cluster, the `logical_replication_*_by_label` metrics are exported per LDR metric label. The value of `label` is set using the `label` option when the LDR job is created. The cardinality increases with the number of LDR metric labels.
+
+For example, when you create two LDR jobs with the metrics labels `ldr_job1` and `ldr_job2`, the `logical_replication_*_by_label` metrics export child metrics with a `label` of `ldr_job1` and `ldr_job2`:
+
+~~~
+# HELP logical_replication_replicated_time_by_label Replicated time of the logical replication stream by label
+# TYPE logical_replication_replicated_time_by_label gauge
+logical_replication_replicated_time_by_label{label="ldr_job2",node_id="2"} 1.73411035e+09
+logical_replication_replicated_time_by_label{label="ldr_job1",node_id="2"} 1.73411035e+09
+# HELP logical_replication_catchup_ranges_by_label Source side ranges undergoing catch up scans
+# TYPE logical_replication_catchup_ranges_by_label gauge
+logical_replication_catchup_ranges_by_label{label="ldr_job1",node_id="2"} 0
+logical_replication_catchup_ranges_by_label{label="ldr_job2",node_id="2"} 0
+# HELP logical_replication_scanning_ranges_by_label Source side ranges undergoing an initial scan
+# TYPE logical_replication_scanning_ranges_by_label gauge
+logical_replication_scanning_ranges_by_label{label="ldr_job1",node_id="2"} 0
+logical_replication_scanning_ranges_by_label{label="ldr_job2",node_id="2"} 0
+~~~
+
+Note that the `logical_replication_*` metrics without the `_by_label` suffix may be inaccurate when multiple LDR jobs are running:
+
+~~~
+# HELP logical_replication_catchup_ranges Source side ranges undergoing catch up scans (inaccurate with multiple LDR jobs)
+# TYPE logical_replication_catchup_ranges gauge
+logical_replication_catchup_ranges{node_id="2"} 0
+# HELP logical_replication_scanning_ranges Source side ranges undergoing an initial scan (inaccurate with multiple LDR jobs)
+# TYPE logical_replication_scanning_ranges gauge
+logical_replication_scanning_ranges{node_id="2"} 0
+~~~
+
+{% assign feature = "ldr" %}
+{% include {{ page.version.version }}/child-metrics-table.md %}
+
 ## Clusters with row-level TTL jobs
 
 When child metrics is enabled and [row-level TTL jobs]({% link {{ page.version.version }}/row-level-ttl.md %}) are created on the cluster with the [`ttl_label_metrics` storage parameter enabled]({% link {{ page.version.version }}/row-level-ttl.md %}#ttl-metrics), the `jobs.row_level_ttl.*` metrics are exported per TTL job with `ttl_label_metrics` enabled with a label for `relation`. The value of the `relation` label may have the format: `{database}_{schema}_{table}_{primary key}`. The cardinality increases with the number of TTL jobs with `ttl_label_metrics` enabled. An aggregated metric is also included.
diff --git a/src/current/v24.3/logical-data-replication-monitoring.md b/src/current/v24.3/logical-data-replication-monitoring.md
index 77730488cc9..240744851fa 100644
--- a/src/current/v24.3/logical-data-replication-monitoring.md
+++ b/src/current/v24.3/logical-data-replication-monitoring.md
@@ -114,11 +114,11 @@ You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer
 To view metrics at the job level, you can use the `label` option when you start LDR to add a metrics label to the LDR job. This enables [child metric]({% link {{ page.version.version }}/child-metrics.md %}) export, which are Prometheus time series with extra labels.
 
 You can track the following metrics for an LDR job with labels:
 
-- `logical_replication.replicated_time_seconds`
-- `logical_replication.events_ingested`
-- `logical_replication.events_dlqed`
-- `logical_replication.scanning_ranges`
-- `logical_replication.catchup_ranges`
+- `logical_replication.catchup_ranges_by_label`
+- `logical_replication.events_dlqed_by_label`
+- `logical_replication.events_ingested_by_label`
+- `logical_replication.replicated_time_by_label`
+- `logical_replication.scanning_ranges_by_label`
 
 To use metrics labels, ensure you have enabled the child metrics cluster setting:
@@ -136,7 +136,7 @@ ON 'external://{source_external_connection}'
 INTO TABLE {database.public.table_name}
 WITH label=ldr_job;
 ~~~
 
-For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}) page.
+For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}#clusters-with-logical-data-replication-jobs) page.
 
 ### Datadog
diff --git a/src/current/v24.3/ui-logical-data-replication-dashboard.md b/src/current/v24.3/ui-logical-data-replication-dashboard.md
index 468ac5d380e..40ec394da3e 100644
--- a/src/current/v24.3/ui-logical-data-replication-dashboard.md
+++ b/src/current/v24.3/ui-logical-data-replication-dashboard.md
@@ -1,6 +1,6 @@
 ---
 title: Logical Data Replication Dashboard
-summary: The Physical Cluster Replication Dashboard lets you monitor and observe replication streams between a primary and standby cluster.
+summary: The Logical Data Replication Dashboard lets you monitor and observe logical data replication jobs on the destination cluster.
 toc: true
 docs_area: reference.db_console
 ---
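As context for the docs changes above: the `*_by_label` child metrics use the standard Prometheus text exposition format, so they are easy to consume programmatically. A minimal sketch, using only the Python standard library and a sample string mirroring the example output in `child-metrics.md` (the `SAMPLE` text and `parse_by_label` helper are illustrative, not part of CockroachDB or its docs):

```python
import re

# Sample exposition text, copied from the child-metrics.md example above.
SAMPLE = """\
logical_replication_replicated_time_by_label{label="ldr_job2",node_id="2"} 1.73411035e+09
logical_replication_replicated_time_by_label{label="ldr_job1",node_id="2"} 1.73411035e+09
logical_replication_catchup_ranges_by_label{label="ldr_job1",node_id="2"} 0
logical_replication_catchup_ranges_by_label{label="ldr_job2",node_id="2"} 0
"""

# Matches one sample line: metric name, label value, node_id, numeric value.
LINE = re.compile(r'^(\w+)\{label="([^"]+)",node_id="(\d+)"\}\s+(\S+)$')

def parse_by_label(text):
    """Map (metric name, label) -> float value for *_by_label child metrics."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            metric, label, _node_id, value = m.groups()
            out[(metric, label)] = float(value)
    return out

metrics = parse_by_label(SAMPLE)
print(metrics[("logical_replication_catchup_ranges_by_label", "ldr_job1")])  # 0.0
```

In practice the same parsing would be applied to the output of a node's Prometheus endpoint, where each LDR job started with the `label` option contributes its own `label="..."` child series.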