DOC-11722 Create a section on the Child Metrics page for LDR (#19230)

(1) In child-metrics.md, added a section for clusters with logical data replication jobs.
(2) In child-metrics.yml, added the LDR metrics.
(3) In metrics-list.csv, regenerated the file with the v24.3.0 binary by running cockroach gen metric-list --format=csv > metrics-list.csv, then manually lower-cased the column headers, corrected the spelling of "inaccurate", and changed COUNTER to GAUGE for the five _by_label metrics.
(4) In logical-data-replication-monitoring.md, (a) updated the link to point to the new section in child-metrics.md and (b) updated the metrics to their _by_label names.
(5) In the child-metrics-table.md include file, modified the Liquid code to display Description as the column header for the ldr feature.
(6) In ui-logical-data-replication-dashboard.md, corrected the summary in the front matter.
cockroachdb · Dec 16, 2024 · 6c1734c
1 parent fef0a38
commit 6c1734c
Showing 6 changed files with 70 additions and 12 deletions.
diff --git a/src/current/_data/metrics/child-metrics.yml b/src/current/_data/metrics/child-metrics.yml
@@ -233,4 +233,20 @@
   feature: all
 
 - child_metric_id: rpc.connection.avg_round_trip_latency
-  feature: all
+  feature: all
+
+- child_metric_id: logical_replication.catchup_ranges_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.events_dlqed_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.events_ingested_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.replicated_time_by_label
+  feature: ldr
+
+- child_metric_id: logical_replication.scanning_ranges_by_label
+  feature: ldr
+
diff --git a/src/current/_data/metrics/metrics-list.csv b/src/current/_data/metrics/metrics-list.csv
@@ -978,6 +978,8 @@ A likely cause of having a checkpoint is that one of the ranges in this store
 had inconsistent data among its replicas. Such checkpoint directories are
 located in auxiliary/checkpoints/rN_at_M, where N is the range ID, and M is the
 Raft applied index at which this checkpoint was taken.",Directories,GAUGE,COUNT,AVG,NONE
+STORAGE,storage.compactions.cancelled.bytes,Cumulative volume of data written to sstables during compactions that were ultimately cancelled due to a conflicting operation.,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
+STORAGE,storage.compactions.cancelled.count,Cumulative count of compactions that were cancelled before they completed due to a conflicting operation.,Compactions,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 STORAGE,storage.compactions.duration,"Cumulative sum of all compaction durations.
 
 The rate of this value provides the effective compaction concurrency of a store,
@@ -1308,6 +1310,7 @@ APPLICATION,distsender.batch_responses.cross_zone.bytes,"Total byte count of rep
 		monitor the data transmitted.",Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batch_responses.replica_addressed.bytes,Total byte count of replica-addressed batch responses received,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batches,Number of batches processed,Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,distsender.batches.async.in_progress,Number of partial batches currently being executed asynchronously,Partial Batches,GAUGE,COUNT,AVG,NONE
 APPLICATION,distsender.batches.async.sent,Number of partial batches sent asynchronously,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batches.async.throttled,Number of partial batches not sent asynchronously due to throttling,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,distsender.batches.partial,Number of partial batches processed after being divided on range boundaries,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
@@ -2274,15 +2277,17 @@ APPLICATION,kv.protectedts.reconciliation.num_runs,number of successful reconcil
 APPLICATION,kv.protectedts.reconciliation.records_processed,number of records processed without error during reconciliation on this node,Count,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,kv.protectedts.reconciliation.records_removed,number of records removed during reconciliation runs on this node,Count,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.batch_hist_nanos,Time spent flushing a batch,Nanoseconds,HISTOGRAM,NANOSECONDS,AVG,NONE
+APPLICATION,logical_replication.catchup_ranges,Source side ranges undergoing catch up scans (inaccurate with multiple LDR jobs),Ranges,GAUGE,COUNT,AVG,NONE
+APPLICATION,logical_replication.catchup_ranges_by_label,Source side ranges undergoing catch up scans,Ranges,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.checkpoint_events_ingested,Checkpoint events ingested by all replication jobs,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.commit_latency,"Event commit latency: a difference between event MVCC timestamp and the time it was flushed into disk. If we batch events, then the difference between the oldest event in the batch and flush is recorded",Nanoseconds,HISTOGRAM,NANOSECONDS,AVG,NONE
 APPLICATION,logical_replication.events_dlqed,Row update events sent to DLQ,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_dlqed_age,Row update events sent to DLQ due to reaching the maximum time allowed in the retry queue,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
-APPLICATION,logical_replication.events_dlqed_by_label,Row update events sent to DLQ by label,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,logical_replication.events_dlqed_by_label,Row update events sent to DLQ by label,Failures,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_dlqed_errtype,Row update events sent to DLQ due to an error not considered retryable,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_dlqed_space,Row update events sent to DLQ due to capacity of the retry queue,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_ingested,Events ingested by all replication jobs,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
-APPLICATION,logical_replication.events_ingested_by_label,Events ingested by all replication jobs by label,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,logical_replication.events_ingested_by_label,Events ingested by all replication jobs by label,Events,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_initial_failure,Failed attempts to apply an incoming row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_initial_success,Successful applications of an incoming row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.events_retry_failure,Failed re-attempts to apply a row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
@@ -2291,10 +2296,12 @@ APPLICATION,logical_replication.kv.update_too_old,Total number of updates that w
 APPLICATION,logical_replication.kv.value_refreshes,Total number of batches that refreshed the previous value,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.logical_bytes,Logical bytes (sum of keys + values) received by all replication jobs,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.replan_count,Total number of dist sql replanning events,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
-APPLICATION,logical_replication.replicated_time_by_label,Replicated time of the logical replication stream by label,Seconds,COUNTER,SECONDS,AVG,NON_NEGATIVE_DERIVATIVE
+APPLICATION,logical_replication.replicated_time_by_label,Replicated time of the logical replication stream by label,Seconds,GAUGE,SECONDS,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,logical_replication.replicated_time_seconds,The replicated time of the logical replication stream in seconds since the unix epoch.,Seconds,GAUGE,SECONDS,AVG,NONE
 APPLICATION,logical_replication.retry_queue_bytes,The replicated time of the logical replication stream in seconds since the unix epoch.,Bytes,GAUGE,BYTES,AVG,NONE
 APPLICATION,logical_replication.retry_queue_events,The replicated time of the logical replication stream in seconds since the unix epoch.,Events,GAUGE,COUNT,AVG,NONE
+APPLICATION,logical_replication.scanning_ranges,Source side ranges undergoing an initial scan (inaccurate with multiple LDR jobs),Ranges,GAUGE,COUNT,AVG,NONE
+APPLICATION,logical_replication.scanning_ranges_by_label,Source side ranges undergoing an initial scan,Ranges,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,obs.tablemetadata.update_job.duration,Time spent running the update table metadata job.,Duration,HISTOGRAM,NANOSECONDS,AVG,NONE
 APPLICATION,obs.tablemetadata.update_job.errors,The total number of errors that have been emitted from the update table metadata job.,Errors,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
 APPLICATION,obs.tablemetadata.update_job.runs,The total number of runs of the update table metadata job.,Executions,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
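
The commit message describes the metrics-list.csv cleanup as a manual step. As an illustration only, the same post-processing could be sketched in Python. This is not the tooling used for this commit. The function name `clean_metrics_csv` is hypothetical, and the sketch assumes the metric name is the second field and the metric type the fifth, as in the rows shown in the diff above. The spelling correction mentioned in the commit message is not covered.

```python
import csv
import io

NAME_COLUMN = 1  # assumption: metric name is the second field, as in the diff rows
TYPE_COLUMN = 4  # assumption: metric type is the fifth field, as in the diff rows


def clean_metrics_csv(raw_csv: str) -> str:
    """Lower-case the header row and mark LDR _by_label metrics as GAUGE."""
    rows = list(csv.reader(io.StringIO(raw_csv)))
    if not rows:
        return ""

    header = [column.lower() for column in rows[0]]
    body = rows[1:]

    for row in body:
        if len(row) <= TYPE_COLUMN:
            continue  # skip malformed or short rows untouched
        name = row[NAME_COLUMN]
        if name.startswith("logical_replication.") and name.endswith("_by_label"):
            row[TYPE_COLUMN] = "GAUGE"

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(body)
    return out.getvalue()
```

The `csv` module is used instead of splitting on commas because several descriptions in metrics-list.csv are quoted and span multiple lines.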

diff --git a/src/current/_includes/v24.3/child-metrics-table.md b/src/current/_includes/v24.3/child-metrics-table.md
@@ -7,7 +7,7 @@ Following is a list of the metrics that have child metrics:
     <thead>
         <tr>
             <td><b>CockroachDB Metric Name</b></td>
-            <td><b>Description When Aggregated</b></td>
+            <td><b>{% if feature == "ldr" %}Description{% else %}Description When Aggregated{% endif %}</b></td>
             <td><b>Type</b></td>
             <td><b>Unit</b></td>
         </tr>

diff --git a/src/current/v24.3/child-metrics.md b/src/current/v24.3/child-metrics.md
@@ -110,6 +110,41 @@ changefeed_error_retries{node_id="1",scope="office_dogs"} 0
 {% assign feature = "changefeed" %}
 {% include {{ page.version.version }}/child-metrics-table.md %}
 
+## Clusters with logical data replication jobs
+
+When child metrics is enabled and [logical data replication (LDR) jobs with metrics labels]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}#metrics-labels) are created on the cluster, the `logical_replication_*_by_label` metrics are exported per LDR metric label. The `label` may have the values set using the `label` option. The cardinality increases with the number of LDR metric labels.
+
+For example, when you create two LDR jobs with the metrics labels `ldr_job1` and `ldr_job2`, the metrics `logical_replication_*_by_label` export child metrics with a `label` for `ldr_job1` and `ldr_job2`.
+
+~~~
+# HELP logical_replication_replicated_time_by_label Replicated time of the logical replication stream by label
+# TYPE logical_replication_replicated_time_by_label gauge
+logical_replication_replicated_time_by_label{label="ldr_job2",node_id="2"} 1.73411035e+09
+logical_replication_replicated_time_by_label{label="ldr_job1",node_id="2"} 1.73411035e+09
+# HELP logical_replication_catchup_ranges_by_label Source side ranges undergoing catch up scans
+# TYPE logical_replication_catchup_ranges_by_label gauge
+logical_replication_catchup_ranges_by_label{label="ldr_job1",node_id="2"} 0
+logical_replication_catchup_ranges_by_label{label="ldr_job2",node_id="2"} 0
+# HELP logical_replication_scanning_ranges_by_label Source side ranges undergoing an initial scan
+# TYPE logical_replication_scanning_ranges_by_label gauge
+logical_replication_scanning_ranges_by_label{label="ldr_job1",node_id="2"} 0
+logical_replication_scanning_ranges_by_label{label="ldr_job2",node_id="2"} 0
+~~~
+
+Note that the `logical_replication_*` metrics without the `_by_label` suffix may be `inaccurate with multiple LDR jobs`.
+
+~~~
+# HELP logical_replication_catchup_ranges Source side ranges undergoing catch up scans (inaccurate with multiple LDR jobs)
+# TYPE logical_replication_catchup_ranges gauge
+logical_replication_catchup_ranges{node_id="2"} 0
+# HELP logical_replication_scanning_ranges Source side ranges undergoing an initial scan (inaccurate with multiple LDR jobs)
+# TYPE logical_replication_scanning_ranges gauge
+logical_replication_scanning_ranges{node_id="2"} 0
+~~~
+
+{% assign feature = "ldr" %}
+{% include {{ page.version.version }}/child-metrics-table.md %}
+
 ## Clusters with row-level TTL jobs
 
 When child metrics is enabled and [row-level TTL jobs]({% link {{ page.version.version }}/row-level-ttl.md %}) are created on the cluster with the [`ttl_label_metrics` storage parameter enabled]({% link {{ page.version.version }}/row-level-ttl.md %}#ttl-metrics), the `jobs.row_level_ttl.*` metrics are exported per TTL job with `ttl_label_metrics` enabled with a label for `relation`. The value of the `relation` label may have the format: `{database}_{schema}_{table}_{primary key}`. The cardinality increases with the number of TTL jobs with `ttl_label_metrics` enabled. An aggregated metric is also included.
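
To make the per-label export described in the new LDR section concrete, the following is a minimal sketch that groups `_by_label` samples from Prometheus text output by their `label` value. The function name `child_metrics_by_label` and the regular expressions are illustrative assumptions, not part of CockroachDB or any Prometheus client library. The sketch handles only the simple sample lines shown in the example output and expects text scraped from a single node.

```python
import re
from collections import defaultdict

# Matches a sample line such as: metric_name{label="x",node_id="2"} 0
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def child_metrics_by_label(exposition_text: str) -> dict:
    """Return {ldr_label: {metric_name: value}} for *_by_label samples."""
    grouped = defaultdict(dict)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue
        name = match.group("name")
        if not name.endswith("_by_label"):
            continue  # ignore metrics without the _by_label suffix
        labels = dict(LABEL_PAIR.findall(match.group("labels")))
        if "label" not in labels:
            continue
        grouped[labels["label"]][name] = float(match.group("value"))
    return dict(grouped)
```

Called on the example output above, this would return one entry for `ldr_job1` and one for `ldr_job2`, each holding that job's `_by_label` values. The samples without the `_by_label` suffix are skipped.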

diff --git a/src/current/v24.3/logical-data-replication-monitoring.md b/src/current/v24.3/logical-data-replication-monitoring.md
@@ -114,11 +114,11 @@ You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer
 
 To view metrics at the job level, you can use the `label` option when you start LDR to add a metrics label to the LDR job. This enables [child metric]({% link {{ page.version.version }}/child-metrics.md %}) export, which are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:
 
-- `logical_replication.replicated_time_seconds`
-- `logical_replication.events_ingested`
-- `logical_replication.events_dlqed`
-- `logical_replication.scanning_ranges`
-- `logical_replication.catchup_ranges`
+- `logical_replication.catchup_ranges_by_label`
+- `logical_replication.events_dlqed_by_label`
+- `logical_replication.events_ingested_by_label`
+- `logical_replication.replicated_time_by_label`
+- `logical_replication.scanning_ranges_by_label`
 
 To use metrics labels, ensure you have enabled the child metrics cluster setting:
 
@@ -136,7 +136,7 @@ ON 'external://{source_external_connection}'
 INTO TABLE {database.public.table_name} WITH label=ldr_job;
 ~~~
 
-For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}) page.
+For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}#clusters-with-logical-data-replication-jobs) page.
 
 ### Datadog
 

diff --git a/src/current/v24.3/ui-logical-data-replication-dashboard.md b/src/current/v24.3/ui-logical-data-replication-dashboard.md
@@ -1,6 +1,6 @@
 ---
 title: Logical Data Replication Dashboard
-summary: The Physical Cluster Replication Dashboard lets you monitor and observe replication streams between a primary and standby cluster.
+summary: The Logical Data Replication Dashboard lets you monitor and observe logical data replication jobs on the destination cluster.
 toc: true
 docs_area: reference.db_console
 ---