DOC-11722 Create a section on the Child Metrics page for LDR (#19230)
(1) In child-metrics.md, added section for Clusters with logical data replication jobs.
(2) In child-metrics.yml, added ldr metrics.
(3) In metrics-list.csv, with the v24.3.0 binary, ran cockroach gen metric-list --format=csv > metrics-list.csv, then manually lower-cased the column headers, corrected the spelling of "inaccurate", and changed COUNTER to GAUGE for five _by_label metrics.
(4) In logical-data-replication-monitoring.md, (a) updated the link to point to the new section in child-metrics.md and (b) updated the metrics to their _by_label names.
(5) In the child-metrics-table.md include file, modified the Liquid code to display Description as the column header for the ldr feature.
(6) In ui-logical-data-replication-dashboard.md, corrected the summary in the frontmatter.
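
The manual CSV edits in step (3) could in principle be scripted. The following is a minimal sketch only, not something this commit contains: the helper name `postprocess_metric_rows` is hypothetical, and the column positions (metric name in the second column, metric type in the fifth) are an assumption read off the rows shown in the diff. The spelling fix is left out because the commit does not show the original misspelling.

```python
import csv
import io


def postprocess_metric_rows(csv_text: str) -> str:
    """Hypothetical helper approximating the manual edits in step (3).

    Assumes column 1 is the metric name and column 4 is the metric type,
    as in the rows shown in the metrics-list.csv diff.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for index, row in enumerate(reader):
        if index == 0:
            # Lower-case whatever header row the generator emitted.
            row = [column.lower() for column in row]
        elif (
            len(row) > 4
            and row[1].startswith("logical_replication.")
            and row[1].endswith("_by_label")
            and row[4] == "COUNTER"
        ):
            # Report the LDR _by_label metrics as gauges.
            row[4] = "GAUGE"
        writer.writerow(row)
    return out.getvalue()
```

The csv module is used instead of line-by-line string edits because several descriptions in metrics-list.csv are quoted fields that span multiple lines.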
florence-crl authored Dec 16, 2024
1 parent fef0a38 commit 6c1734c
Showing 6 changed files with 70 additions and 12 deletions.
18 changes: 17 additions & 1 deletion src/current/_data/metrics/child-metrics.yml
@@ -233,4 +233,20 @@
feature: all

- child_metric_id: rpc.connection.avg_round_trip_latency
feature: all
feature: all

- child_metric_id: logical_replication.catchup_ranges_by_label
feature: ldr

- child_metric_id: logical_replication.events_dlqed_by_label
feature: ldr

- child_metric_id: logical_replication.events_ingested_by_label
feature: ldr

- child_metric_id: logical_replication.replicated_time_by_label
feature: ldr

- child_metric_id: logical_replication.scanning_ranges_by_label
feature: ldr

13 changes: 10 additions & 3 deletions src/current/_data/metrics/metrics-list.csv
@@ -978,6 +978,8 @@ A likely cause of having a checkpoint is that one of the ranges in this store
had inconsistent data among its replicas. Such checkpoint directories are
located in auxiliary/checkpoints/rN_at_M, where N is the range ID, and M is the
Raft applied index at which this checkpoint was taken.",Directories,GAUGE,COUNT,AVG,NONE
STORAGE,storage.compactions.cancelled.bytes,Cumulative volume of data written to sstables during compactions that were ultimately cancelled due to a conflicting operation.,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
STORAGE,storage.compactions.cancelled.count,Cumulative count of compactions that were cancelled before they completed due to a conflicting operation.,Compactions,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
STORAGE,storage.compactions.duration,"Cumulative sum of all compaction durations.

The rate of this value provides the effective compaction concurrency of a store,
@@ -1308,6 +1310,7 @@ APPLICATION,distsender.batch_responses.cross_zone.bytes,"Total byte count of rep
monitor the data transmitted.",Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,distsender.batch_responses.replica_addressed.bytes,Total byte count of replica-addressed batch responses received,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,distsender.batches,Number of batches processed,Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,distsender.batches.async.in_progress,Number of partial batches currently being executed asynchronously,Partial Batches,GAUGE,COUNT,AVG,NONE
APPLICATION,distsender.batches.async.sent,Number of partial batches sent asynchronously,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,distsender.batches.async.throttled,Number of partial batches not sent asynchronously due to throttling,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,distsender.batches.partial,Number of partial batches processed after being divided on range boundaries,Partial Batches,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
@@ -2274,15 +2277,17 @@ APPLICATION,kv.protectedts.reconciliation.num_runs,number of successful reconcil
APPLICATION,kv.protectedts.reconciliation.records_processed,number of records processed without error during reconciliation on this node,Count,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,kv.protectedts.reconciliation.records_removed,number of records removed during reconciliation runs on this node,Count,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.batch_hist_nanos,Time spent flushing a batch,Nanoseconds,HISTOGRAM,NANOSECONDS,AVG,NONE
APPLICATION,logical_replication.catchup_ranges,Source side ranges undergoing catch up scans (inaccurate with multiple LDR jobs),Ranges,GAUGE,COUNT,AVG,NONE
APPLICATION,logical_replication.catchup_ranges_by_label,Source side ranges undergoing catch up scans,Ranges,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.checkpoint_events_ingested,Checkpoint events ingested by all replication jobs,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.commit_latency,"Event commit latency: a difference between event MVCC timestamp and the time it was flushed into disk. If we batch events, then the difference between the oldest event in the batch and flush is recorded",Nanoseconds,HISTOGRAM,NANOSECONDS,AVG,NONE
APPLICATION,logical_replication.events_dlqed,Row update events sent to DLQ,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_dlqed_age,Row update events sent to DLQ due to reaching the maximum time allowed in the retry queue,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_dlqed_by_label,Row update events sent to DLQ by label,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_dlqed_by_label,Row update events sent to DLQ by label,Failures,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_dlqed_errtype,Row update events sent to DLQ due to an error not considered retryable,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_dlqed_space,Row update events sent to DLQ due to capacity of the retry queue,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_ingested,Events ingested by all replication jobs,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_ingested_by_label,Events ingested by all replication jobs by label,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_ingested_by_label,Events ingested by all replication jobs by label,Events,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_initial_failure,Failed attempts to apply an incoming row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_initial_success,Successful applications of an incoming row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.events_retry_failure,Failed re-attempts to apply a row update,Failures,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
@@ -2291,10 +2296,12 @@ APPLICATION,logical_replication.kv.update_too_old,Total number of updates that w
APPLICATION,logical_replication.kv.value_refreshes,Total number of batches that refreshed the previous value,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.logical_bytes,Logical bytes (sum of keys + values) received by all replication jobs,Bytes,COUNTER,BYTES,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.replan_count,Total number of dist sql replanning events,Events,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.replicated_time_by_label,Replicated time of the logical replication stream by label,Seconds,COUNTER,SECONDS,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.replicated_time_by_label,Replicated time of the logical replication stream by label,Seconds,GAUGE,SECONDS,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,logical_replication.replicated_time_seconds,The replicated time of the logical replication stream in seconds since the unix epoch.,Seconds,GAUGE,SECONDS,AVG,NONE
APPLICATION,logical_replication.retry_queue_bytes,The replicated time of the logical replication stream in seconds since the unix epoch.,Bytes,GAUGE,BYTES,AVG,NONE
APPLICATION,logical_replication.retry_queue_events,The replicated time of the logical replication stream in seconds since the unix epoch.,Events,GAUGE,COUNT,AVG,NONE
APPLICATION,logical_replication.scanning_ranges,Source side ranges undergoing an initial scan (inaccurate with multiple LDR jobs),Ranges,GAUGE,COUNT,AVG,NONE
APPLICATION,logical_replication.scanning_ranges_by_label,Source side ranges undergoing an initial scan,Ranges,GAUGE,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,obs.tablemetadata.update_job.duration,Time spent running the update table metadata job.,Duration,HISTOGRAM,NANOSECONDS,AVG,NONE
APPLICATION,obs.tablemetadata.update_job.errors,The total number of errors that have been emitted from the update table metadata job.,Errors,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
APPLICATION,obs.tablemetadata.update_job.runs,The total number of runs of the update table metadata job.,Executions,COUNTER,COUNT,AVG,NON_NEGATIVE_DERIVATIVE
2 changes: 1 addition & 1 deletion src/current/_includes/v24.3/child-metrics-table.md
@@ -7,7 +7,7 @@ Following is a list of the metrics that have child metrics:
<thead>
<tr>
<td><b>CockroachDB Metric Name</b></td>
<td><b>Description When Aggregated</b></td>
<td><b>{% if feature == "ldr" %}Description{% else %}Description When Aggregated{% endif %}</b></td>
<td><b>Type</b></td>
<td><b>Unit</b></td>
</tr>
35 changes: 35 additions & 0 deletions src/current/v24.3/child-metrics.md
@@ -110,6 +110,41 @@ changefeed_error_retries{node_id="1",scope="office_dogs"} 0
{% assign feature = "changefeed" %}
{% include {{ page.version.version }}/child-metrics-table.md %}

## Clusters with logical data replication jobs

When child metrics is enabled and [logical data replication (LDR) jobs with metrics labels]({% link {{ page.version.version }}/logical-data-replication-monitoring.md %}#metrics-labels) are created on the cluster, the `logical_replication_*_by_label` metrics are exported per LDR metric label. The `label` may have the values set using the `label` option. The cardinality increases with the number of LDR metric labels.

For example, when you create two LDR jobs with the metrics labels `ldr_job1` and `ldr_job2`, the metrics `logical_replication_*_by_label` export child metrics with a `label` for `ldr_job1` and `ldr_job2`.

~~~
# HELP logical_replication_replicated_time_by_label Replicated time of the logical replication stream by label
# TYPE logical_replication_replicated_time_by_label gauge
logical_replication_replicated_time_by_label{label="ldr_job2",node_id="2"} 1.73411035e+09
logical_replication_replicated_time_by_label{label="ldr_job1",node_id="2"} 1.73411035e+09
# HELP logical_replication_catchup_ranges_by_label Source side ranges undergoing catch up scans
# TYPE logical_replication_catchup_ranges_by_label gauge
logical_replication_catchup_ranges_by_label{label="ldr_job1",node_id="2"} 0
logical_replication_catchup_ranges_by_label{label="ldr_job2",node_id="2"} 0
# HELP logical_replication_scanning_ranges_by_label Source side ranges undergoing an initial scan
# TYPE logical_replication_scanning_ranges_by_label gauge
logical_replication_scanning_ranges_by_label{label="ldr_job1",node_id="2"} 0
logical_replication_scanning_ranges_by_label{label="ldr_job2",node_id="2"} 0
~~~
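
To illustrate how the per-label child metrics above might be consumed, the following is a minimal sketch that groups samples from Prometheus text output by their `label` value. The function name `group_by_ldr_label` is hypothetical and not part of CockroachDB or Prometheus. The regular expressions are a simplification that assumes label values contain no escaped quotes or braces, and the sketch assumes output scraped from a single node, so each metric appears once per label.

```python
import re
from collections import defaultdict

# Simplified patterns: assume no escaped quotes or braces in label values.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def group_by_ldr_label(exposition_text):
    """Return {ldr_label: {metric_name: value}} for samples that carry `label`."""
    grouped = defaultdict(dict)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a labeled sample line
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if "label" not in labels:
            continue  # aggregated metric without an LDR metrics label
        grouped[labels["label"]][match.group("name")] = float(match.group("value"))
    return dict(grouped)
```

Passing the sample output shown above would yield one entry for `ldr_job1` and one for `ldr_job2`, each holding that job's `_by_label` values. For anything beyond a quick check, a dedicated Prometheus client or query is likely the better choice.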

Note that the `logical_replication_*` metrics without the `_by_label` suffix may be `inaccurate with multiple LDR jobs`.

~~~
# HELP logical_replication_catchup_ranges Source side ranges undergoing catch up scans (inaccurate with multiple LDR jobs)
# TYPE logical_replication_catchup_ranges gauge
logical_replication_catchup_ranges{node_id="2"} 0
# HELP logical_replication_scanning_ranges Source side ranges undergoing an initial scan (inaccurate with multiple LDR jobs)
# TYPE logical_replication_scanning_ranges gauge
logical_replication_scanning_ranges{node_id="2"} 0
~~~

{% assign feature = "ldr" %}
{% include {{ page.version.version }}/child-metrics-table.md %}

## Clusters with row-level TTL jobs

When child metrics is enabled and [row-level TTL jobs]({% link {{ page.version.version }}/row-level-ttl.md %}) are created on the cluster with the [`ttl_label_metrics` storage parameter enabled]({% link {{ page.version.version }}/row-level-ttl.md %}#ttl-metrics), the `jobs.row_level_ttl.*` metrics are exported per TTL job with `ttl_label_metrics` enabled with a label for `relation`. The value of the `relation` label may have the format: `{database}_{schema}_{table}_{primary key}`. The cardinality increases with the number of TTL jobs with `ttl_label_metrics` enabled. An aggregated metric is also included.
12 changes: 6 additions & 6 deletions src/current/v24.3/logical-data-replication-monitoring.md
@@ -114,11 +114,11 @@ You can use Prometheus and Alertmanager to track and alert on LDR metrics. Refer

To view metrics at the job level, you can use the `label` option when you start LDR to add a metrics label to the LDR job. This enables [child metric]({% link {{ page.version.version }}/child-metrics.md %}) export, which are Prometheus time series with extra labels. You can track the following metrics for an LDR job with labels:

- `logical_replication.replicated_time_seconds`
- `logical_replication.events_ingested`
- `logical_replication.events_dlqed`
- `logical_replication.scanning_ranges`
- `logical_replication.catchup_ranges`
- `logical_replication.catchup_ranges_by_label`
- `logical_replication.events_dlqed_by_label`
- `logical_replication.events_ingested_by_label`
- `logical_replication.replicated_time_by_label`
- `logical_replication.scanning_ranges_by_label`

To use metrics labels, ensure you have enabled the child metrics cluster setting:

@@ -136,7 +136,7 @@ ON 'external://{source_external_connection}'
INTO TABLE {database.public.table_name} WITH label=ldr_job;
~~~

For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}) page.
For a full reference on tracking metrics with labels, refer to the [Child Metrics]({% link {{ page.version.version }}/child-metrics.md %}#clusters-with-logical-data-replication-jobs) page.

### Datadog

2 changes: 1 addition & 1 deletion src/current/v24.3/ui-logical-data-replication-dashboard.md
@@ -1,6 +1,6 @@
---
title: Logical Data Replication Dashboard
summary: The Physical Cluster Replication Dashboard lets you monitor and observe replication streams between a primary and standby cluster.
summary: The Logical Data Replication Dashboard lets you monitor and observe logical data replication jobs on the destination cluster.
toc: true
docs_area: reference.db_console
---