PeerDAS metrics: add data column, kzg, custody metrics #14

Draft: wants to merge 7 commits into `master`

`metrics.md`: 22 changes (21 additions, 1 deletion)
@@ -30,10 +30,30 @@ The following are the minimal metrics agreed to be conformed by the various clie
| `beacon_processed_deposits_total` | Gauge | Total number of deposits processed | On epoch transition |

\* All `*_root` values are converted to signed 64-bit integers utilizing the last 8 bytes interpreted as little-endian (`int.from_bytes(root[24:32], byteorder='little', signed=True)`).
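The `*_root` conversion in the footnote can be wrapped in a small helper. This is a minimal sketch, assuming roots arrive as 32-byte `bytes` values; the name `root_to_metric_value` is hypothetical and not part of any client.

```python
def root_to_metric_value(root: bytes) -> int:
    """Hypothetical helper: convert a 32-byte root to a signed 64-bit metric value.

    Takes the last 8 bytes (root[24:32]) and interprets them as a
    little-endian signed integer, as described in the footnote.
    """
    if len(root) != 32:
        raise ValueError(f"expected a 32-byte root, got {len(root)} bytes")
    return int.from_bytes(root[24:32], byteorder="little", signed=True)


# Usage: a root whose final byte is 0x80 sets the sign bit of the int64 value.
example_root = bytes(31) + b"\x80"
print(root_to_metric_value(example_root))  # → -9223372036854775808
```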

### PeerDAS Metrics

The following metrics are proposed to be added to clients for PeerDAS monitoring. This list is open for discussion. Each client has the opportunity to contribute to it by suggesting additions or disputing existing metrics.

#### Data column, kzg, custody metrics

| Name | Metric type | Usage | Sample collection event |
|--------------------------------------------|-------------|-------------------------------------------------------------|----------------------|
| `beacon_data_column_sidecar_processing_requests_total` | Counter | Number of data column sidecars submitted for processing | On data column sidecar gossip verification |

The name and description may be a bit confusing here, as "processing" could mean different things in different clients. The description makes sense in Lighthouse because we have a task scheduling component called BeaconProcessor, and all tasks are "submitted" to this scheduler for processing, so you see this terminology used quite frequently in our metrics.

It's more of a convenience metric than a must-have in Lighthouse, as we can get the same data from the observation count of `beacon_data_column_sidecar_gossip_verification_seconds` (its `_count` series). IMO we should minimise the number of standardised metrics, so we don't force all clients to implement metrics that aren't necessary for them. Each one comes with a maintenance cost (once introduced, renaming or removing it would be a breaking change) as well as extra Prometheus storage cost.

It may still make sense to have this metric though.

For a few of our metrics we discard the timer if the operation fails, so the timer may not reflect the actual attempt count.

Is it also worth mentioning whether we should discard a timer metric if the operation fails?
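To illustrate the point about discarded timers: if a client records the verification timer only when verification succeeds, the histogram's observation count tracks successes rather than attempts, so a separate requests counter still carries information. The sketch below uses hypothetical minimal `Counter` and `Histogram` classes instead of a real Prometheus client library, and a hypothetical `verify_sidecar` callback.

```python
import time
from typing import Callable


class Counter:
    """Hypothetical minimal monotonically increasing counter."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self) -> None:
        self.value += 1


class Histogram:
    """Hypothetical minimal histogram tracking only observation count and sum."""

    def __init__(self) -> None:
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        self.count += 1
        self.sum += value


requests_total = Counter()
successes_total = Counter()
verification_ms = Histogram()


def process_sidecar(sidecar: object, verify_sidecar: Callable[[object], bool]) -> bool:
    requests_total.inc()  # incremented for every attempt
    start = time.perf_counter()
    ok = verify_sidecar(sidecar)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    if ok:
        successes_total.inc()
        verification_ms.observe(elapsed_ms)  # timing kept only on success
    # On failure the timing is discarded, so verification_ms.count can be
    # lower than requests_total.value.
    return ok


# Usage: three attempts, one of which fails verification.
for valid in (True, False, True):
    process_sidecar(valid, lambda s: bool(s))
print(requests_total.value, successes_total.value, verification_ms.count)  # → 3 2 2
```

With this design, `requests_total` minus the histogram count gives the number of failed attempts, which the histogram alone cannot provide.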

Ah right, I see this is already implemented in Teku. If it makes sense, feel free to ignore the above comment!

| `beacon_data_column_sidecar_processing_successes_total` | Counter | Number of data column sidecars verified for gossip | On data column sidecar gossip verification |
| `beacon_data_column_sidecar_gossip_verification_milliseconds` | Histogram | Full runtime of data column sidecar gossip verification | On data column sidecar gossip verification |
| `beacon_data_availability_reconstructed_columns_total` | Counter | Total count of reconstructed columns | On data column kzg verification |
| `beacon_data_availability_reconstruction_time_milliseconds` | Histogram | Time taken to reconstruct columns | On data column kzg verification |
| `beacon_data_column_sidecar_computation_seconds` | Histogram | Time taken to compute data column sidecar, including cells, proofs and inclusion proof | On data column sidecar computation |

Is it worth suggesting histogram buckets? e.g. [0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]
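For reference, Prometheus histogram buckets are cumulative: each bucket counts the observations less than or equal to its upper bound `le`, and an implicit `+Inf` bucket always equals the total count. A stdlib-only sketch of how observations (in seconds) would fall into the boundaries suggested above; the helper `cumulative_buckets` and the sample timings are hypothetical.

```python
import math
from typing import Dict, Iterable, List

# Upper bounds (in seconds) suggested in the comment above.
SUGGESTED_BUCKETS: List[float] = [0.1, 0.15, 0.25, 0.35, 0.5, 0.7, 1.0, 2.5, 5.0, 10.0]


def cumulative_buckets(observations: Iterable[float], bounds: List[float]) -> Dict[float, int]:
    """Hypothetical helper: Prometheus-style cumulative counts keyed by upper bound."""
    obs = list(observations)
    counts = {le: sum(1 for o in obs if o <= le) for le in bounds}
    counts[math.inf] = len(obs)  # the +Inf bucket always equals the total count
    return counts


# Usage: three hypothetical sidecar computation timings in seconds.
buckets = cumulative_buckets([0.12, 0.4, 3.0], SUGGESTED_BUCKETS)
print(buckets[0.1], buckets[0.15], buckets[0.5], buckets[5.0], buckets[math.inf])  # → 0 1 2 3 3
```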

We've also implemented a `blob_count` label, so that we can get individual timings for different blob counts.
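A label such as `blob_count` splits one histogram into a separate series per label value, so timings for blocks with few blobs are not averaged together with timings for blocks with many. A minimal sketch with a hypothetical `LabeledHistogram` class that tracks only count and sum per label value; Prometheus label values are strings, but integers are used here for simplicity.

```python
from typing import Dict, Tuple


class LabeledHistogram:
    """Hypothetical: tracks (observation count, sum of seconds) per blob_count value.

    Buckets are omitted for brevity; a real client library keeps them per series.
    """

    def __init__(self) -> None:
        self.series: Dict[int, Tuple[int, float]] = {}

    def observe(self, blob_count: int, seconds: float) -> None:
        count, total = self.series.get(blob_count, (0, 0.0))
        self.series[blob_count] = (count + 1, total + seconds)

    def mean(self, blob_count: int) -> float:
        count, total = self.series.get(blob_count, (0, 0.0))
        return total / count if count else 0.0


# Usage: hypothetical computation timings for blocks with 1 and 6 blobs.
computation = LabeledHistogram()
computation.observe(blob_count=1, seconds=0.1)
computation.observe(blob_count=6, seconds=0.6)
computation.observe(blob_count=6, seconds=0.8)
print(computation.series[6][0], round(computation.mean(6), 2))  # → 2 0.7
```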

| `beacon_data_column_sidecar_inclusion_proof_verification_milliseconds` | Histogram | Time taken to verify data column sidecar inclusion proof | On data column sidecar inclusion proof verification |
| `beacon_kzg_verification_data_column_single_milliseconds` | Histogram | Runtime of single data column kzg verification | On single data column kzg verification |
| `beacon_kzg_verification_data_column_batch_milliseconds` | Histogram | Runtime of batched data column kzg verification | On batched data column kzg verification |
| `beacon_custody_columns_count_total` | Counter | Total count of columns in custody within the data availability boundary | On custody collecting and verification |

Is there another metric for tracking the custody count?


### Additional Metrics

The following are proposed metrics to be added to clients. This list is _not_ stable and is subject to drastic changes, deletions, and additions. The additional metric list is still being
discussed, and we have yet to reach consensus. Ideally we would also discuss which of these values need to be counters, gauges or histograms.

| Name | Metric type | Usage | Sample collection event |
|----------------------------------------------|-------------|--------------------------------------------------------------------------------------|---------------------|