From 9c89567a4231c66e2966b610e4f26743e95a140b Mon Sep 17 00:00:00 2001 From: Pierre-Emmanuel Cornilleau Date: Wed, 30 Oct 2024 17:19:34 +0100 Subject: [PATCH 1/5] monitoring: init --- docs/network/monitoring.md | 60 ++++++++++++++++++++++++++++++++++++++ sidebars.js | 1 + 2 files changed, 61 insertions(+) create mode 100644 docs/network/monitoring.md diff --git a/docs/network/monitoring.md b/docs/network/monitoring.md new file mode 100644 index 00000000..48d08549 --- /dev/null +++ b/docs/network/monitoring.md @@ -0,0 +1,60 @@ +--- +title: Monitoring Etherlink nodes +--- + +Monitoring the behavior of the EVM node can be partially achieved by exploring the logs or through metrics. + +## Metrics + +Metrics for the node are always available on the `/metrics` endpoint of the address and port of the `--rpc-addr` argument. One can query this using: + +```bash +curl http://:/metrics +``` + +You will be presented with the list of defined and computed metrics as follows: + +``` +#HELP metric description +#TYPE metric type +octez_metric_name{label_name=label_value} x.x +``` + +Each exported metric has the following form: + +``` +octez_evm_node_metric{label_name=label_value;...} value +``` + +Each metric name starts with `octez` as its namespace, followed by the a subsystem name `evm_node`, and the name of the metric. +It follows the [OpenMetrics specification](https://openmetrics.io/). + +A metric may provide labeled parameters which allow for different instances of the metric, with different label values. +For instance, the metric `octez_evm_node_calls_method` has a label named `method` which allows this metric to have one value for each kind of method. + +``` +#HELP octez_evm_node_calls_method Method call counts +#TYPE octez_evm_node_calls_method counter +octez_evm_node_calls_method{method="debug_traceTransaction"} 13594.000000 +octez_evm_node_calls_method{method="eth_blockNumber"} 2.000000 +... +``` + +Metrics provide information about the node in the form of a [gauge](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#gauge) that can increase or decrease (like the base gas price of a block), +a [counter](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#counter) that can only increase (like the number of calls to a particular method), +or a [histogram](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram) used to track the size of events and how long they usually take (e.g., the time spent processing a block). + + +The label value is sometimes used to store information that can't be described by the metric value (which can only be a float). This is used for example by the ``octez_evm_node_info`` metric that provides the current mode of the node. + + Some of the metrics are computed when scraped from the node. As there is no rate limiter, you should consider scraping wisely and adding a proxy for a public endpoint, to limit the impact on performance. + + + ------- List of metrics TBA -------------- + +### Prometheus +### Hardware metrics +## Logs +### Loki +### Adding log sinks +## Dashboards diff --git a/sidebars.js b/sidebars.js index 080d50a2..5afb21ab 100644 --- a/sidebars.js +++ b/sidebars.js @@ -66,6 +66,7 @@ const sidebars = { 'network/smart-rollup-nodes', 'network/evm-nodes', 'network/building-kernel', + 'network/monitoring', ], }, { From e625c8273d7a4a10decc0cad45556cf2574a9b32 Mon Sep 17 00:00:00 2001 From: Pierre-Emmanuel Cornilleau Date: Wed, 30 Oct 2024 17:56:34 +0100 Subject: [PATCH 2/5] monitoring: prometheus and node exporter --- docs/network/monitoring.md | 88 +++++++++++++++++++++++++++++++++++++- 1 file changed, 87 insertions(+), 1 deletion(-) diff --git a/docs/network/monitoring.md b/docs/network/monitoring.md index 48d08549..d4251cb1 100644 --- a/docs/network/monitoring.md +++ b/docs/network/monitoring.md @@ -53,7 +53,93 @@ The label value is sometimes used to store information that can't be described b ------- List of metrics TBA -------------- ### Prometheus -### Hardware metrics + +Scraping metrics gives you instant values of the metrics. For more effective monitoring, create a time series of these metrics. + +We suggest using [Prometheus](https://prometheus.io/) for that purpose. + +First, you need to install Prometheus from your favourite package manager, e.g. + +```bash +sudo apt install prometheus +``` + +When Prometheus is installed, add the scraping job to the configuration file, by default `/etc/prometheus/prometheus.yml`. + +```yaml +- job_name: 'octez-evm-node-exporter' + scrape_interval: s + metrics_path: "/metrics" + static_configs: + - targets: [':'] +``` + +By default, the EVM node exposes its JSON RPC API endpoint to `localhost:8545`. + +Note that Prometheus can also scrape metrics from several nodes. + +Prometheus is a service, so you need to start it: + +```bash +sudo systemctl start prometheus +``` + +You may visualize the log in real time to make sure the service works correctly: + +```bash +sudo journalctl --follow --unit=prometheus.service +``` + +#### History size + +By default, Prometheus keeps 15 days of data. It may be useful to increase the history window to keep wider history. To do so, update the Prometheus service execution argument (typically, in `/etc/systemd/system/multi-user.target.wants/prometheus.service`). +To increase the history to 30 days, simply update the following line: +``` + ExecStart=/usr/bin/prometheus $ARGS --storage.tsdb.retention 30d +``` +It is also possible to limit the size of the history by adding `--storage.tsdb.retention.size 5GB`. The first limit reached will trigger the cleanup. +You can also add `--storage.tsdb.path ` to change the path were Prometheus stores data. + +There are RPCs to trigger an early garbage collection of the Prometheus data but this requires you to start the Prometheus RPC server and tweak some rights. Alternatively, the easiest way to delete the whole history is to remove the output directory (typically, `/var/lib/prometheus/metrics2/`). + +Make sure to restart the Prometheus service after updating its parameters. + +#### Hardware metrics {#hardware-metrics} + +In addition to node metrics, you may want to gather other information and statistics for effective monitoring, such as hardware metrics. + +For that purpose, we suggest using [Prometheus Node Exporter](https://github.com/prometheus/node_exporter). + +To install node exporter: + +```bash +sudo apt install prometheus-node-exporter +``` + +Prometheus Node Exporter is a service, so you need to start it: + +```bash +sudo systemctl start prometheus-node-exporter +``` + +To check that everything is working you may check the metrics endpoint: + +```bash +curl localhost:9100/metrics +``` + + +Then you need to configure a scraping job on the prometheus config: + +```yaml +- job_name: 'node-exporter' + scrape_interval: s + static_configs: + - targets: [':'] +``` + +The target address of the node exporter is typically `localhost:9100`. + ## Logs ### Loki ### Adding log sinks From ddd557ffaabd852913e89a78b96bc15a40e58188 Mon Sep 17 00:00:00 2001 From: Pierre-Emmanuel Cornilleau Date: Thu, 31 Oct 2024 15:33:17 +0100 Subject: [PATCH 3/5] monitoring: table of metrics --- docs/network/monitoring.md | 60 ++++++++++++++++++++++++++++++++------ 1 file changed, 51 insertions(+), 9 deletions(-) diff --git a/docs/network/monitoring.md b/docs/network/monitoring.md index d4251cb1..7a9d37d5 100644 --- a/docs/network/monitoring.md +++ b/docs/network/monitoring.md @@ -44,15 +44,57 @@ Metrics provide information about the node in the form of a [gauge](https://gith a [counter](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#counter) that can only increase (like the number of calls to a particular method), or a [histogram](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram) used to track the size of events and how long they usually take (e.g., the time spent processing a block). - -The label value is sometimes used to store information that can't be described by the metric value (which can only be a float). This is used for example by the ``octez_evm_node_info`` metric that provides the current mode of the node. - - Some of the metrics are computed when scraped from the node. As there is no rate limiter, you should consider scraping wisely and adding a proxy for a public endpoint, to limit the impact on performance. - - - ------- List of metrics TBA -------------- - -### Prometheus +### Metrics listing + +Some of the metrics are computed when scraped from the node. As there is no rate limiter, you should consider scraping wisely and adding a proxy for a public endpoint, to limit the impact on performance. + +The metric names in the table omit the namespace and subsystem for readability, so to have the full metric name you must add `octez_evm_node_`, e.g. `octez_evm_node_boostrapping`. + +| Metric names | Type | Description | Relevant mode | +|:------------------------------:|:---------:|:----------------------------------------------------------------------------------------------------------------------:|:-------------------:| +| bootstrapping | gauge | 1.0 if the EVM node is catching up with its upstream EVM node, 0.0 otherwise | Observer, RPC | +| head | gauge | Level of the node’s head | all | +| confirmed_head | gauge | Confirmed level (smart rollup node's head, ie as registered on L1) | all | +| gas_price | gauge | Base gas price of the last block | sequencer, observer | +| info | gauge | Information about the kernel (commit hash and date), the node (mode) and the targeted rollup (smart_rollup_address) | all | +| block_process_time_histogram | histogram | The time the EVM node spent processing a block. Buckets : [0.1; 0.1; 0.5; 1.; 2.; 5.; 10.; infinity] (in seconds) | Sequencer, observer | +| time_processed | counter | Time to process blocks | Sequencer, observer | +| transactions | counter | Number of transactions in the blocks | all | +| calls | summary | RPC endpoint call counts and sum of execution times | all | +| calls_method | counter | Method call counts | all | +| tx_pool | gauge | Metrics about transaction pool content (computed when scraped) | sequencer | +| blueprint_chunks_sent_on_inbox | counter | Number of blueprint chunks sent on the shared inbox | sequencer | +| blueprint_chunks_sent_on_dal | counter | Number of blueprint chunks sent on the DAL | sequencer | +| signals_sent | counter | Number of DAL import signals sent on the inbox | sequencer, observer | +| time_waiting | counter | Time spent by a request waiting for a worker (in picosecond) | sequencer, observer | +| queue_size | gauge | Size of the execution queue of simulations | sequencer, observer | +| inconsistent_da_fees | counter | Node DA fees are inconsistent with kernel ones | sequencer\* | +| confirm_gas_needed | counter | Initially provided gas was not enough, confirmation was needed | sequencer\* | + +\* `inconsistent_da_fees` and `confirm_gas_needed` are listed for exhausitivity, but are mainly useful to the development team for testing and debugging the node side validation feature. + +Some metrics have labels that store information that can't be described by the metric value (which can only be a float). +For example, the `octez_evm_node_info` metric has a label named `mode` (among others) that provides the mode of the node (RPC, observer, proxy, sequencer). +Here is an example of the content from this metric with its labels: + +``` +octez_evm_node_info{commit_hash="99aa672c", commit_date="2024-10-29 11:53:40 +0100", mode="observer", smart_rollup_address="sr1Ghq66tYK9y3r8CC1Tf8i8m5nxh8nTvZEf"} 0.000000 +``` + +This table shows the available labels: + +| Metric | Label(s) | Description | +|:------------------------------:|:----------------------:|:---------------------------------------------------------:| +| information | commit_hash | commit hash of the node version | +| information | commit_date | commit date of the node version | +| information | mode | mode of the node (RPC, observer, sequencer, proxy) | +| information | smart_rollup_address | address of the Smart Rollup monitored | +| calls | endpoint | endpoint reached (`/`, etc.) | +| calls | method | JSON RPC method called (`etc_call`, etc.) | +| calls_method | method | JSON RPC method called (`etc_call`, etc.) | +| tx_pool | kind | Either `number_of_addresses` of `number_of_transactions` | + +### Scraping metrics with prometheus Scraping metrics gives you instant values of the metrics. For more effective monitoring, create a time series of these metrics. From bf27a25a94248418691e6c0aa07cfd00d7ea1fec Mon Sep 17 00:00:00 2001 From: Pierre-Emmanuel Cornilleau Date: Tue, 12 Nov 2024 10:22:22 +0100 Subject: [PATCH 4/5] monitoring: bootstraping metric desc --- docs/network/monitoring.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/network/monitoring.md b/docs/network/monitoring.md index 7a9d37d5..5900eaf0 100644 --- a/docs/network/monitoring.md +++ b/docs/network/monitoring.md @@ -52,7 +52,7 @@ The metric names in the table omit the namespace and subsystem for readability, | Metric names | Type | Description | Relevant mode | |:------------------------------:|:---------:|:----------------------------------------------------------------------------------------------------------------------:|:-------------------:| -| bootstrapping | gauge | 1.0 if the EVM node is catching up with its upstream EVM node, 0.0 otherwise | Observer, RPC | +| bootstrapping | gauge | 0.0 if the EVM node is caught up with its upstream EVM node or 1.0 if it is in the process of bootstrapping | Observer, RPC | | head | gauge | Level of the node’s head | all | | confirmed_head | gauge | Confirmed level (smart rollup node's head, ie as registered on L1) | all | | gas_price | gauge | Base gas price of the last block | sequencer, observer | From 8fcd13a390b178f3f6b4fb90783a33c4b44c05f8 Mon Sep 17 00:00:00 2001 From: Pierre-Emmanuel Cornilleau Date: Mon, 4 Nov 2024 15:22:59 +0100 Subject: [PATCH 5/5] monitoring: logs --- docs/network/monitoring.md | 151 ++++++++++++++++++++++++++++++++++++- 1 file changed, 148 insertions(+), 3 deletions(-) diff --git a/docs/network/monitoring.md b/docs/network/monitoring.md index 5900eaf0..3d298939 100644 --- a/docs/network/monitoring.md +++ b/docs/network/monitoring.md @@ -6,7 +6,7 @@ Monitoring the behavior of the EVM node can be partially achieved by exploring t ## Metrics -Metrics for the node are always available on the `/metrics` endpoint of the address and port of the `--rpc-addr` argument. One can query this using: +Metrics for the node are always available on the `/metrics` endpoint of the address and port of the node (by default "http://127.0.0.1:8545"). One can query this using: ```bash curl http://:/metrics @@ -183,6 +183,151 @@ Then you need to configure a scraping job on the prometheus config: The target address of the node exporter is typically `localhost:9100`. ## Logs -### Loki + +The node outputs logs in the `/daily_logs` directory for major events +(such as block production, connection to rollup node, etc) and a separate set of +logs for the inner working of the internal EVM (or, more precisely, the kernel) +in the `/kernel_logs`. + +The standards logs are output in files with a daily rotation and a retention +period of 5 days. They are limited to `INFO` and `ERROR` by default, but the +EVM node can be started with the option `--verbose` for more. + +``` +... +2024-11-01T23:59:54.999-00:00 [evm_node.dev.evm_events_follower_upstream_blueprint_applied] The rollup node kernel applied blueprint 505715 leading to creating block 0xa33824839766bea9eb16e3cfb6a3e06fd5aedae1d19e50729f1575892[...]. +2024-11-01T23:59:59.598-00:00 [evm_node.dev.blueprint_application] Applied a blueprint for level 505719 at 2024-11-01T23:59:59Z containing 0 transactions for 0 gas leading to creating block 0xe09fcee34c4f3fb7ac701e0508fadac7264cf7e595621a1d69d4dc533[...]. +... +``` + +The EVM logs are stored in different files according to the context of the EVM call: creating blocks, simulation, validation of a transaction before inclusion in the tx pool, etc. + +| File | Usage | +|:-----------------------------:|:--------------------------------------:| +| daily_logs/daily-YYYYMMDD.log | Major node events | +| kernel_logs/estimate_gas | EVM logs during gas estimation | +| kernel_logs/kernel_log | EVM logs during block production | +| kernel_logs/replay | EVM logs during replay operations | +| kernel_logs/simulate_call | EVM logs during simulation | +| kernel_logs/trace_transaction | EVM logs during tracing | +| kernel_logs/tx_validity | EVM logs during transaction validation | + +### Scraping logs with Loki + +Scraping logs makes them more easily searchable, and allows creating metrics to +be used in monitoring. + +We suggest using [Loki](https://github.com/grafana/loki) for that purpose. +Loki is a server that can be queried for logs, similar to what prometheus is +for metrics. It relies on an agent to scrape and send the logs, we suggest Promtail. +Promtail will scrape the logs and push them to the Loki instance which will make +them available to the dashboard. + +To install follow the [instructions on the grafana website](https://grafana.com/docs/loki/latest/setup/install/). +Note that we don't discuss using the `Alloy` agent, but rather `Promtail`. +To install using a package manager, see the [install locally](https://grafana.com/docs/loki/latest/setup/install/local/) +section, to add the grafana package repository. + +Install Loki on the graphana server and Promtail on the EVM server. + +Once both are installed, you need to add the scraping job to the Promtail +configuration file (e.g. `/etc/promtail/config.yaml`) and specify the target +Loki server. + +```yaml +clients: +- url: +``` + +You must also configure a parsing pipeline for the logs. +Below is an example of a pipeline suitable for the daily logs. It uses `grep` +to extract the event's name from the log, to make searching the logs easier. +Alternatively, we present [later](#adding-log-sinks) a method to produce logs in a json format, and the +appropriate Promtail pipeline. + +```yaml +scrape_configs: +- job_name: "evm-node-log-exporter" + pipeline_stages: + - regex: + expression: "^(?P