nvme_metrics: refactor metrics to better fit the verbose / nested JSON output of nvme-cli #228

dswarbrick · 2024-11-19T17:12:52Z

This PR contains breaking changes.

Change nvme_nvmecli metric type to Info, which also adds a suffix to the metric name, i.e., nvme_nvmecli_info.
Refactor nvme_device_info metric as nvme_controller_info, adding a transport label. For example this:

nvme_device_info{device="nvme0n1",firmware="41001131",model="PC711 NVMe SK hynix 1TB",serial="KNA8..."} 1.0

becomes:

nvme_controller_info{controller="nvme0",firmware="41001131",model="PC711 NVMe SK hynix 1TB",serial="KNA8...",transport="pcie"} 1.0

Fetch device global SMART log once per controller
Change device label in controller-specific metrics to controller. This also means that the label value will be the NVMe character device name, e.g. "nvme0", instead of the previously used namespace block device name, e.g. "nvme0n1".
Introduce nvme_namespace_info metric to facilitate join-relationship between namespace-specific and controller-specific metrics.

This is a breaking change, as it renames the metric nvme_nvmecli to nvme_nvmecli_info. Signed-off-by: Daniel Swarbrick <[email protected]>

This is a breaking change, as it renames the existing metric nvme_device_info to nvme_controller_info. The previous "device" label is now "controller", and takes the form of e.g. "nvme0" instead of "nvme0n1". A new label "transport" is also added to the renamed metric. Signed-off-by: Daniel Swarbrick <[email protected]>

dswarbrick · 2024-11-19T17:24:36Z

@sbates130272 Can you perhaps confirm that the nvme smart-log command is indeed controller-specific, not namespace-specific? I noticed that nvme-cli will work with either e.g. /dev/nvme0 or /dev/nvme0n1, but as far as I can tell, the data returned are controller-specific. I don't think counters such as data_units_written are tracked per namespace.

I would like to refactor a few of the metrics that we currently expose as "per-device" (which is a misnomer), and instead expose them per controller. This will unfortunately affect about 15 metrics, but I feel that it is important to get this right.

For example, this:

nvme_power_on_hours_total{device="nvme0n1"} 16170.0

would become:

nvme_power_on_hours_total{controller="nvme0"} 16170.0

... as I don't think it's possible for a namespace to be powered on for a different number of hours than another namespace on the same controller.

The metric / labels implemented by this collector hark back to the original nvme_metrics.sh collector, and were based on the wrong assumption that all NVMe devices are single-namespace - probably due to only having consumer-grade SSDs to test it on at the time.

sbates130272 · 2024-11-19T20:10:22Z

@dswarbrick I did a quick check and depending on the NVMe device it may support SMART log pages on a per namespace basis or it may not. The ability to do this on a per namespace basis is advertised via a field in the controller. Here is the help information for the nvme smart-log command.

NVME-SMART-LOG(1)                                                              NVMe Manual                                                             NVME-SMART-LOG(1)

NAME
       nvme-smart-log - Send NVMe SMART log page request, returns result and log

SYNOPSIS
       nvme smart-log <device> [--namespace-id=<nsid> | -n <nsid>]
                               [--raw-binary | -b]
                               [--output-format=<fmt> | -o <fmt>] [--verbose | -v]

DESCRIPTION
       Retrieves the NVMe SMART log page from an NVMe device and provides the returned structure.

       The <device> parameter is mandatory and may be either the NVMe character device (ex: /dev/nvme0), or a namespace block device (ex: /dev/nvme0n1).

       On success, the returned smart log structure may be returned in one of several ways depending on the option flags; the structure may parsed by the program and
       printed in a readable format or the raw buffer may be printed to stdout for another program to parse.

OPTIONS
       -n <nsid>, --namespace-id=<nsid>
           Retrieve the SMART log for the given nsid. This is optional and its success may depend on the device’s capabilities to provide this log on a per-namespace
           basis (see the NVMe Identify Controller for this capability). The default nsid to use is 0xffffffff for the device global SMART log.

       -b, --raw-binary
           Print the raw SMART log buffer to stdout.

       -o <fmt>, --output-format=<fmt>
           Set the reporting format to normal, json or binary. Only one output format can be used at a time.

       -v, --verbose
           Increase the information detail in the output.

EXAMPLES
       •   Print the SMART log page in a human readable format:

               # nvme smart-log /dev/nvme0

       •   Print the raw SMART log to a file:

I will be honest and say I have no idea what percentage of drives in the field or on the market support the smart-log information on a per namespace basis.

sbates130272 · 2024-11-19T20:25:58Z

@dswarbrick is it worth sending an email to the linux-nvme mailing list so that the people on that list can chime in on this?

dswarbrick · 2024-11-19T20:37:58Z

@sbates130272 What I am a little unclear about is whether there is any difference between passing e.g. /dev/nvme0 and /dev/nvme0n1 as the device to the nvme smart-log command. I get the impression that nvme-cli derives the controller, e.g. nvme0, regardless of whether /dev/nvme0 or /dev/nvme0n1 is specified. This is of course also assuming that the nsid is not specified with --namespace-id.

Basically the issue is whether we are making a wrong assumption by tagging such metrics with a device="nvme0n1" label, when in fact they should be tagged with controller="nvme0".

In other words, I think it's sufficient for the collector to only fetch the global SMART log for now, but that we should reflect that in the labeling convention. Tagging the SMART metrics with device="nvme0n1" gives the impression that they are specific to that namespace, whereas nvme smart-log /dev/nvme0n1 is to my understanding not the same as nvme smart-log /dev/nvme0 --namespace-id 1.

sbates130272 · 2024-11-20T16:56:22Z

@dswarbrick I took a look at the latest version of the NVMe specification and the relevant code in nvme-cli and it looks like nvme-cli will return a controller-based SMART page unless the --namespace-id is set. In which case it will return a namespace-based SMART log (or an error if the device does not support namespace-based SMART log pages). So I concur with your assumption that a controller-based SMART log page is always returned unless --namespace-id is set.

dswarbrick · 2024-11-21T05:56:42Z

@sbates130272 Thanks for researching that. So I think that, as I alluded to with the FIXME in the current code, it would make sense to just fetch the global SMART log once per controller, i.e., without specifying a namespace, and change the device label to controller. If there is a strong need in future, we could fetch it per namespace, with the addition of a nsid label in the metric. The key thing is clear up the misunderstanding with the e.g. device="nvme0n1" labels.

Change "device" label in controller-specific metrics to "controller". This also means that the label value will be the NVMe character device name, e.g. "nvme0", instead of the previously used namespace block device name, e.g. "nvme0n1". Separate metric declaration dict into controller-specific and namespace-specific groups for easier maintenance. Signed-off-by: Daniel Swarbrick <[email protected]>

Introduce info metric to facilitate join-relationship between namespace-specific and controller-specific metrics. e.g. nvme_namespace_info{controller="nvme0",namepace="nvme0n1",nsid="1"} Signed-off-by: Daniel Swarbrick <[email protected]>

dswarbrick · 2024-11-21T06:41:50Z

@sbates130272 How does this PR look to you now?

Use "device" rather than "namespace" label so that it matches the other namespace-specific metrics. Signed-off-by: Daniel Swarbrick <[email protected]>

sbates130272

Just a few nits. Thanks for doing this!

nvme_metrics.py

Signed-off-by: Daniel Swarbrick <[email protected]>

dswarbrick added 2 commits November 19, 2024 17:57

nvme_metrics: change nvmecli metric type to Info

e016bbb

This is a breaking change, as it renames the metric nvme_nvmecli to nvme_nvmecli_info. Signed-off-by: Daniel Swarbrick <[email protected]>

dswarbrick added 2 commits November 21, 2024 07:36

dswarbrick force-pushed the nvme-refactor branch from 38a62bf to 895fbb1 Compare November 21, 2024 06:36

dswarbrick marked this pull request as ready for review November 21, 2024 06:41

nvme_metrics: use device label in namespace_info metric

75f33a7

Use "device" rather than "namespace" label so that it matches the other namespace-specific metrics. Signed-off-by: Daniel Swarbrick <[email protected]>

sbates130272 approved these changes Nov 21, 2024 •

edited

Loading

View reviewed changes

sbates130272 suggested changes Nov 21, 2024

View reviewed changes

nvme_metrics.py Show resolved Hide resolved

nvme_metrics.py Outdated Show resolved Hide resolved

nvme_metrics.py Outdated Show resolved Hide resolved

sbates130272 mentioned this pull request Nov 21, 2024

Field descriptors in OCP SMART Log Page are different to those in standard SMART Log Page linux-nvme/nvme-cli#2577

Closed

nvme_metrics: adjust wording of comments

6df15e5

Signed-off-by: Daniel Swarbrick <[email protected]>

dswarbrick requested a review from SuperQ November 21, 2024 17:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nvme_metrics: refactor metrics to better fit the verbose / nested JSON output of nvme-cli #228

nvme_metrics: refactor metrics to better fit the verbose / nested JSON output of nvme-cli #228

dswarbrick commented Nov 19, 2024 •

edited

Loading

dswarbrick commented Nov 19, 2024

sbates130272 commented Nov 19, 2024 •

edited

Loading

sbates130272 commented Nov 19, 2024

dswarbrick commented Nov 19, 2024 •

edited

Loading

sbates130272 commented Nov 20, 2024 •

edited

Loading

dswarbrick commented Nov 21, 2024

dswarbrick commented Nov 21, 2024

sbates130272 left a comment

nvme_metrics: refactor metrics to better fit the verbose / nested JSON output of nvme-cli #228

Are you sure you want to change the base?

nvme_metrics: refactor metrics to better fit the verbose / nested JSON output of nvme-cli #228

Conversation

dswarbrick commented Nov 19, 2024 • edited Loading

dswarbrick commented Nov 19, 2024

sbates130272 commented Nov 19, 2024 • edited Loading

sbates130272 commented Nov 19, 2024

dswarbrick commented Nov 19, 2024 • edited Loading

sbates130272 commented Nov 20, 2024 • edited Loading

dswarbrick commented Nov 21, 2024

dswarbrick commented Nov 21, 2024

sbates130272 left a comment

Choose a reason for hiding this comment

dswarbrick commented Nov 19, 2024 •

edited

Loading

sbates130272 commented Nov 19, 2024 •

edited

Loading

dswarbrick commented Nov 19, 2024 •

edited

Loading

sbates130272 commented Nov 20, 2024 •

edited

Loading