Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvme_metrics: refactor metrics to better fit the verbose / nested JSON output of nvme-cli #228

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

dswarbrick
Copy link
Member

@dswarbrick dswarbrick commented Nov 19, 2024

This PR contains breaking changes.

  • Change nvme_nvmecli metric type to Info, which also adds a suffix to the metric name, i.e., nvme_nvmecli_info.

  • Refactor nvme_device_info metric as nvme_controller_info, adding a transport label. For example this:

nvme_device_info{device="nvme0n1",firmware="41001131",model="PC711 NVMe SK hynix 1TB",serial="KNA8..."} 1.0

becomes:

nvme_controller_info{controller="nvme0",firmware="41001131",model="PC711 NVMe SK hynix 1TB",serial="KNA8...",transport="pcie"} 1.0
  • Fetch device global SMART log once per controller
    Change device label in controller-specific metrics to controller. This also means that the label value will be the NVMe character device name, e.g. "nvme0", instead of the previously used namespace block device name, e.g. "nvme0n1".

  • Introduce nvme_namespace_info metric to facilitate join-relationship between namespace-specific and controller-specific metrics.

This is a breaking change, as it renames the metric nvme_nvmecli to
nvme_nvmecli_info.

Signed-off-by: Daniel Swarbrick <[email protected]>
This is a breaking change, as it renames the existing metric
nvme_device_info to nvme_controller_info. The previous "device" label is
now "controller", and takes the form of e.g. "nvme0" instead of
"nvme0n1".

A new label "transport" is also added to the renamed metric.

Signed-off-by: Daniel Swarbrick <[email protected]>
@dswarbrick
Copy link
Member Author

@sbates130272 Can you perhaps confirm that the nvme smart-log command is indeed controller-specific, not namespace-specific? I noticed that nvme-cli will work with either e.g. /dev/nvme0 or /dev/nvme0n1, but as far as I can tell, the data returned are controller-specific. I don't think counters such as data_units_written are tracked per namespace.

I would like to refactor a few of the metrics that we currently expose as "per-device" (which is a misnomer), and instead expose them per controller. This will unfortunately affect about 15 metrics, but I feel that it is important to get this right.

For example, this:

nvme_power_on_hours_total{device="nvme0n1"} 16170.0

would become:

nvme_power_on_hours_total{controller="nvme0"} 16170.0

... as I don't think it's possible for a namespace to be powered on for a different number of hours than another namespace on the same controller.

The metric / labels implemented by this collector hark back to the original nvme_metrics.sh collector, and were based on the wrong assumption that all NVMe devices are single-namespace - probably due to only having consumer-grade SSDs to test it on at the time.

@sbates130272
Copy link

sbates130272 commented Nov 19, 2024

@dswarbrick I did a quick check and depending on the NVMe device it may support SMART log pages on a per namespace basis or it may not. The ability to do this on a per namespace basis is advertised via a field in the controller. Here is the help information for the nvme smart-log command.

NVME-SMART-LOG(1)                                                              NVMe Manual                                                             NVME-SMART-LOG(1)

NAME
       nvme-smart-log - Send NVMe SMART log page request, returns result and log

SYNOPSIS
       nvme smart-log <device> [--namespace-id=<nsid> | -n <nsid>]
                               [--raw-binary | -b]
                               [--output-format=<fmt> | -o <fmt>] [--verbose | -v]

DESCRIPTION
       Retrieves the NVMe SMART log page from an NVMe device and provides the returned structure.

       The <device> parameter is mandatory and may be either the NVMe character device (ex: /dev/nvme0), or a namespace block device (ex: /dev/nvme0n1).

       On success, the returned smart log structure may be returned in one of several ways depending on the option flags; the structure may parsed by the program and
       printed in a readable format or the raw buffer may be printed to stdout for another program to parse.

OPTIONS
       -n <nsid>, --namespace-id=<nsid>
           Retrieve the SMART log for the given nsid. This is optional and its success may depend on the device’s capabilities to provide this log on a per-namespace
           basis (see the NVMe Identify Controller for this capability). The default nsid to use is 0xffffffff for the device global SMART log.

       -b, --raw-binary
           Print the raw SMART log buffer to stdout.

       -o <fmt>, --output-format=<fmt>
           Set the reporting format to normal, json or binary. Only one output format can be used at a time.

       -v, --verbose
           Increase the information detail in the output.

EXAMPLES
       •   Print the SMART log page in a human readable format:

               # nvme smart-log /dev/nvme0

       •   Print the raw SMART log to a file:

I will be honest and say I have no idea what percentage of drives in the field or on the market support the smart-log information on a per namespace basis.

@sbates130272
Copy link

@dswarbrick is it worth sending an email to the linux-nvme mailing list so that the people on that list can chime in on this?

@dswarbrick
Copy link
Member Author

dswarbrick commented Nov 19, 2024

@sbates130272 What I am a little unclear about is whether there is any difference between passing e.g. /dev/nvme0 and /dev/nvme0n1 as the device to the nvme smart-log command. I get the impression that nvme-cli derives the controller, e.g. nvme0, regardless of whether /dev/nvme0 or /dev/nvme0n1 is specified. This is of course also assuming that the nsid is not specified with --namespace-id.

Basically the issue is whether we are making a wrong assumption by tagging such metrics with a device="nvme0n1" label, when in fact they should be tagged with controller="nvme0".

In other words, I think it's sufficient for the collector to only fetch the global SMART log for now, but that we should reflect that in the labeling convention. Tagging the SMART metrics with device="nvme0n1" gives the impression that they are specific to that namespace, whereas nvme smart-log /dev/nvme0n1 is to my understanding not the same as nvme smart-log /dev/nvme0 --namespace-id 1.

@sbates130272
Copy link

sbates130272 commented Nov 20, 2024

@dswarbrick I took a look at the latest version of the NVMe specification and the relevant code in nvme-cli and it looks like nvme-cli will return a controller-based SMART page unless the --namespace-id is set. In which case it will return a namespace-based SMART log (or an error if the device does not support namespace-based SMART log pages). So I concur with your assumption that a controller-based SMART log page is always returned unless --namespace-id is set.

@dswarbrick
Copy link
Member Author

@sbates130272 Thanks for researching that. So I think that, as I alluded to with the FIXME in the current code, it would make sense to just fetch the global SMART log once per controller, i.e., without specifying a namespace, and change the device label to controller. If there is a strong need in future, we could fetch it per namespace, with the addition of a nsid label in the metric. The key thing is clear up the misunderstanding with the e.g. device="nvme0n1" labels.

Change "device" label in controller-specific metrics to "controller".
This also means that the label value will be the NVMe character device
name, e.g. "nvme0", instead of the previously used namespace block
device name, e.g. "nvme0n1".

Separate metric declaration dict into controller-specific and
namespace-specific groups for easier maintenance.

Signed-off-by: Daniel Swarbrick <[email protected]>
Introduce info metric to facilitate join-relationship between
namespace-specific and controller-specific metrics.

e.g. nvme_namespace_info{controller="nvme0",namepace="nvme0n1",nsid="1"}

Signed-off-by: Daniel Swarbrick <[email protected]>
@dswarbrick dswarbrick marked this pull request as ready for review November 21, 2024 06:41
@dswarbrick
Copy link
Member Author

@sbates130272 How does this PR look to you now?

Use "device" rather than "namespace" label so that it matches the other
namespace-specific metrics.

Signed-off-by: Daniel Swarbrick <[email protected]>
sbates130272
sbates130272 approved these changes Nov 21, 2024
Copy link

@sbates130272 sbates130272 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nits. Thanks for doing this!

nvme_metrics.py Show resolved Hide resolved
nvme_metrics.py Outdated Show resolved Hide resolved
nvme_metrics.py Outdated Show resolved Hide resolved
@dswarbrick dswarbrick requested a review from SuperQ November 21, 2024 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants