-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nvme_metrics: refactor metrics to better fit the verbose / nested JSON output of nvme-cli #228
base: master
Are you sure you want to change the base?
Conversation
This is a breaking change, as it renames the metric nvme_nvmecli to nvme_nvmecli_info. Signed-off-by: Daniel Swarbrick <[email protected]>
This is a breaking change, as it renames the existing metric nvme_device_info to nvme_controller_info. The previous "device" label is now "controller", and takes the form of e.g. "nvme0" instead of "nvme0n1". A new label "transport" is also added to the renamed metric. Signed-off-by: Daniel Swarbrick <[email protected]>
@sbates130272 Can you perhaps confirm that the I would like to refactor a few of the metrics that we currently expose as "per-device" (which is a misnomer), and instead expose them per controller. This will unfortunately affect about 15 metrics, but I feel that it is important to get this right. For example, this:
would become:
... as I don't think it's possible for a namespace to be powered on for a different number of hours than another namespace on the same controller. The metric / labels implemented by this collector hark back to the original nvme_metrics.sh collector, and were based on the wrong assumption that all NVMe devices are single-namespace - probably due to only having consumer-grade SSDs to test it on at the time. |
@dswarbrick I did a quick check and depending on the NVMe device it may support SMART log pages on a per namespace basis or it may not. The ability to do this on a per namespace basis is advertised via a field in the controller. Here is the help information for the
I will be honest and say I have no idea what percentage of drives in the field or on the market support the smart-log information on a per namespace basis. |
@dswarbrick is it worth sending an email to the linux-nvme mailing list so that the people on that list can chime in on this? |
@sbates130272 What I am a little unclear about is whether there is any difference between passing e.g. Basically the issue is whether we are making a wrong assumption by tagging such metrics with a In other words, I think it's sufficient for the collector to only fetch the global SMART log for now, but that we should reflect that in the labeling convention. Tagging the SMART metrics with |
@dswarbrick I took a look at the latest version of the NVMe specification and the relevant code in nvme-cli and it looks like nvme-cli will return a controller-based SMART page unless the |
@sbates130272 Thanks for researching that. So I think that, as I alluded to with the FIXME in the current code, it would make sense to just fetch the global SMART log once per controller, i.e., without specifying a namespace, and change the |
Change "device" label in controller-specific metrics to "controller". This also means that the label value will be the NVMe character device name, e.g. "nvme0", instead of the previously used namespace block device name, e.g. "nvme0n1". Separate metric declaration dict into controller-specific and namespace-specific groups for easier maintenance. Signed-off-by: Daniel Swarbrick <[email protected]>
Introduce info metric to facilitate join-relationship between namespace-specific and controller-specific metrics. e.g. nvme_namespace_info{controller="nvme0",namepace="nvme0n1",nsid="1"} Signed-off-by: Daniel Swarbrick <[email protected]>
38a62bf
to
895fbb1
Compare
@sbates130272 How does this PR look to you now? |
Use "device" rather than "namespace" label so that it matches the other namespace-specific metrics. Signed-off-by: Daniel Swarbrick <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few nits. Thanks for doing this!
Signed-off-by: Daniel Swarbrick <[email protected]>
This PR contains breaking changes.
Change
nvme_nvmecli
metric type to Info, which also adds a suffix to the metric name, i.e.,nvme_nvmecli_info
.Refactor
nvme_device_info
metric asnvme_controller_info
, adding atransport
label. For example this:becomes:
Fetch device global SMART log once per controller
Change
device
label in controller-specific metrics tocontroller
. This also means that the label value will be the NVMe character device name, e.g. "nvme0", instead of the previously used namespace block device name, e.g. "nvme0n1".Introduce
nvme_namespace_info
metric to facilitate join-relationship between namespace-specific and controller-specific metrics.