
NVMe metrics collector based on parallel #204

Closed

Conversation


@dobbi84 dobbi84 commented Feb 6, 2024

The parallel-based collector collects the metrics more quickly when the node has several NVMe drives. A Grafana dashboard is included that visualizes the metrics; each panel carries a description taken from the NVM Express specifications.

Because there are now four NVMe-related files, everything has been moved to a common directory.

Signed-off-by: dobbi84 <[email protected]>
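
For context, here is a minimal Python sketch of the underlying idea (query the SMART log of every NVMe controller concurrently and print Prometheus textfile lines). It is not the code in this PR, which uses GNU parallel and shell; the device glob and metric names are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Sketch: query `nvme smart-log` for several drives concurrently."""
import glob
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor


def smart_log(dev):
    # Needs root (or CAP_SYS_ADMIN): smart-log sends an NVMe admin command.
    out = subprocess.run(
        ["nvme", "smart-log", dev, "--output-format=json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return dev, json.loads(out)


def main():
    devices = sorted(glob.glob("/dev/nvme[0-9]"))  # controller character devices
    with ThreadPoolExecutor(max_workers=len(devices) or 1) as pool:
        for dev, log in pool.map(smart_log, devices):
            # Key names follow nvme-cli's JSON output; adjust for the installed version.
            print(f'nvme_percentage_used{{device="{dev}"}} {log["percent_used"]}')
            print(f'nvme_power_cycles_total{{device="{dev}"}} {log["power_cycles"]}')


if __name__ == "__main__":
    main()
```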
@dswarbrick
Member

Thanks for your contribution. However, when nvme_metrics.py was introduced (#155), it was with the intention that it would supersede and eventually replace the shell script variant. Keeping multiple variants of collectors is not a path we want to go down, as it is confusing for end users, and increases the burden on maintainers.

Besides, these collectors are typically only run at fairly infrequent intervals. For example, the Debian package of this repo ships with systemd timers that execute some collectors at a 15-minute interval. With such infrequent execution, are utmost speed and parallelism really that important?

@dobbi84
Author

dobbi84 commented Feb 13, 2024

I agree with you; I also fell into the mistake of using the bash version. I run my timers at a 1-minute interval or less, and parallel seemed a painless way of collecting quickly from multiple drives.

I noticed your effort on go-nvme, and I think an exporter that does not parse the nvme-cli output, and is possibly aligned with the NVMe spec release cycle, could be a good solution.

I can close this PR, if you agree.

@dswarbrick
Member

I noticed your effort on go-nvme, and I think an exporter that does not parse the nvme-cli output, and is possibly aligned with the NVMe spec release cycle, could be a good solution.

I split go-nvme out of another pet project of mine, https://github.com/dswarbrick/smart, but I don't really have the time to actively work on either. These projects were really just to satisfy my curiosity when I started them way back in 2017. Nowadays, I think that Rust would be a more appropriate language in which to write low-level libraries which interact with the kernel.

Implementing comprehensive support for the NVMe standard is a mammoth task - just look at https://github.com/linux-nvme/libnvme to get an idea. So, go-nvme exists really more as just a proof of concept.

If an NVMe exporter were to be developed (in any language), it would need to run as root to be able to send the NVMe admin commands to devices. Most experienced sysadmins still have nightmares about running HTTP services as root, probably due to some decades-old RCE exploit in Apache 1.3, before dropping privileges became standard practice. This is why, IMHO, parsing the (preferably JSON) output of nvme-cli, smartctl, etc. is for now the preferable solution, since it allows for privilege separation.
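
As an illustration of that privilege-separation pattern (a sketch under assumed paths and hypothetical script names, not this repository's nvme_metrics.py): the privileged side, run from cron or a systemd timer, only writes a .prom file, and the unprivileged node_exporter textfile collector then serves it.

```python
#!/usr/bin/env python3
"""Sketch: the root-run half writes metrics atomically; the exporter only reads them."""
import os
import sys
import tempfile

# Assumed path; it must match node_exporter's --collector.textfile.directory flag.
OUT_DIR = "/var/lib/node_exporter/textfile_collector"


def write_prom(lines, name="nvme.prom"):
    # Write to a temporary file and rename it into place so the (non-root)
    # node_exporter never scrapes a half-written file.
    fd, tmp = tempfile.mkstemp(dir=OUT_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.chmod(tmp, 0o644)
    os.rename(tmp, os.path.join(OUT_DIR, name))


if __name__ == "__main__":
    # e.g. pipe in the output of a privileged collector such as the sketch above:
    #   sudo ./nvme_collector.py | sudo ./write_prom.py   (hypothetical names)
    write_prom([line.rstrip("\n") for line in sys.stdin])
```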

@dobbi84
Author

dobbi84 commented Feb 15, 2024

Understood, thanks for the direction.

@igaw

igaw commented Feb 16, 2024

If an NVMe exporter were to be developed (in any language), it would need to run as root to be able to send the NVMe admin commands to devices.

The need for root to run nvme-cli has changed slightly recently. At least nvme list, nvme list-subsys, etc. no longer need root; these commands use the information from sysfs. The nvme smart-log command, though, still issues a request, and that depends on root permission. But given that this is a real use case, we could think about adding the SMART log information to sysfs as well. So I am interested in getting these use cases working without root, though that's just me. I can bring this topic up at the upcoming LSFMM conference. In that case it would be good to know which information needs to be available to non-root users.
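
For illustration, a short sketch of what is already readable without root via sysfs (attribute availability varies by kernel version, so treat this as an assumption): identification data such as model, serial and firmware revision can be read directly, whereas the SMART log still requires the admin command.

```python
#!/usr/bin/env python3
"""Sketch: read basic NVMe controller information from sysfs as an unprivileged user."""
import glob
import os

for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
    info = {}
    for attr in ("model", "serial", "firmware_rev"):
        path = os.path.join(ctrl, attr)
        if os.path.isfile(path):
            with open(path) as f:
                info[attr] = f.read().strip()
    print(os.path.basename(ctrl), info)
```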

@dobbi84
Author

dobbi84 commented Feb 19, 2024

In that case it would be good to know which information needs to be available to non-root users.

I think a good start would be:

  • smart-log: the log page most commonly parsed by existing exporters so far
  • ocp smart-add-log: as @jmhands reported, it provides "Physical media units written", which is needed to calculate WAF (write amplification factor); see the sketch after this list.
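
For reference, a minimal sketch of the WAF arithmetic those two log pages enable; the field provenance noted in the docstring is an assumption, since OCP field names vary between nvme-cli versions.

```python
def write_amplification(physical_media_units_written_bytes, data_units_written):
    """WAF = physical (NAND) bytes written / host bytes written.

    `data_units_written` comes from the standard SMART log and is counted
    in units of 1000 * 512 bytes; `physical_media_units_written_bytes` is
    assumed to come from the OCP smart-add-log page, already in bytes.
    """
    host_bytes = data_units_written * 512_000
    return physical_media_units_written_bytes / host_bytes
```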

@jmhands

jmhands commented Feb 19, 2024

Adding the standard NVMe SMART log to sysfs would be great. The temperature values are already converted to degrees C; just make sure we export Data Units Read and Data Units Written in bytes instead of the standard units of 1000 × 512 B.

@dswarbrick
Member

This discussion is moving off-topic for this PR. I recommend continuing it in the libnvme project, since this PR will not be merged.

@dswarbrick dswarbrick closed this Feb 19, 2024
@dobbi84 dobbi84 deleted the dobbi84/nvme_metrics branch March 26, 2024 09:29