Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fire SmartHealthStatusFail only for physical devices #378

Closed
wants to merge 3 commits into from

Conversation

Deezzir
Copy link
Contributor

@Deezzir Deezzir commented Dec 14, 2024

Found a possible solution for the issue. The smartctl exporter has the following metrics:

smartctl_device_block_size{blocks_type="physical",device="sda"} 0
smartctl_device_block_size{blocks_type="logical",device="sda"} 512

We can check if the physical size of the drive is 0, meaning it's a logical device, and ignore

smartctl_device_smart_status{device="sda"} == 0

Some context:

The exporter doesn't distinguish between logical and physical block devices and exports metrics for anything listed by smartctl --scan

And there is no concrete way to determine that through the metrics.

Closes: #357

jneo8
jneo8 previously approved these changes Dec 16, 2024
Copy link
Contributor

@jneo8 jneo8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice found

@@ -33,7 +33,7 @@ groups:
LABELS = {{ $labels }}

- alert: SmartHealthStatusFail
expr: smartctl_device_smart_status == 0
expr: (smartctl_device_smart_status == 0) and on(device, juju_unit) (smartctl_device_block_size{blocks_type="physical"} != 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking):

It will be more nice if you can provide some information as comment here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I think there is an incorrect assumption here re: "physical" versus "logical" block size.

I tested this on my local machine. I have an NVME, and it does show up in SMART. It doesn't return a physical block size, but only a logical block size, via smartctl.

There is also the concept of logical block sizes on disks.

That is, this likely does not have to do with whether a device is a "real" physical device or a simulated device like a VM disk, but rather with those technical disk parameters.

I installed prometheus on my laptop and did a quick check against its NVME; this is what I see:

paul-goins@pdg-thinkpad:~/code/hardware-observer-operator$ curl --silent http://localhost:9633/metrics | grep block_size
# HELP smartctl_device_block_size Device block size
# TYPE smartctl_device_block_size gauge
smartctl_device_block_size{blocks_type="logical",device="nvme0"} 512
smartctl_device_block_size{blocks_type="physical",device="nvme0"} 0

As written, this check will miss any disks - physical disks - which do not report a physical block size but only a logical one.

Also if you're curious, I poked around the sources of the prometheus plugin and extracted what appears to be the smartctl command it runs to pull the block sizes. This is what I get:

paul-goins@pdg-thinkpad:~/code/hardware-observer-operator$ sudo smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/nvme0 | grep block_size
  "logical_block_size": 512,

You could try this on your own machine if curious.


TLDR: Unfortunately -1; I think we need a different methodology here.

@@ -33,7 +33,7 @@ groups:
LABELS = {{ $labels }}

- alert: SmartHealthStatusFail
expr: smartctl_device_smart_status == 0
expr: (smartctl_device_smart_status == 0) and on(device, juju_unit) (smartctl_device_block_size{blocks_type="physical"} != 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it necessary to filter by blocks_type="physical" ? Can't we assume if it has size 0, we should not check it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that way it's more explicit

Copy link
Member

@gabrielcocenza gabrielcocenza Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if you understood my question.

I don't know exactly how many block_types that exists on smartctl, let's say that we have:

  • logical
  • physical
  • foo
  • bar
  • ...

With this query we are just joining when physical. Does it make sense to trigger the alert, for example, if the block type is logical with size 0 and status 0 ?

aieri
aieri previously approved these changes Dec 16, 2024
Copy link
Contributor

@aieri aieri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while I can't tell if it's always going to be true that logical devices report a physical size of 0, this seems a nice improvement

@@ -33,7 +33,7 @@ groups:
LABELS = {{ $labels }}

- alert: SmartHealthStatusFail
expr: smartctl_device_smart_status == 0
expr: (smartctl_device_smart_status == 0) and on(device, juju_unit) (smartctl_device_block_size{blocks_type="physical"} != 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

@Deezzir Deezzir dismissed stale reviews from aieri and jneo8 via f292486 December 16, 2024 23:59
@Deezzir Deezzir requested review from jneo8 and aieri December 16, 2024 23:59
Copy link

@Vultaire Vultaire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this check will unintentionally avoid alerting for any physical disks which only report a logical block size - like the physical NVMe, which does have SMART support, which is within my laptop. Thus, I'm -1 on this.

@@ -33,7 +33,7 @@ groups:
LABELS = {{ $labels }}

- alert: SmartHealthStatusFail
expr: smartctl_device_smart_status == 0
expr: (smartctl_device_smart_status == 0) and on(device, juju_unit) (smartctl_device_block_size{blocks_type="physical"} != 0)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I think there is an incorrect assumption here re: "physical" versus "logical" block size.

I tested this on my local machine. I have an NVME, and it does show up in SMART. It doesn't return a physical block size, but only a logical block size, via smartctl.

There is also the concept of logical block sizes on disks.

That is, this likely does not have to do with whether a device is a "real" physical device or a simulated device like a VM disk, but rather with those technical disk parameters.

I installed prometheus on my laptop and did a quick check against its NVME; this is what I see:

paul-goins@pdg-thinkpad:~/code/hardware-observer-operator$ curl --silent http://localhost:9633/metrics | grep block_size
# HELP smartctl_device_block_size Device block size
# TYPE smartctl_device_block_size gauge
smartctl_device_block_size{blocks_type="logical",device="nvme0"} 512
smartctl_device_block_size{blocks_type="physical",device="nvme0"} 0

As written, this check will miss any disks - physical disks - which do not report a physical block size but only a logical one.

Also if you're curious, I poked around the sources of the prometheus plugin and extracted what appears to be the smartctl command it runs to pull the block sizes. This is what I get:

paul-goins@pdg-thinkpad:~/code/hardware-observer-operator$ sudo smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/nvme0 | grep block_size
  "logical_block_size": 512,

You could try this on your own machine if curious.


TLDR: Unfortunately -1; I think we need a different methodology here.

@aieri
Copy link
Contributor

aieri commented Dec 17, 2024

Thanks for catching this @Vultaire.
Here's a different approach proposal: smartctl offers a field "smart_support", which should in theory only be set to true if the device is actually a physical one (obviously a logical one cannot support smart, unless it's lying)

On my laptop, as an example:

# local nvme from my laptop
sudo smartctl -a /dev/nvme0n1 -j | jq '.smart_support.available'
true

I don't have a machine at hand with a hardware raid, I would hope it returns false. If that's true, this bug could be turned into an upstream smartctl-exporter bug: metrics shouldn't be produced for a device that doesn't support smart.

@Deezzir
Copy link
Contributor Author

Deezzir commented Dec 17, 2024

Thanks for catching this @Vultaire. Here's a different approach proposal: smartctl offers a field "smart_support", which should in theory only be set to true if the device is actually a physical one (obviously a logical one cannot support smart, unless it's lying)

On my laptop, as an example:

# local nvme from my laptopsmartctl_device_smart_status
sudo smartctl -a /dev/nvme0n1 -j | jq '.smart_support.available'
true

I don't have a machine at hand with a hardware raid, I would hope it returns false. If that's true, this bug could be turned into an upstream smartctl-exporter bug: metrics shouldn't be produced for a device that doesn't support smart.

No, it does not have such a metric exported. A potential candidate for upstream contribution.

A more dirty solution would be to check the firmware version of the device using:

smartctl_device{ata_additional_product_id="unknown",ata_version="",device="sda",firmware_version="",form_factor="",interface="scsi",model_family="unknown",model_name="FTS PRAID EP400i",protocol="SCSI",sata_version="",scsi_product="PRAID EP400i",scsi_revision="4.68",scsi_vendor="FTS",scsi_version="SPC-3",serial_number="00059a7e0b9af9bb26a0df4e02570003"} 1

So far, I can see that the logical devices like HW Raids does not the field set(empty)

sudo smartctl -a /dev/sda -j | jq '.smart_support.available'
null

@aieri

@Deezzir
Copy link
Contributor Author

Deezzir commented Dec 17, 2024

But I would rather go for upstream contribution

@aieri
Copy link
Contributor

aieri commented Dec 17, 2024

btw I'm not suggesting smartctl-exporter exposes a new metrics for smart_support, I think it should simply filter out from smartctl --scan any block device that doesn't offer smart support

@Deezzir
Copy link
Contributor Author

Deezzir commented Dec 17, 2024

But that would require additional logic for hwo to detect and filter out the devices

@aieri
Copy link
Contributor

aieri commented Dec 17, 2024

But that would require additional logic for hwo to detect and filter out the devices

Yes, in fact I think it should be done directly at the exporter level. Why export metrics for a logical device that doesn't offer any smart metric?

@jneo8
Copy link
Contributor

jneo8 commented Dec 17, 2024

Yes, in fact I think it should be done directly at the exporter level. Why export metrics for a logical device that doesn't offer any smart metric?

Agree.

@Deezzir
Copy link
Contributor Author

Deezzir commented Dec 20, 2024

Will close the PR as the fix requires upstream change: prometheus-community/smartctl_exporter#260

@Deezzir Deezzir closed this Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Smart health related checks are executed against hardware raids
5 participants