Fix SmartAttributeWarning alert #375

Deezzir · 2024-12-13T03:26:15Z

Only fire the alert for series where attribute_value_type="raw" is set. The issue was that each attribute is exported as several series, which also include value, worst and thresh:


smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="raw", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}
0
smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="thresh", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}
10
smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="value", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}
100
smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="worst", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}

Coming from the smartctl CLI:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
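
To illustrate why the filter matters, here is a minimal Python sketch (not the alert rule itself). It assumes the alert compares the metric against zero and applies that check to the four series for attribute 5, using the numbers from the smartctl row above. The `fires` helper and the dict layout are made up for this example.

```python
# Four series exported for attribute 5 on sda, with the values from the
# smartctl row above (VALUE=100, WORST=100, THRESH=010, RAW_VALUE=0).
samples = [
    {"attribute_value_type": "raw", "value": 0},
    {"attribute_value_type": "thresh", "value": 10},
    {"attribute_value_type": "value", "value": 100},
    {"attribute_value_type": "worst", "value": 100},
]


def fires(series, raw_only):
    """Return the value types for which a naive `> 0` check would fire."""
    return [
        s["attribute_value_type"]
        for s in series
        if s["value"] > 0
        and (not raw_only or s["attribute_value_type"] == "raw")
    ]


# Without the filter, the normalized series trip the check on a healthy drive.
unfiltered = fires(samples, raw_only=False)
# With the filter, only the raw error count is considered.
filtered = fires(samples, raw_only=True)
print(unfiltered, filtered)
```

On this healthy drive the unfiltered check matches the thresh, value and worst series, while the raw-only check matches nothing.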

Fixes: #358

gabrielcocenza

My 2c:

This is definitely an improvement; it does not make sense to use any value other than the raw one.

However, I have the impression that this can still be noisy. According to the article, there are quite a lot of false positives. I think we can be stricter and change the query to trigger a warning only if more than one SMART attribute is reporting errors on a single device.
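
A rough Python sketch of that stricter condition, assuming "more than one" means at least two distinct attributes with a non-zero raw value on the same device. The readings and the `devices_to_warn` helper are hypothetical and only show the grouping logic, not an actual query.

```python
from collections import defaultdict

# Hypothetical raw readings: (device, attribute_id, raw_value).
readings = [
    ("sda", "5", 0),
    ("sda", "187", 3),
    ("sdb", "5", 2),
    ("sdb", "197", 1),
    ("sdb", "198", 0),
]


def devices_to_warn(rows, min_attributes=2):
    """Return devices with at least `min_attributes` non-zero raw attributes."""
    nonzero = defaultdict(set)
    for device, attribute_id, raw_value in rows:
        if raw_value > 0:
            nonzero[device].add(attribute_id)
    return sorted(
        device for device, attrs in nonzero.items() if len(attrs) >= min_attributes
    )


# sda has a single non-zero attribute, sdb has two.
warned = devices_to_warn(readings)
print(warned)
```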

Moreover, as the article says:

> The SMART stats we track, with the exception of SMART 197, are cumulative in nature, meaning we need to consider the time period over which the errors were reported.

I think we need to alert on a rate, such as more than one error in the last month, because if the counter is cumulative, a single error will keep the alert firing until the count is reset.
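
The difference can be sketched in Python with made-up data: a cumulative counter that records one error on day 3 and then stays flat. The 7-day window and both helper functions are assumptions chosen for illustration.

```python
# Hypothetical daily raw values of a cumulative counter: one error on day 3.
daily_raw = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
WINDOW_DAYS = 7  # assumed look-back window


def fires_on_total(values):
    """Alert whenever the cumulative count is non-zero."""
    return [v > 0 for v in values]


def fires_on_increase(values, window):
    """Alert only if the count grew within the last `window` days."""
    result = []
    for day, current in enumerate(values):
        # Compare against the value `window` days ago (or the first sample).
        baseline = values[max(0, day - window)]
        result.append(current - baseline > 0)
    return result


total_alert = fires_on_total(daily_raw)
window_alert = fires_on_increase(daily_raw, WINDOW_DAYS)
print(sum(total_alert), sum(window_alert))
```

The check on the total keeps firing for every day after the error, while the windowed check clears once the error falls out of the look-back window.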

Deezzir · 2024-12-13T18:48:46Z

I was also thinking about increasing the window, but what would be a good value? Also, these errors are different in nature, so some of them might be more critical than others. I generally think that we should split them.

Ideally, the alert should be more complex. In my experience, the error counts, when plotted, look like an exponential curve: when a drive actually starts to fail, the count rises gradually at first, and then the errors jump more and more frequently as the drive gets closer to death.

If we could do an alert that would check the change in error counts over some period of time, that would be better. But I am not sure if such tooling is available.

Deezzir · 2024-12-13T19:00:54Z

One possible improvement is to use the thresh values and check against them.
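
As a sketch of what such a check could look like: under the usual SMART convention, an attribute is considered failed when its normalized value is less than or equal to its threshold, so the comparison uses the value series rather than the raw one. The `below_threshold` helper is hypothetical.

```python
def below_threshold(value, thresh):
    """True when the normalized value has dropped to or below the threshold."""
    return value <= thresh


# Reallocated_Sector_Ct on sda from the description: VALUE=100, THRESH=010.
healthy = below_threshold(value=100, thresh=10)
# Hypothetical degraded drive whose normalized value has dropped to 8.
degraded = below_threshold(value=8, thresh=10)
print(healthy, degraded)
```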

gabrielcocenza · 2024-12-13T19:12:17Z

> I was also thinking about increasing the window, but what would be a good value?

I would say that calculating the rate over a week seems reasonable.

> If we could do an alert that would check the change in error counts over some period of time, that would be better. But I am not sure if such tooling is available.

What do you mean by that? If we have the count, we can calculate the rate in Prometheus, e.g.:

rate(smartctl_device_attribute{attribute_value_type="raw", attribute_id=~"5|187|188|197|198"}[1w]) * 86400

This query gives you the average number of errors per day over the past week.
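
The arithmetic behind the query, as a simplified Python sketch: rate() yields a per-second average over the window, and multiplying by 86400 (seconds per day) converts it to a per-day figure. The sample numbers are hypothetical, and the sketch ignores the counter-reset handling and window-edge extrapolation that Prometheus applies.

```python
SECONDS_PER_DAY = 86400
WEEK_SECONDS = 7 * SECONDS_PER_DAY


def errors_per_day(first, last, elapsed_seconds):
    """Average daily increase between two samples of a cumulative counter."""
    per_second = (last - first) / elapsed_seconds  # what rate() approximates
    return per_second * SECONDS_PER_DAY            # the `* 86400` in the query


# Hypothetical: the raw count went from 2 to 16 over exactly one week.
avg = errors_per_day(first=2, last=16, elapsed_seconds=WEEK_SECONDS)
print(avg)
```

With these numbers, 14 new errors over 7 days average out to about 2 per day.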

> One possible improvement is to use the thresh values and check against them

But aren't those cumulative counters?

aieri

Given that this is only a warning, I think it's ok to look at the raw value.
However, I would not use it for criticals, because the value is not normalized and thus its severity cannot be immediately derived.

Fix SmartAttributeWarning alert

2ec2541

Deezzir requested a review from a team as a code owner December 13, 2024 03:26

Deezzir requested review from Vultaire, Pjack, aieri, samuelallan72, jneo8, gabrielcocenza and sbparke December 13, 2024 03:26

Deezzir force-pushed the SOLENG-947 branch from 7a81aa5 to 57eedfe Compare December 13, 2024 03:30

Fix alert unit test

df76d77

Deezzir force-pushed the SOLENG-947 branch from 57eedfe to df76d77 Compare December 13, 2024 03:32

gabrielcocenza reviewed Dec 13, 2024

aieri approved these changes Dec 13, 2024

gabrielcocenza approved these changes Dec 13, 2024

Deezzir merged commit d7d6c32 into canonical:main Dec 13, 2024
10 checks passed

Deezzir deleted the SOLENG-947 branch December 13, 2024 19:29
