Fix SmartAttributeWarning alert #375

Merged
merged 2 commits into from
Dec 13, 2024


Deezzir
Contributor

@Deezzir Deezzir commented Dec 13, 2024

Only fire the alert for series where attribute_value_type="raw". The problem was that each attribute is exported as several series, one each for raw, value, worst, and thresh:


smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="raw", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}
0
smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="thresh", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}
10
smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="value", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}
100
smartctl_device_attribute{attribute_flags_long="prefailure,updated_online,event_count,auto_keep", attribute_flags_short="PO--CK", attribute_id="5", attribute_name="Reallocated_Sector_Ct", attribute_value_type="worst", device="sda", instance="localhost:10201", job="hardware-observer_2_default", juju_application="hardware-observer", juju_model="hw-obs", juju_model_uuid="80a2c1d6-1a90-4ef0-8132-af698cf34f60", juju_unit="hardware-observer/1"}

Coming from the smartctl CLI:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
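
As a rough illustration of the change described above, restricting the selector to the raw series might look like the sketch below. The attribute ID list is borrowed from the query quoted later in this thread, and the `> 0` comparison is an assumption made for illustration; the rule actually merged in this PR may be written differently.

```promql
# Sketch only: match just the raw series, so that the normalized
# "value" and "thresh" series (100 and 10 in the output above)
# cannot fire the alert on their own.
smartctl_device_attribute{
  attribute_value_type="raw",
  attribute_id=~"5|187|188|197|198"
} > 0
```

Without the `attribute_value_type` matcher, the same comparison would be true for the `value` and `thresh` series of a perfectly healthy drive.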

Fixes: #358

Member

@gabrielcocenza gabrielcocenza left a comment


My 2c:

This is definitely an improvement; it does not make sense to use any value other than the raw one.

However, I have the impression that this can still be noisy. According to the article, there are quite a lot of false positives. I think we can be stricter and change the query to trigger a warning only if more than one SMART attribute reports errors on a single device.
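
A minimal sketch of that stricter condition, assuming the label names shown in the metric output earlier in this thread and the attribute IDs used in the query later in it. It is an illustration, not the rule that was merged.

```promql
# Sketch only: warn when more than one tracked SMART attribute has a
# non-zero raw value on the same device.
count by (instance, device) (
  smartctl_device_attribute{
    attribute_value_type="raw",
    attribute_id=~"5|187|188|197|198"
  } > 0
) > 1
```

The inner comparison keeps only attributes with a non-zero raw value; the outer `count by` then counts how many such attributes each device has.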

Moreover, as the article says:

> The SMART stats we track, with the exception of SMART 197, are cumulative in nature, meaning we need to consider the time period over which the errors were reported.

I think we need to base the alert on a rate, such as more than one error in the last month, because if the count is cumulative, a single error will keep triggering the alert until the count is reset.
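
One way to express "more than one error in the last month" is sketched below. The 30-day window, the threshold of 1, and the attribute ID list are illustrative assumptions rather than values agreed on in this thread.

```promql
# Sketch only: fire when the raw count grew by more than one
# over the last 30 days, instead of whenever it is non-zero.
increase(
  smartctl_device_attribute{
    attribute_value_type="raw",
    attribute_id=~"5|187|188|197|198"
  }[30d]
) > 1
```

Because `increase()` extrapolates to the edges of the window, the result can be fractional even for integer counts, so the threshold may need tuning.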

@Deezzir
Contributor Author

Deezzir commented Dec 13, 2024

I was also thinking about increasing the window, but what would be a good value? Also, these errors are different in nature, so some of them might be more critical than others. I generally think that we should split them.

Ideally, the alert should be more complex; from my experience, the errors, if you plot them, look like an exponential graph. If the drive actually starts to fail, it gradually increases the error count. Then, the errors bump more frequently closer to death.

If we could do an alert that would check the change in error counts over some period of time, that would be better. But I am not sure if such tooling is available.
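
For what it is worth, PromQL range functions combined with the `offset` modifier can compare the change in one window against the change in the previous one. The sketch below assumes 7-day windows and the attribute IDs used elsewhere in this thread; both are illustrative choices.

```promql
# Sketch only: the raw count grew more in the last 7 days
# than it did in the 7 days before that.
delta(
  smartctl_device_attribute{
    attribute_value_type="raw",
    attribute_id=~"5|187|188|197|198"
  }[7d]
)
>
delta(
  smartctl_device_attribute{
    attribute_value_type="raw",
    attribute_id=~"5|187|188|197|198"
  }[7d] offset 7d
)
```

A drive whose count is flat in both windows yields `0 > 0`, which is false, so only accelerating growth matches.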

@Deezzir
Contributor Author

Deezzir commented Dec 13, 2024

One possible improvement is to use the thresh values and check against them.
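
A sketch of such a check is shown below. In SMART, an attribute is considered failed when its normalized value drops to or below its threshold, so the comparison uses the `value` series rather than the raw one. It assumes that the `value` and `thresh` series of an attribute differ only in the `attribute_value_type` label, as in the output at the top of this thread.

```promql
# Sketch only: normalized value has reached or fallen below the
# vendor threshold for the same attribute on the same device.
smartctl_device_attribute{attribute_value_type="value"}
  <= ignoring(attribute_value_type)
smartctl_device_attribute{attribute_value_type="thresh"}
```

For the sample drive above this is `100 <= 10`, which is false, so nothing would fire.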

@gabrielcocenza
Member

> I was also thinking about increasing the window, but what would be a good value?

I would say that calculating the rate over a week seems reasonable.

> If we could do an alert that would check the change in error counts over some period of time, that would be better. But I am not sure if such tooling is available.

What do you mean by that? If we have the count, we can calculate the rate in Prometheus, e.g.:

rate(smartctl_device_attribute{attribute_id=~"5|187|188|197|198"}[1w]) * 86400

This query gives you the average number of errors per day over the past week.
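
The query above matches every `attribute_value_type`, so a variant restricted to the raw series is sketched below. The second expression is an alternative that assumes the exporter exposes this metric as a gauge; `rate()` is designed for counters and treats any decrease as a counter reset, whereas `delta()` is intended for gauges.

```promql
# Sketch only: average growth of the raw count per day over the past week.
rate(
  smartctl_device_attribute{
    attribute_value_type="raw",
    attribute_id=~"5|187|188|197|198"
  }[1w]
) * 86400

# Alternative, assuming the metric is a gauge: change over the week,
# divided by 7 to give a per-day figure.
delta(
  smartctl_device_attribute{
    attribute_value_type="raw",
    attribute_id=~"5|187|188|197|198"
  }[1w]
) / 7
```

These are two separate queries and would be evaluated independently.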

> One possible improvement is to use the thresh values and check against them.

But aren't those cumulative counters?

Contributor

@aieri aieri left a comment


Given that this is only a warning, I think it's OK to look at the raw value.
However, I would not use it for criticals, because the value is not normalized and thus its severity cannot be immediately derived.

@Deezzir Deezzir merged commit d7d6c32 into canonical:main Dec 13, 2024
10 checks passed
@Deezzir Deezzir deleted the SOLENG-947 branch December 13, 2024 19:29
Development

Successfully merging this pull request may close these issues.

SmartAttributeWarning seems to configured incorrectly
3 participants