Fix SmartAttributeWarning alert #375
My 2c:
This is definitely an improvement; it does not make sense to use any value other than the raw one.
However, I have the impression that this can still be noisy. According to the article, there are quite a lot of false positives. I think we can be stricter and change the query to trigger a warning only if more than one SMART attribute reports errors on a single device.
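A minimal sketch of that stricter rule in Python, assuming readings shaped like `(device, attribute_name, raw_value)` tuples. The function name, the data shape and the sample values are hypothetical and only illustrate the "more than one attribute per device" condition; this is not the alert query itself.

```python
from collections import defaultdict


def devices_to_warn(samples, min_attributes=2):
    """Return devices on which at least `min_attributes` distinct SMART
    attributes report a non-zero raw value.

    `samples` is an iterable of (device, attribute_name, raw_value) tuples;
    this shape is an assumption made for the example.
    """
    failing = defaultdict(set)
    for device, attribute, raw_value in samples:
        if raw_value > 0:
            failing[device].add(attribute)
    return sorted(d for d, attrs in failing.items() if len(attrs) >= min_attributes)


# Hypothetical readings: sda has two attributes with errors, sdb only one.
samples = [
    ("sda", "Reallocated_Sector_Ct", 3),
    ("sda", "Current_Pending_Sector", 1),
    ("sdb", "Reallocated_Sector_Ct", 8),
    ("sdb", "Current_Pending_Sector", 0),
]
print(devices_to_warn(samples))
```

With these readings only `sda` qualifies, because `sdb` has errors on a single attribute.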
Moreover, as the article says:
The SMART stats we track, with the exception of SMART 197, are cumulative in nature, meaning we need to consider the time period over which the errors were reported.
I think we need to alert on a rate instead, such as more than one error in the last month. Because the counters are cumulative, a single error will keep the alert firing until the count is reset.
I was also thinking about increasing the window, but what would be a good value? Also, these errors are different in nature, so some of them might be more critical than others. I generally think that we should split them. Ideally, the alert should be more sophisticated: in my experience, if you plot the errors, the graph looks exponential. When a drive actually starts to fail, the error count first increases gradually, then jumps more and more frequently as the drive approaches failure. An alert that checks the change in error counts over some period of time would be better, but I am not sure whether such tooling is available.
One possible improvement is to use the
I would say that calculating the rate over a week seems reasonable.
What do you mean by that? If we have the count, we can calculate the rate in Prometheus, e.g.:
This query gives you the average number of errors per day over the past week.
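As an illustration of the arithmetic behind such a rate (not the query from the comment), here is a Python sketch that takes timestamped readings of a cumulative error counter and returns the average increase per day over the last week. The function name and the sample values are hypothetical, and the sketch assumes the counter never resets inside the window.

```python
from datetime import datetime, timedelta


def errors_per_day(samples, now, window_days=7):
    """Average daily increase of a cumulative counter over the last `window_days`.

    `samples` is a list of (timestamp, cumulative_count) pairs sorted by time.
    Counter resets are not handled; that is a simplifying assumption.
    """
    start = now - timedelta(days=window_days)
    in_window = [count for ts, count in samples if start <= ts <= now]
    if len(in_window) < 2:
        return 0.0  # not enough data to measure a change
    return (in_window[-1] - in_window[0]) / window_days


now = datetime(2024, 1, 8)

# Hypothetical drive with 10 old errors and 7 new ones during the week.
degrading = [
    (datetime(2024, 1, 1), 10),
    (datetime(2024, 1, 4), 10),
    (datetime(2024, 1, 8), 17),
]

# Hypothetical drive with 10 old errors and nothing new.
stable = [
    (datetime(2024, 1, 1), 10),
    (datetime(2024, 1, 8), 10),
]

print(errors_per_day(degrading, now))
print(errors_per_day(stable, now))
```

The second drive yields a rate of zero even though its cumulative count is non-zero, which is why a rate-based condition stops firing once errors stop accumulating.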
But aren't those cumulative counters?
Given that this is only a warning, I think it's ok to look at the raw value.
However, I would not use it for criticals, because the value is not normalized and thus its severity cannot be immediately derived.
Only fire the alert if attribute_value_type="raw" is set. The issue was that there are several entries for each attribute, including value, worst and thresh:
Coming from the smartctl CLI:
Fixes: #358
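To make the effect of the fix concrete, here is a small Python sketch, assuming each attribute is exported as four series that differ only in the `attribute_value_type` label. The dictionary layout, the sample numbers and the "greater than zero" condition are assumptions for the example, not the exporter's actual output or the alert's actual expression.

```python
def raw_entries(series):
    """Keep only the series whose attribute_value_type label is "raw"."""
    return [s for s in series if s["labels"].get("attribute_value_type") == "raw"]


# Hypothetical entries for one healthy attribute on one device.
series = [
    {"labels": {"attribute_name": "Reallocated_Sector_Ct", "attribute_value_type": "raw"}, "value": 0},
    {"labels": {"attribute_name": "Reallocated_Sector_Ct", "attribute_value_type": "value"}, "value": 100},
    {"labels": {"attribute_name": "Reallocated_Sector_Ct", "attribute_value_type": "worst"}, "value": 100},
    {"labels": {"attribute_name": "Reallocated_Sector_Ct", "attribute_value_type": "thresh"}, "value": 10},
]

# Without the label filter, the normalized entries look like errors.
unfiltered = [s for s in series if s["value"] > 0]

# With the filter, only the raw count is compared, so nothing fires.
firing = [s for s in raw_entries(series) if s["value"] > 0]

print(len(unfiltered), len(firing))
```

Here the unfiltered condition matches the `value`, `worst` and `thresh` entries of a healthy drive, while the filtered one matches nothing.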