[BUG] Hardware ECC Recovered incorrectly reported as disk failure #374
Comments
That's interesting. Technically this result is "correct", since the Backblaze data Scrutiny uses correlates your ECC Recovered failure value (40) with a 22% chance to fail. The larger issue is that Scrutiny doesn't have the concept of transient failures. If any of the metrics have ever failed, then the disk will always be marked as failed (even if the ECC Recovered value resets). This shouldn't be incredibly difficult to implement, but it may take some time. Thanks for bringing this to my attention!
As discussed in [1] some SMART errors are transient and should not be treated as permanent. This commit adds support for a configurable list of ATA SMART attribute IDs for which failures will be treated as transient. Drive health history is still recorded and notifications are sent, but the device itself is not marked as failed. Fixes AnalogJ#374. [1] AnalogJ#374
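To make the behaviour described in that commit message concrete, here is a minimal Python sketch. It is not Scrutiny's actual code: `TRANSIENT_ATTRIBUTE_IDS`, `AttributeResult`, and `evaluate` are hypothetical names chosen for illustration, and the only fact assumed from outside the thread is that ATA SMART attribute 195 is Hardware ECC Recovered (5 is Reallocated Sectors Count).

```python
from dataclasses import dataclass

# Hypothetical config value: ATA SMART attribute IDs whose failures are
# treated as transient. 195 is Hardware ECC Recovered.
TRANSIENT_ATTRIBUTE_IDS = {195}


@dataclass
class AttributeResult:
    attribute_id: int  # ATA SMART attribute ID
    failed: bool       # whether this attribute failed its threshold check


def evaluate(results, transient_ids=TRANSIENT_ATTRIBUTE_IDS):
    """Return (device_failed, notify) for one collection run.

    Any failed attribute triggers a notification, but only failures of
    attributes outside the transient list mark the device as failed.
    """
    failed = [r for r in results if r.failed]
    notify = bool(failed)
    device_failed = any(r.attribute_id not in transient_ids for r in failed)
    return device_failed, notify
```

Under these assumptions, a failing attribute 195 on its own still produces a notification but leaves the device status untouched, while a failing attribute 5 marks the device as failed as before.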
Well, I took a shot at it, hope it's welcome :)
Commented on your PR, sorry for the (incredibly long) delay!
Just wanted to chime in that I have a pair of similar Seagate drives (2TB), and this attribute for both hovers around the 38-40 mark. I also see the 22% failure rate and "Failed" status, which initially startled me.
Hello, I have Scrutiny installed on UnRaid (Docker Compose). I installed it about a week or two ago, and initially all my drives were listed as passed (even my drives with 8 years of power-on time). Today I noticed that my parity drive (Seagate BarraCuda Pro) was listed as failed. I checked the critical values and they were all fine. I then checked all values, and it listed a few warnings and a failure on Hardware ECC Recovered. The only thing I have done since the drive was listed as healthy is rebuild parity in UnRaid (I converted some drives to ZFS and removed some drives from the array into their own ZFS pool). I've also installed DiskSpeed and benchmarked the drive. Should I be concerned? Or is this just Scrutiny being funky with Seagate drives?
I have the same issue. Scrutiny shows higher and lower values for Hardware ECC Recovered, but the raw value shows 0 errors ever recorded. This definitely needs to be a bug dedicated to Seagate, as they are one of the only ones to use this different raw value type. Hope this gets fixed, because the drive is new, got tested thoroughly, and the calculations show not a single error ever recorded on it. Tool to calculate: https://s.i.wtf
I did an extended SMART test on UnRaid and the drive passed with flying colors. I'm going to have to get rid of Scrutiny. I don't need that negativity in my life, especially when the drive is okay. I'll reinstall when they make changes to account for Seagate's differences.
@Lebowski89 you could still use Scrutiny, but stick to SMART data only for the "Device Status - Thresholds" setting, as by default it uses SMART + the Backblaze dataset. Both of my Seagate disks are in "failed" status with the default settings, but they pass when I switch to SMART. ¯_(ツ)_/¯
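A rough Python sketch of why the two settings can disagree, based only on the behaviour reported in this thread. The function name and the mode strings `"smart"` and `"both"` are hypothetical, and the assumption that the default mode fails a device when either source flags it is mine, not something confirmed by the maintainers.

```python
def device_status(smart_failed: bool, backblaze_failed: bool, mode: str = "both") -> str:
    """Combine the two failure signals according to the threshold mode.

    mode="smart": only the drive's own SMART thresholds count.
    mode="both":  SMART thresholds or the Backblaze-derived failure
                  rates can mark the device as failed (assumed default).
    """
    if mode == "smart":
        failed = smart_failed
    elif mode == "both":
        failed = smart_failed or backblaze_failed
    else:
        raise ValueError(f"unknown threshold mode: {mode!r}")
    return "failed" if failed else "passed"
```

With these assumptions, a Seagate drive whose Hardware ECC Recovered value trips only the Backblaze-derived check reports `"failed"` under `"both"` and `"passed"` under `"smart"`.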
Will we still get notifications in case something changes?
Re-deployed the Scrutiny container. It is still reporting that Seagate drive as failed with SMART + Backblaze. The drive still easily passes SMART and is working just fine. Followed your suggestion and changed the "Device Status - Thresholds" setting to SMART, and the drive is correctly listed as passed. Thanks. However, I've noticed that doing so now reports a failing 850 Pro SSD as passed, even though it isn't passing the SMART test in UnRaid and is getting more reallocated sectors every day. (Got a new SSD in the mail.) So yeah, there is that.
Describe the bug
This particular SMART attribute is expected to fluctuate up and down, especially during random IO, and is not indicative of disk failure. See here for some background info. Also, it seems that for this attribute lower values are worse, not better.
Expected behavior
Scrutiny shouldn't report this as failure. Seagate's own SeaTools doesn't either.
Screenshots
See the last row.