[BUG] Hardware ECC Recovered incorrectly reported as disk failure #374

Open
dcelasun opened this issue Sep 24, 2022 · 10 comments · May be fixed by #375
Labels
bug Something isn't working enhancement New feature or request

Comments

@dcelasun

Describe the bug

This particular SMART attribute is expected to fluctuate up and down, especially during random IO, and is not indicative of disk failure. See here for some background info. Also, it seems that for this attribute lower values are worse, not better.

Expected behavior
Scrutiny shouldn't report this as failure. Seagate's own SeaTools doesn't either.

Screenshots
See the last row.

Screenshot from 2022-09-24 17-13-58

@dcelasun dcelasun added the bug Something isn't working label Sep 24, 2022
@AnalogJ
Owner

AnalogJ commented Sep 24, 2022

That's interesting.

Technically this result is "correct" since the Backblaze data Scrutiny uses correlates your ECC Recovered failure value (40) with a 22% chance to fail.

The larger issue is that Scrutiny doesn't have the concept of transient failures. If any of the metrics have ever failed, then the disk will always be marked as failed (even if the ECC Recovered value resets).

This shouldn't be incredibly difficult to implement, but it may take some time.

Thanks for bringing this to my attention!
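
The gap described above can be illustrated with a minimal Go sketch. It assumes a simplified model in which each collection run yields a single pass/fail result; the `Status` type and both functions are hypothetical and are not taken from Scrutiny's code base.

```go
package main

import "fmt"

// Status is a hypothetical per-collection health result; it is not Scrutiny's real type.
type Status int

const (
	Passed Status = iota
	Failed
)

// stickyStatus models the behavior described above: a single failed
// reading anywhere in the history marks the device as failed for good.
func stickyStatus(history []Status) Status {
	for _, s := range history {
		if s == Failed {
			return Failed
		}
	}
	return Passed
}

// latestStatus models a transient-aware alternative: only the most
// recent reading decides the device status.
func latestStatus(history []Status) Status {
	if len(history) == 0 {
		return Passed
	}
	return history[len(history)-1]
}

func main() {
	// The attribute dipped below its threshold once, then recovered.
	history := []Status{Passed, Failed, Passed}
	fmt.Println("sticky reports failed:", stickyStatus(history) == Failed)
	fmt.Println("latest reports failed:", latestStatus(history) == Failed)
}
```

With a sticky status, a fluctuating attribute such as Hardware ECC Recovered keeps the drive marked as failed even after the value recovers, which matches what is reported in this issue.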

@AnalogJ AnalogJ added the enhancement New feature or request label Sep 24, 2022
dcelasun added a commit to dcelasun/scrutiny that referenced this issue Sep 24, 2022
As discussed in [1] some SMART errors are transient and should not
be treated as permanent.

This commit adds support for a configurable list of ATA SMART attribute
IDs for which failures will be treated as transient. Drive health history
is still recorded and notifications are sent, but the device itself is
not marked as failed.

Fixes AnalogJ#374.

[1] AnalogJ#374
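
A rough Go sketch of the idea in this commit message, not the code from the linked pull request: the `AttrResult` and `Outcome` types, the `evaluate` function, and the choice of attribute 195 (Hardware ECC Recovered) as the configured transient ID are all assumptions made for illustration.

```go
package main

import "fmt"

// AttrResult is a hypothetical evaluation result for one ATA SMART attribute.
type AttrResult struct {
	ID     int
	Failed bool
}

// Outcome separates "notify the user" from "mark the device as failed".
type Outcome struct {
	Notify       bool
	DeviceFailed bool
}

// evaluate notifies on every failed attribute, but only marks the device
// as failed when the failing attribute is not in the transient set.
func evaluate(results []AttrResult, transient map[int]bool) Outcome {
	var out Outcome
	for _, r := range results {
		if !r.Failed {
			continue
		}
		out.Notify = true
		if !transient[r.ID] {
			out.DeviceFailed = true
		}
	}
	return out
}

func main() {
	// Assumed configuration: attribute 195 (Hardware ECC Recovered) is transient.
	transient := map[int]bool{195: true}
	fmt.Printf("%+v\n", evaluate([]AttrResult{{ID: 195, Failed: true}}, transient))
	// Attribute 5 (Reallocated Sectors Count) is not in the set, so it fails the device.
	fmt.Printf("%+v\n", evaluate([]AttrResult{{ID: 5, Failed: true}}, transient))
}
```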
@dcelasun dcelasun linked a pull request Sep 24, 2022 that will close this issue
@dcelasun
Author

Well, I took a shot at it, hope it's welcome :)

dcelasun added a commit to dcelasun/scrutiny that referenced this issue Sep 24, 2022
As discussed in [1] some SMART errors are transient and should not
be treated as permanent.

This commit adds support for a configurable list of ATA SMART attribute
IDs, failures of which will be treated as transient. Drive health history
is still recorded and notifications are sent, but the device itself is
not marked as failed.

Fixes AnalogJ#374.

[1] AnalogJ#374
@AnalogJ
Owner

AnalogJ commented Oct 13, 2022

Commented on your PR, sorry for the (incredibly long) delay!

@korikori

Just wanted to chime in that I have a pair of similar Seagate drives (2TB) and this attribute for both gravitates around the 38-40 mark. I also see the 22% failure rate and "Failed" status which initially startled me.

@Lebowski89

Hello, I have Scrutiny installed on UnRaid (Docker Compose). I installed it a week or two ago, and initially all my drives were listed as passed (even my drives with 8 years of power-on time). Today I noticed that my parity drive (Seagate BarraCuda Pro) was listed as failed. I checked the critical values and they were all fine. I then checked all values, and it lists a few warnings and a failure on Hardware ECC Recovered. The only thing I have done since the drive was listed as healthy is rebuild parity in UnRaid (converted some drives to ZFS and removed some drives from the array into their own ZFS pool). I've also installed DiskSpeed and benchmarked the drive.

Parity1
Parity2
Parity3
Parity4
Parity5

Should I be concerned? Or is this just Scrutiny being funky with Seagate drives?

@N8-Yue

N8-Yue commented Apr 11, 2024

I have the same issue. Scrutiny shows higher and lower values for Hardware ECC Recovered, but the raw value shows 0 errors ever recorded. This definitely needs to be a bug dedicated to Seagate, as they are one of the only ones to use this different raw value type. Hope this gets fixed, because the drive is new, was tested thoroughly, and the calculations show not a single error ever recorded on it. Tool to calculate: https://s.i.wtf
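
For readers unfamiliar with the "different raw value type" mentioned here, the following Go sketch shows one commonly described interpretation. It assumes the 48-bit raw value packs an error count into the upper 16 bits and an operation count into the lower 32 bits; that layout is an assumption, not something confirmed in this thread, and the function name and sample value are made up.

```go
package main

import "fmt"

// splitSeagateRaw splits a 48-bit raw value under the ASSUMED layout:
// upper 16 bits = error count, lower 32 bits = operation count.
func splitSeagateRaw(raw uint64) (errCount uint16, opCount uint32) {
	raw &= 0xFFFFFFFFFFFF // keep only the low 48 bits
	return uint16(raw >> 32), uint32(raw)
}

func main() {
	// Made-up raw value for illustration; not taken from any drive in this thread.
	errCount, opCount := splitSeagateRaw(0x00000A1B2C3D)
	fmt.Printf("errors=%d operations=%d\n", errCount, opCount)
}
```

Under this reading, a large and constantly changing raw number can still correspond to zero recorded errors, because most of its magnitude comes from the operation counter.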

@Lebowski89

I have the same issue. Scrutiny shows higher and lower values for Hardware ECC Recovered, but the raw value shows 0 errors ever recorded. This definitely needs to be a bug dedicated to Seagate, as they are one of the only ones to use this different raw value type. Hope this gets fixed, because the drive is new, was tested thoroughly, and the calculations show not a single error ever recorded on it. Tool to calculate: https://s.i.wtf

I did an extended SMART test on UnRaid and the drive passed with flying colors. I'm going to have to get rid of Scrutiny. I don't need that negativity in my life, especially when the drive is okay. I'll reinstall when they make changes to account for Seagate's differences.

ST10000DM0004-20240412.txt

@korikori

korikori commented Apr 12, 2024

@Lebowski89 you could still use Scrutiny, but stick to SMART data only for the "Device Status - Thresholds" setting, as by default it uses SMART + the Backblaze dataset. Both of my Seagate disks are in "failed" status with the default settings, but they pass when I switch to SMART. ¯_(ツ)_/¯
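
The effect of this setting can be pictured with a small Go sketch. It assumes the device status is stored as bit flags and the threshold setting acts as a mask over them; the constant names and `isFailed` are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Hypothetical bit flags for why a device failed; the names are illustrative.
const (
	FailedSMART     = 1 << 0 // the drive's own SMART thresholds tripped
	FailedBackblaze = 1 << 1 // a Backblaze-derived failure-rate threshold tripped
)

// isFailed reports whether any failure reason selected by the threshold mask is set.
func isFailed(status, threshold int) bool {
	return status&threshold != 0
}

func main() {
	seagate := FailedBackblaze // fails only the Backblaze-derived check

	fmt.Println(isFailed(seagate, FailedSMART|FailedBackblaze)) // SMART + Backblaze mode
	fmt.Println(isFailed(seagate, FailedSMART))                 // SMART-only mode
}
```

Because such a mask would apply to every drive, any other disk that is flagged only by the Backblaze-derived check would also show as passed in SMART-only mode.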

@brkr1

brkr1 commented May 13, 2024

you could still use Scrutiny, but stick to SMART

Will we still get notifications in case something changes?

@Lebowski89

Lebowski89 commented Nov 24, 2024

@Lebowski89 you could still use Scrutiny, but stick to SMART data only for the "Device Status - Thresholds" setting, as by default it uses SMART + the Backblaze dataset. Both of my Seagate disks are in "failed" status with the default settings, but they pass when I switch to SMART. ¯_(ツ)_/¯

Re-deployed the Scrutiny container. It still reports that Seagate drive as failed with SMART + Backblaze. The drive still easily passes SMART and is working just fine. Followed your suggestion and changed 'Device Status - Thresholds' to SMART, and the drive is now correctly listed as passed. Thanks.

Screenshot 2024-11-24 221453

However, I've noticed that doing so now reports a failing 850 Pro SSD as passed, even though it isn't passing the SMART test in UnRaid and is getting more reallocated sectors every day:

Screenshot 2024-11-24 221203

(Got a new SSD in the mail)

So yeah, there is that..

6 participants