[BUG] Hardware ECC Recovered incorrectly reported as disk failure #374

dcelasun · 2022-09-24T16:20:19Z

Describe the bug

This particular SMART attribute is expected to fluctuate up and down, especially during random IO, and is not indicative of disk failure. See here for some background info. Also, it seems that for this attribute lower values are worse, not better.

Expected behavior
Scrutiny shouldn't report this as failure. Seagate's own SeaTools doesn't either.

Screenshots
See the last row.

AnalogJ · 2022-09-24T17:58:21Z

That's interesting.

Technically this result is "correct" since the Backblaze data Scrutiny uses correlates your ECC Recovered failure value (40) with a 22% chance to fail.

The larger issue is that Scrutiny doesn't have the concept of transient failures. If any of the metrics have ever failed, then the disk will always be marked as failed (even if the ECC Recovered value resets).

This shouldn't be incredibly difficult to implement, but it may take some time.

Thanks for bringing this to my attention!

dcelasun · 2022-09-24T20:26:57Z

Well, I took a shot at it, hope it's welcome :)

As discussed in [1] some SMART errors are transient and should not be treated as permanent. This commit adds support for a configurable list of ATA SMART attribute IDs, failures of which will be treated as transient. Drive health history is still recorded and notifications are sent, but the device itself is not marked as failed. Fixes AnalogJ#374. [1] AnalogJ#374
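To make the idea concrete, here is a minimal Python sketch of the behavior described above. It is not the code from the pull request (Scrutiny itself is written in Go); the names `Device`, `record_failure`, and `TRANSIENT_ATTRIBUTE_IDS` are hypothetical, and the default list containing only attribute 195 (Hardware ECC Recovered) is an assumption for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical default: 195 is the ATA SMART ID for Hardware ECC Recovered.
TRANSIENT_ATTRIBUTE_IDS = frozenset({195})


@dataclass
class Device:
    failed: bool = False                               # sticky once set
    history: list = field(default_factory=list)        # every failure is kept
    notifications: list = field(default_factory=list)  # every failure notifies


def record_failure(device, attribute_id, transient_ids=TRANSIENT_ATTRIBUTE_IDS):
    """Record a failed attribute; only non-transient IDs mark the device failed."""
    device.history.append(attribute_id)
    device.notifications.append(f"attribute {attribute_id} failed")
    if attribute_id not in transient_ids:
        device.failed = True
    return device.failed


disk = Device()
ecc_only = record_failure(disk, 195)     # transient: history and notification only
after_realloc = record_failure(disk, 5)  # 5 = Reallocated Sectors Count, not transient
```

With this shape, a fluctuating ECC value still shows up in the history and triggers a notification, but only a failure outside the configured list flips the device to failed.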

AnalogJ · 2022-10-13T03:40:21Z

Commented on your PR, sorry for the (incredibly long) delay!

korikori · 2023-01-23T16:03:15Z

Just wanted to chime in that I have a pair of similar Seagate drives (2TB) and this attribute for both gravitates around the 38-40 mark. I also see the 22% failure rate and "Failed" status which initially startled me.

Lebowski89 · 2024-04-11T03:54:48Z

Hello, I have Scrutiny installed on UnRaid (Docker Compose). I installed it a week or two ago, and initially all my drives were listed as passed (even my drives with 8 years of power-on time). Today I noticed that my parity drive (Seagate BarraCuda Pro) was listed as failed. I checked the critical values and they were all fine. Checking all values, it lists a few warnings and a failure on Hardware ECC Recovered. The only thing I have done since the drive was listed as healthy was rebuild parity in UnRaid (converted some drives to ZFS and removed some drives from the array into their own ZFS pool). I've also installed DiskSpeed and benchmarked the drive.

Should I be concerned? Or is this just Scrutiny being funky with Seagate drives?

N8-Yue · 2024-04-11T07:41:46Z

I have the same issue. Scrutiny shows the Hardware ECC Recovered value going up and down, but the raw value shows 0 errors ever recorded. This definitely needs to be tracked as a Seagate-specific bug, as they are one of the only ones to use this different raw value type. I hope this gets fixed, because the drive is new, was tested thoroughly, and the calculations show not a single error ever recorded on it. Tool to calculate: https://s.i.wtf
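For anyone who wants to check the raw value by hand, below is a small Python sketch of the calculation. It assumes the commonly cited community interpretation of Seagate's 48-bit raw values (upper 16 bits are the error count, lower 32 bits are the operation count); that layout is an assumption not confirmed in this thread or by Seagate documentation, and the function name `split_seagate_raw` is hypothetical.

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumed layout (community interpretation, unconfirmed):
    bits 47..32 hold the error count, bits 31..0 hold the operation count.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value must fit in 48 bits")
    errors = raw >> 32
    operations = raw & 0xFFFFFFFF
    return errors, operations


# A large-looking raw value that still decodes to zero errors.
healthy = split_seagate_raw(123456789)         # (0, 123456789)
# A made-up value with two errors over 1000 operations.
two_errors = split_seagate_raw((2 << 32) | 1000)  # (2, 1000)
```

Under that assumption, a raw value can look alarmingly large while its error count is still zero, which matches what is described above.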

Lebowski89 · 2024-04-12T03:49:29Z

> I have the same issue. Scrutiny shows higher and lower values with the Hardware ECC, but the raw value shows 0 errors ever recorded. This definitely needs to be a bug dedicated to Seagate, as they are one of the only ones to use this different raw value type. Hope this gets fixed, because the drive is new, got tested thoroughly and the calculations show no single error ever recorded on it. Tool to calculate https://s.i.wtf

I did an extended SMART test on UnRaid and the drive passed with flying colors. I'm going to have to get rid of Scrutiny. I don't need that negativity in my life, especially when the drive is okay. I'll reinstall when they make changes to account for Seagate's differences.

ST10000DM0004-20240412.txt

korikori · 2024-04-12T13:11:53Z

@Lebowski89 you could still use Scrutiny, but stick to SMART data only for the "Device Status - Thresholds" setting, as by default it uses SMART + the Backblaze dataset. Both of my Seagate disks are in "failed" status with the default settings, but they pass when I switch to SMART. ¯_(ツ)_/¯
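Roughly, the difference between the two settings can be pictured with the Python sketch below. This is an illustration of the behavior described in this comment, not Scrutiny's actual code; `ThresholdMode`, `device_failed`, and the two boolean inputs are hypothetical names.

```python
from enum import Enum


class ThresholdMode(Enum):
    SMART_ONLY = "smart"          # manufacturer SMART thresholds only
    SMART_AND_BACKBLAZE = "both"  # default: SMART plus Backblaze-derived thresholds


def device_failed(smart_failed, backblaze_failed, mode):
    """Return the device status under the chosen threshold mode."""
    if mode is ThresholdMode.SMART_ONLY:
        return smart_failed
    return smart_failed or backblaze_failed


# A drive that passes the manufacturer thresholds but trips the
# Backblaze-derived threshold for Hardware ECC Recovered.
default_status = device_failed(False, True, ThresholdMode.SMART_AND_BACKBLAZE)
smart_status = device_failed(False, True, ThresholdMode.SMART_ONLY)
```

Here `default_status` is `True` (failed) and `smart_status` is `False` (passed), which is the same split I see on my two Seagate disks.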

brkr1 · 2024-05-13T20:23:56Z

> you could still use Scrutiny, but stick to SMART

Will we still get notifications in case something changes?

Lebowski89 · 2024-11-24T11:09:40Z

> @Lebowski89 you could still use Scrutiny, but stick to SMART data only for the "Device Status - Thresholds" setting, as by default it uses SMART + the Backblaze dataset. Both of my Seagate disks are in "failed" status with the default settings, but they pass when I switch to SMART. ¯_(ツ)_/¯

Re-deployed the Scrutiny container. It still reports that Seagate drive as failed with SMART + Backblaze. The drive still easily passes SMART and is working just fine. Followed your suggestion and changed 'Device Status - Thresholds' to SMART, and the drive is now correctly listed as passed. Thanks.

However, I've noticed that doing so now reports a failing 850 Pro SSD as passed, even though it isn't passing the SMART test in UnRaid and is getting more reallocated sectors every day.

(Got a new SSD in the mail)

So yeah, there is that..

dcelasun added the "bug" label on Sep 24, 2022

AnalogJ added the "enhancement" label on Sep 24, 2022

dcelasun linked a pull request on Sep 24, 2022 that will close this issue: Support transient SMART failures #375 (Open)
