HDD degraded but passing all tests.

Grinas · Mar 23, 2022

So I noticed the other day that one of my pools was degraded.

I then checked whether it was still in its warranty period, and it is. It's a 3TB WD Green.

When I went to put in the request to return it, I was asked why, so I ran some tests on the drive and all came back good. If the drive is passing all tests, then why is it showing as degraded?

The tests I ran were:
smartctl -l selftest /dev/da4 -t long
smartctl -l selftest /dev/da4 -t conveyance
smartctl -l selftest /dev/da4 -t short

Any advice on how to prove it's failing/degraded? The drive is the newest of the 8 I have, so it was a surprise that it was the one failing, but I have had brand-new drives fail within a week before, so I guess age makes no difference.

mistermanko · Mar 23, 2022

Please post the result of smartctl -a /dev/da4 here in CODE-brackets.

Grinas · Mar 23, 2022

Code:

smartctl -a /dev/da4
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Purple
Device Model:     WDC WD30PURZ-85GU6Y0
Serial Number:    WD-WCC4N7AY14Z9
LU WWN Device Id: 5 0014ee 265d1adc5
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Mar 23 14:16:30 2022 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (39600) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 398) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   187   182   021    Pre-fail  Always       -       5641
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       99
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       17179
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       99
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       97
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       381
194 Temperature_Celsius     0x0022   107   092   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     17179         -
# 2  Extended offline    Completed without error       00%     17137         -
# 3  Conveyance offline  Completed without error       00%     17129         -
# 4  Extended offline    Aborted by host               90%     17129         -
# 5  Short offline       Completed without error       00%     17129         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Ah, I think I see it there now.

Code:

  3 Spin_Up_Time            0x0027   187   182   021    Pre-fail  Always       -       5641

joeschmuck · Mar 23, 2022

I cannot open your images in your first post so I'm flying a bit blind.

Drive Serial Number: WD-WCC4N7AY14Z9 looks good, no issues noted at all, and the Spin_Up_Time is fine. You should do a Google Search for "SMART Codes Wiki" and read up on what each code means to help you in the future.

I'm not a fan of using a Surveillance drive as a NAS drive but others have said it does work.
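
To make the "Spin_Up_Time is fine" reading concrete, here is a minimal sketch of how the attribute table is usually interpreted: a normalized VALUE at or below a non-zero THRESH counts as failed, and a large RAW_VALUE on its own does not. The function name smart_flags and the choice of IDs 5, 197, 198 and 199 as "watch the raw count" attributes are assumptions for illustration, not an official rule.

```shell
# Flag SMART attributes that look worrying in `smartctl -A` style output.
# Reads the attribute table on stdin and prints one line per finding.
# smart_flags is a hypothetical helper name, not part of smartmontools.
smart_flags() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            id = $1 + 0; value = $4 + 0; thresh = $6 + 0; raw = $10 + 0
            # A normalized VALUE at or below a non-zero THRESH means the
            # drive itself considers the attribute failed.
            if (thresh > 0 && value <= thresh)
                print $2 ": VALUE " value " <= THRESH " thresh
            # Assumption: non-zero raw counts on these IDs are worth a look.
            if ((id == 5 || id == 197 || id == 198 || id == 199) && raw > 0)
                print $2 ": raw count " raw
        }
    '
}
```

Usage would be something like smartctl -A /dev/da4 | smart_flags. Fed the table posted above, it should print nothing, since Spin_Up_Time has VALUE 187 against THRESH 021 and the raw counts for IDs 5, 197, 198 and 199 are all 0.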

mistermanko · Mar 23, 2022

No, the disk looks fine judging by SMART data. Please post the output of zpool status here.

Grinas · Mar 23, 2022

mistermanko said:
No, the disk looks fine judging by SMART data. Please post the output of zpool status here.

That's what I thought too, but it only shows 1 drive as degraded and, as @joeschmuck said, it looks fine.

Code:

zpool status
  pool: ThreeTBM
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 115G in 01:02:01 with 0 errors on Wed Mar 16 08:57:48 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    ThreeTBM                                        DEGRADED     0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/0486c9b0-7721-11e8-aaf6-1866da124f27  ONLINE       0     0     0
        gptid/bb75f4de-56a9-11e8-8d38-1866da124f27  ONLINE       0     0     0
        gptid/bdeb9521-56a9-11e8-8d38-1866da124f27  ONLINE       0     0     0
      raidz1-1                                      DEGRADED     0     0     0
        gptid/cc7658af-baa2-11ea-853a-000c296d4317  ONLINE       0     0     0
        gptid/cccb7648-baa2-11ea-853a-000c296d4317  DEGRADED     0     0     0  too many errors
        gptid/cce4a09d-baa2-11ea-853a-000c296d4317  ONLINE       0     0     0

errors: No known data errors
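
For a pool with many members it can help to reduce the status output to just the rows that are not ONLINE. The following is a rough sketch that filters the config table with awk; zpool_unhealthy is a made-up helper name, and the parsing assumes the column layout shown in the output above.

```shell
# Print only the rows of `zpool status` whose STATE is not ONLINE.
# Reads the status text on stdin. zpool_unhealthy is a hypothetical name.
zpool_unhealthy() {
    awk '
        /^[ \t]*config:/ { in_cfg = 1; next }   # device table starts here
        /^[ \t]*errors:/ { in_cfg = 0 }         # and ends here
        # Rows look like: NAME STATE READ WRITE CKSUM [optional note]
        in_cfg && NF >= 5 && $2 != "ONLINE" && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ {
            line = $1 " " $2 " read=" $3 " write=" $4 " cksum=" $5
            if (NF > 5) {
                note = $6
                for (i = 7; i <= NF; i++) note = note " " $i
                line = line " (" note ")"
            }
            print line
        }
    '
}
```

Run as zpool status ThreeTBM | zpool_unhealthy. Because a degraded member also degrades its vdev and the pool, the pool and raidz1-1 rows are listed along with the gptid row that carries the "too many errors" note.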

Etorix · Mar 23, 2022

Surveillance is "Purple"; "Green" is non-performance desktop. I'd guess that the disk encountered a temporary issue reading data, retried as long as it needed to succeed and was kicked out of the pool by ZFS for failing to answer in a timely manner.

Checking for TLER, ERC, etc. support on a drive

One of the problems with consumer-grade hard drives is that most of them will hang in the event that they run into an error, and will internally retry the operation, possibly for a minute or more. For a desktop PC, where redundancy does not exist, this is the correct course of action, because...

www.truenas.com
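
The SMART output earlier in the thread lists "SCT Error Recovery Control supported", so the timeout discussed in that linked guide can at least be inspected with smartctl -l scterc. The sketch below only prints the commands rather than running them; erc_cmds is a hypothetical helper, and the 7-second figure in the usage line is a commonly quoted value, not something stated in this thread. Whether ERC is involved here at all is only a guess, as the post says.

```shell
# Print (do not run) the smartctl commands to view and set SCT ERC.
# $1 = device path, $2 = timeout in whole seconds.
# smartctl takes the timeout in tenths of a second, hence the * 10.
# erc_cmds is a hypothetical helper name.
erc_cmds() {
    dev=$1
    tenths=$(( $2 * 10 ))
    echo "smartctl -l scterc $dev"                    # show current read/write timeouts
    echo "smartctl -l scterc,$tenths,$tenths $dev"    # set both timeouts
}
```

For example, erc_cmds /dev/da4 7 prints the view command followed by smartctl -l scterc,70,70 /dev/da4. Only the first, read-only command is safe to run without further advice.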

joeschmuck · Mar 23, 2022

Do a scrub of your pool, should clear it up. If it keeps happening then more investigating is needed.

EDIT: Do not do any funny commands without someone giving you direction to do them unless you understand what you are doing. If after a scrub you still have the issue, you can run the clear command but do the scrub first.

mistermanko · Mar 23, 2022

joe was faster...

joeschmuck · Mar 23, 2022

mistermanko said:
joe was faster...

I was just bored at work. Today has been unusually slow.

mistermanko · Mar 24, 2022

joeschmuck said:
I was just bored at work. Today has been unusually slow.

Been there, done that.

Grinas · Mar 28, 2022

joeschmuck said:
Do a scrub of your pool, should clear it up. If it keeps happening then more investigating is needed.

EDIT: Do not do any funny commands without someone giving you direction to do them unless you understand what you are doing. If after a scrub you still have the issue, you can run the clear command but do the scrub first.

I have done a few scrubs from the web UI but it's still the same. Do I need to keep the web UI open while a scrub is running? I am not even sure they are finishing.

Is it as simple as

Code:

zpool scrub ThreeTBM

to do it from the commandline in the tmux session?

danb35 · Mar 28, 2022

Yes, zpool scrub ThreeTBM would do the trick, as would starting it through the GUI. In neither case do you need to keep the session open.

winnielinnie · Mar 28, 2022

Grinas said:
to do it from the commandline in the tmux session?

No need for tmux or disowning. Unless you specifically invoke the "-w" flag, a zpool scrub will always run in the background.
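
Since the scrub runs in the background, the "scan:" line of zpool status is the place to see whether it finished. A small sketch that classifies that line follows; scrub_state is a hypothetical helper, and the wording it matches ("scrub in progress", "scrub repaired") is taken from typical OpenZFS output, so treat the patterns as assumptions.

```shell
# Classify the "scan:" line of `zpool status` read from stdin.
# Prints in-progress, finished, other (e.g. a resilver), or unknown
# when no scan line is present. scrub_state is a hypothetical name.
scrub_state() {
    awk '
        /^[ \t]*scan:/ {
            if ($0 ~ /scrub in progress/)   state = "in-progress"
            else if ($0 ~ /scrub repaired/) state = "finished"
            else                            state = "other"
        }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

Usage: zpool status ThreeTBM | scrub_state. Pass a single pool name, because with several pools in the input only the last scan line is reported.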

joeschmuck · Mar 29, 2022

Your last scrub report was 16 March so if you still have the problem after the new scrub, post a new "zpool status" and "glabel status" output in code brackets.

I would run a SMART Long/Extended test on all of your drives. Do not pick and choose, just do them all: "smartctl -t long /dev/da0", and the same for all the other da drives. You can run the commands from the shell and close it once they are sent, but you cannot shut down or sleep the computer or the drives will not complete the test. The system will remain operational, maybe slightly slower. Why run it on all the drives? Just in case there is a problem on a drive that has not shown up yet, and it doesn't hurt.

Last thing: I'm not sure how you got to drive da4 (I suspect the first posting had a glabel status image), but if this is still the same drive causing the issue, make sure you write down the drive serial number. Always track a drive by its serial number, because the da4 identifier can change, and you will see that a lot on TrueNAS Scale (Debian).
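
smartctl starts a self-test on one device per invocation, so "do them all" means one command per drive. Here is a minimal sketch that prints those commands for every daN name it is given; smart_long_cmds is a hypothetical helper, and it deliberately prints instead of executing so the list can be reviewed first.

```shell
# Print one `smartctl -t long` command per daN disk name passed as an
# argument. Nothing is executed here. smart_long_cmds is a hypothetical name.
smart_long_cmds() {
    for disk in "$@"; do
        case $disk in
            da[0-9]*) echo "smartctl -t long /dev/$disk" ;;   # keep only daN devices
        esac
    done
}
```

On FreeBSD the disk names can come from sysctl -n kern.disks, for example smart_long_cmds $(sysctl -n kern.disks). Piping the printed lines to sh would start the tests, and smartctl -l selftest on each device shows the results later.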

Grinas · Mar 31, 2022

Still the same after a scrub.
Here is the output of zpool status and glabel status.

Code:

zpool status
  pool: ThreeTBM
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 10:45:00 with 0 errors on Wed Mar 30 07:28:03 2022
config:

    NAME                                            STATE     READ WRITE CKSUM
    ThreeTBM                                        DEGRADED     0     0     0
      raidz1-0                                      ONLINE       0     0     0
        gptid/0486c9b0-7721-11e8-aaf6-1866da124f27  ONLINE       0     0     0
        gptid/bb75f4de-56a9-11e8-8d38-1866da124f27  ONLINE       0     0     0
        gptid/bdeb9521-56a9-11e8-8d38-1866da124f27  ONLINE       0     0     0
      raidz1-1                                      DEGRADED     0     0     0
        gptid/cc7658af-baa2-11ea-853a-000c296d4317  ONLINE       0     0     0
        gptid/cccb7648-baa2-11ea-853a-000c296d4317  DEGRADED     0     0     0  too many errors
        gptid/cce4a09d-baa2-11ea-853a-000c296d4317  ONLINE       0     0     0

errors: No known data errors

Code:

glabel status
                                      Name  Status  Components
gptid/8e1f70c3-cf3b-11eb-9b9c-000c292d8980     N/A  da0p1
gptid/0486c9b0-7721-11e8-aaf6-1866da124f27     N/A  da1p2
gptid/bdeb9521-56a9-11e8-8d38-1866da124f27     N/A  da2p2
gptid/bb75f4de-56a9-11e8-8d38-1866da124f27     N/A  da3p2
gptid/cccb7648-baa2-11ea-853a-000c296d4317     N/A  da4p2
gptid/674157ba-a638-11ea-9c42-000c296d4317     N/A  da5p2
gptid/cce4a09d-baa2-11ea-853a-000c296d4317     N/A  da6p2
gptid/cc7658af-baa2-11ea-853a-000c296d4317     N/A  da7p2
gptid/1de2fa9b-a91a-11ec-8be9-000c290bcc0f     N/A  da8p2
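
The two listings are joined by hand on the gptid label: the label that zpool status marks as DEGRADED is looked up in glabel status to find its device node. A short sketch of that lookup, with gptid_to_dev as a made-up helper name:

```shell
# Look up which device node backs a given gptid label.
# Reads `glabel status` output on stdin; $1 is the label exactly as it
# appears in zpool status. gptid_to_dev is a hypothetical helper name.
gptid_to_dev() {
    awk -v want="$1" '$1 == want { print $NF }'   # last column is the component
}
```

With the output above, glabel status | gptid_to_dev gptid/cccb7648-baa2-11ea-853a-000c296d4317 gives da4p2, that is, partition 2 on da4. Running smartctl -i /dev/da4 then shows the serial number to write down.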

joeschmuck said:
Last thing: I'm not sure how you got to drive da4 (I suspect the first posting had a glabel status image), but if this is still the same drive causing the issue, make sure you write down the drive serial number. Always track a drive by its serial number, because the da4 identifier can change, and you will see that a lot on TrueNAS Scale (Debian).

I did that: I took down the serial number, shut down the machine and opened it up to see which drive it was. It was a 3TB Purple drive, the newest drive in the lot.

I'll run a test on all drives and post the results as soon as it's done.

Thanks for the help all.

***UPDATE***
Are you sure this is the correct command? I get an error when trying to run it. I don't see an option in the smartctl help to run a test on all drives at once, only on individual drives like I have done already.

Code:

smartctl -t long /dev/da0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Long (extended) offline self test failed [unsupported field in scsi command]

winnielinnie · Mar 31, 2022

Is it because of this?

Raid Card: LSI 9217-8i

Grinas · Mar 31, 2022

winnielinnie said:
Is it because of this?

You think it's an issue with the RAID card?

winnielinnie · Mar 31, 2022

Grinas said:
You think it's an issue with the RAID card?

It's likely.

Someone else with a similar HBA was unable to run SMART tests on their drives.

Long SMART self test failed [unsupported field in scsi command]

I've had this server for a couple of months and its working fine but from the beginning I've gotten the above error on all smartctl long tests. I've tried everything I could think of, and tried some of the different options on the smartmontools wiki. All drives have this error, short tests...

www.truenas.com

joeschmuck · Mar 31, 2022

The RAID comment was for being able to run the SMART test.

At this point I'd tell you to run 'zpool clear ThreeTBM' and then check your zpool status again.
If that fails... I would (and this is just me) back up all my data, destroy my pool and recreate it, then perform a scrub on the empty pool, check the zpool status, and if all is good, restore all my data. Last step, cross your fingers nothing else comes up.

And it definitely looks like drive da4 is your problem but it may not be the drive, it could be a data cable, the controller, or maybe a one time glitch. If the problem comes back, swap the drive da4 with a different drive in the raidz1-0 set (after backing up your data) and see if the problem moves. Sometimes these problems will take a bit of troubleshooting to figure out what is going on.

Best of luck.

EDIT: Of course, after posting I started thinking about something... What I find odd is that your drive da4 does run SMART Extended (aka Long) tests, so instead of running the test via the command line, run it via the GUI. It may work.
