Degraded Pool/Disk Error

ash...housewares · Apr 24, 2021

Hey all,

I'm hoping someone can help me out with a little problem I'm having...

The Problem
I received the following error earlier this week:

CRITICAL
Pool Backup state is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
The following devices are not healthy:

Disk 12971975576466214181 is DEGRADED

2021-04-18 00:01:23 (America/Chicago)

The Story to this Point
Here's a little backstory on what's happened up to this point. My wife got me a couple of 10TB WD Red drives for Christmas, so I set up this NAS from a PC that I was no longer using back in January sometime. I got TrueNAS installed, fumbled my way around creating a pool, and tested out some file transfers. Everything seemed like it was going to work out okay. Well, I powered it off and didn't get back to the project until just recently when the error above popped up.

The Setup

Motherboard: BIOSTAR TH67+
CPU: Intel Core i3-2100 Sandy Bridge
RAM: 8GB (2x 4GB)
Hard drives:
- 2x WDC WD101EFAX-68LDBN0 (10TB Western Digital Red) - These are set up in RAID 1
- 1x M4-CT064M4SSD2 (Crucial 64GB SSD Boot Drive)
- 1x WD Elements 2620 (5TB External drive)

What I've Done So Far
As you can probably tell, I'm a mega-n00b at this. I've tried to google my way through some troubleshooting, but I'm feeling kind of stuck now. Here's what I've tried up to this point:

On 4/21 I did a scrub on the pool which completed without errors.

I did a couple of short S.M.A.R.T. tests followed by a long test on each of the 10TB drives. Here are the results (degraded drive first, followed by the healthy drive for comparison):

Degraded Drive:

Code:

Warning: settings changed through the CLI are not written to
the configuration database and will be reset on reboot.

smartctl -a /dev/ada0
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFAX-68LDBN0
Serial Number:    
LU WWN Device Id: 5 000cca 0b0ce44d7
Firmware Version: 81.00A81
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Apr 24 14:10:51 2021 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1130) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   127   127   054    Old_age   Offline      -       112
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       163 (Average 72)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       344
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       210
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       344
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       352
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       352
194 Temperature_Celsius     0x0002   138   138   000    Old_age   Always       -       47 (Min/Max 20/49)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       159         -
# 2  Short offline       Completed without error       00%       140         -
# 3  Extended offline    Interrupted (host reset)      20%       130         -
# 4  Short offline       Completed without error       00%       118         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing

Healthy Drive:

Code:

smartctl -a /dev/ada2
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD101EFAX-68LDBN0
Serial Number:    
LU WWN Device Id: 5 000cca 0b0cfd400
Firmware Version: 81.00A81
User Capacity:    10,000,831,348,736 bytes [10.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Apr 24 14:22:50 2021 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (   87) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1024) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       210
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       16
194 Temperature_Celsius     0x0002   151   151   000    Old_age   Always       -       43 (Min/Max 20/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       157         -
# 2  Short offline       Completed without error       00%       141         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing

Final Thoughts
When comparing the two drives, it seems that spin-up time, start/stop count, power cycle count, power-off retract count, and load cycle count all differ pretty wildly. But I'll be honest here; I don't really know what I'm looking at.
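
A comparison like this can also be scripted rather than eyeballed. Below is a minimal Python sketch, assuming the two smartctl -a outputs have been saved as text. The names parse_raw_values, differing_attributes, DEGRADED_SAMPLE and HEALTHY_SAMPLE are made up for illustration, and the sample rows are a few lines copied from the two dumps above. It only lists which raw values differ; it does not judge whether a drive is healthy.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table from `smartctl -a`:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
# Only the leading integer of RAW_VALUE is captured; trailing notes such as
# "(Average 72)" are ignored.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Excerpts from the two dumps in this thread (ada0 first, then ada2).
DEGRADED_SAMPLE = """\
  3 Spin_Up_Time            0x0007   253   253   024    Pre-fail  Always       -       163 (Average 72)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       344
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       344
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       352
"""

HEALTHY_SAMPLE = """\
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       16
"""


def parse_raw_values(smart_text):
    """Return {attribute_name: leading integer of RAW_VALUE}."""
    values = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if match:
            values[match.group(2)] = int(match.group(9))
    return values


def differing_attributes(text_a, text_b):
    """Return {name: (raw_a, raw_b)} for attributes whose raw values differ."""
    a, b = parse_raw_values(text_a), parse_raw_values(text_b)
    return {name: (a[name], b[name]) for name in a if name in b and a[name] != b[name]}


if __name__ == "__main__":
    for name, (raw_a, raw_b) in differing_attributes(DEGRADED_SAMPLE, HEALTHY_SAMPLE).items():
        print(f"{name}: {raw_a} vs {raw_b}")
```

With the sample rows, Reallocated_Sector_Ct is not reported because it is 0 on both drives, while the spin-up, start/stop, power cycle and load cycle counters are.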

Can someone with more expertise tell me if one of my drives has turned out to be a lemon or if I should just learn to stop worrying and clear the error?

Thanks for any help!

GBillR · Apr 24, 2021

Your drives look okay. What doesn't look okay is when you say "RAID 1". What do you mean exactly?

What is the status of the pool? Two ways to get this:

1. Over SSH or in the shell, type zpool status
2. In the GUI (assuming you are on the new GUI in version 12, since you left that out), go to Storage/Pools, click the gear icon in the upper right, and select Status.

It sounds like a scrub found an error and corrected it. Your pool status should show a checksum error.

If you set up a RAID outside FreeNAS, who knows exactly what is happening...

ash...housewares · Apr 25, 2021

Hey GBillR,

Thanks so much for taking the time to help me out!

By RAID 1 I just meant that I have the drives mirrored in TrueNAS. Yes, I'm on version 12.0 of TrueNAS. Sorry, I knew I forgot something...

It looks like running zpool status returns a little more information than the GUI gives.

zpool status gives:

Code:

  pool: Backup
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:04:08 with 0 errors on Wed Apr 21 15:30:08 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        Backup                                          DEGRADED     0     0 0
          mirror-0                                      DEGRADED     0     0 0
            gptid/09cfaa87-6678-11eb-bcd5-003067c2f45e  ONLINE       0     0 0
            gptid/09da9c15-6678-11eb-bcd5-003067c2f45e  DEGRADED     0     0 0  too many errors

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:14 with 0 errors on Sun Apr 25 03:45:14 2021
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          ada1p2    ONLINE       0     0     0

errors: No known data errors
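
The rows that are not ONLINE in output like this can be picked out with a short script. This is a minimal Python sketch, assuming the text of zpool status has already been captured in a string. unhealthy_devices and SAMPLE are hypothetical names, SAMPLE is a trimmed copy of the Backup pool section above, and the parser only understands the simple layout shown here.

```python
# Trimmed copy of the Backup pool section of the zpool status output above.
SAMPLE = """\
  pool: Backup
 state: DEGRADED
  scan: scrub repaired 0B in 00:04:08 with 0 errors on Wed Apr 21 15:30:08 2021
config:

        NAME                                            STATE     READ WRITE CKSUM
        Backup                                          DEGRADED     0     0 0
          mirror-0                                      DEGRADED     0     0 0
            gptid/09cfaa87-6678-11eb-bcd5-003067c2f45e  ONLINE       0     0 0
            gptid/09da9c15-6678-11eb-bcd5-003067c2f45e  DEGRADED     0     0 0  too many errors

errors: No known data errors
"""


def unhealthy_devices(zpool_status_text):
    """Return (pool, device, state) for every config row whose STATE is not ONLINE."""
    found = []
    pool = None
    in_config = False
    for line in zpool_status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("pool:"):
            # A new pool section starts; remember its name.
            pool = stripped.split(":", 1)[1].strip()
            in_config = False
        elif stripped.startswith("config:"):
            in_config = True
        elif stripped.startswith("errors:"):
            in_config = False
        elif in_config and stripped and not stripped.startswith("NAME"):
            fields = stripped.split()
            # fields[0] is the device name, fields[1] is the STATE column.
            if len(fields) >= 2 and fields[1] != "ONLINE":
                found.append((pool, fields[0], fields[1]))
    return found


if __name__ == "__main__":
    for pool, device, state in unhealthy_devices(SAMPLE):
        print(f"{pool}: {device} is {state}")
```

For the sample it reports the pool itself, mirror-0, and the gptid/09da9c15... member as DEGRADED, which matches the alert in the first post.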

joeschmuck · Apr 25, 2021

If you haven't already done so, reboot your TrueNAS system and then run another scrub. After that, run zpool status again. Hopefully all is looking good. Also upgrade to TrueNAS 12.0-U3 if you haven't already. In the future, post your system specs per the forum rules; it will help us provide faster and more accurate advice.

As previously stated, your hard drives look perfectly fine.

If the error is still present then run zpool clear Backup and take a look again.

ash...housewares · Apr 25, 2021

Thanks for the help on this.

My setup is as follows, but let me know if there's something not listed that you want to know about:
The Setup

Motherboard: BIOSTAR TH67+
CPU: Intel Core i3-2100 Sandy Bridge
RAM: 8GB (2x 4GB)
Hard drives:
- 2x WDC WD101EFAX-68LDBN0 (10TB Western Digital Red) - These are set up in RAID 1
- 1x M4-CT064M4SSD2 (Crucial 64GB SSD Boot Drive)
- 1x WD Elements 2620 (5TB External drive)

I followed the directions: reboot, scrub, zpool status. The error was still present. I ran zpool clear Backup which cleared the error. I'm having trouble finding good documentation online; does the zpool clear <pool> command do more than just clear the errors on the pool?

The too many errors line seemed a bit concerning. Any idea what would have caused that error to pop up? Should I just be hyper vigilant about that drive going forward?

Thanks again for your time!

Redcoat · Apr 25, 2021

ash...housewares said:
I'm having trouble finding good documentation online; does the zpool clear <pool> command do more than just clear the errors on the pool?

See https://openzfs.github.io/openzfs-docs/man/8/zpool-clear.8.html

ash...housewares · Apr 25, 2021

Redcoat said:
See https://openzfs.github.io/openzfs-docs/man/8/zpool-clear.8.html

Perfect. Thank you!
So, in that case, should I be concerned about that drive at all going forward, or was this just a common aberration?

Redcoat · Apr 25, 2021

Please follow @joeschmuck's advice. I was just helping out (I hope).

GBillR · Apr 25, 2021

ash...housewares said:
Perfect. Thank you!
So, in that case, should I be concerned about that drive at all going forward, or was this just a common aberration?

Your SMART data looks okay for those drives. In fact, I see no errors at all... which makes me wonder if you had a cable issue at some point.

You should be running regular SMART tests, both short and long. Additionally, you should be running regular scrubs. Since at least one scrub in the past found and corrected errors, I personally would run scrubs at least weekly for a couple of weeks to see if it throws another error. Also, one of your SMART tests was interrupted. Be sure to schedule the long tests when you know the NAS will be online long enough to complete them. Based on the recommended polling time near the beginning of your SMART data (1130 minutes), the long test will take over 18 hours to complete.
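
The 18-hour figure comes from the "Extended self-test routine recommended polling time" lines in the smartctl output (1130 minutes for ada0, 1024 for ada2). A minimal Python sketch of that arithmetic, assuming the smartctl -a text is available as a string; extended_test_minutes, minutes_to_hours and SAMPLE are hypothetical names used only for this example.

```python
import re

# The two relevant lines from the ada0 dump earlier in the thread.
SAMPLE = """\
Extended self-test routine
recommended polling time:        (1130) minutes.
"""


def extended_test_minutes(smart_text):
    """Return the extended self-test polling time in minutes, or None if absent."""
    match = re.search(
        r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes",
        smart_text,
    )
    return int(match.group(1)) if match else None


def minutes_to_hours(minutes):
    """Convert minutes to fractional hours."""
    return minutes / 60


if __name__ == "__main__":
    minutes = extended_test_minutes(SAMPLE)
    print(f"{minutes} minutes is about {minutes_to_hours(minutes):.1f} hours")
```

1130 minutes works out to about 18.8 hours and 1024 minutes to about 17.1 hours, so the NAS needs to stay powered on for most of a day for a long test to finish.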

If your system is on 24/7, I would run weekly short tests and monthly long tests. Don't run a long test at the same time as a scrub, and I recommend staggering the long tests so you are only testing one drive at a time. Once you have a few clean scrubs I would probably go to every 2 weeks, but that's just me.
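
One way to picture the staggering is a small planner that hands each drive its own day of the month and keeps clear of scrub days. This is only a Python sketch under stated assumptions: stagger_long_tests and its parameters are invented for illustration, a long test is assumed to occupy its start day and the following day (since it runs for most of a day), and the actual schedule would still be entered by hand in the TrueNAS task settings.

```python
def stagger_long_tests(drives, scrub_days, first_day=1, spacing=2):
    """Assign each drive a day of the month (1-28) for its long SMART test.

    A test is assumed to occupy its start day and the next day, so a
    candidate start day is skipped when either of those is a scrub day.
    Consecutive drives start at least `spacing` days apart.
    """
    blocked = set(scrub_days)
    schedule = {}
    day = first_day
    for drive in drives:
        # Move forward until neither the start day nor the day after is a scrub day.
        while day in blocked or day + 1 in blocked:
            day += 1
        if day > 28:
            raise ValueError("ran out of days in the month for this spacing")
        schedule[drive] = day
        day += spacing
    return schedule


if __name__ == "__main__":
    # Hypothetical example: two drives, scrubs on the 7th, 14th, 21st and 28th.
    print(stagger_long_tests(["ada0", "ada2"], scrub_days=[7, 14, 21, 28]))
```

With a spacing of 2 days, one drive's test has finished before the next one starts, so only one drive is under test at a time.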

Take a look here for some more reading:

https://www.truenas.com/community/r...bleshooting-guide-all-versions-of-freenas.17/ - Thanks for this one @joeschmuck

https://www.truenas.com/community/threads/forum-rules.45124/

joeschmuck · Apr 26, 2021

ash...housewares said:
So, in that case, should I be concerned about that drive at all going forward, or was this just a common aberration?

For this problem, the drives are fine; that is what the data says. If I had seen something in the data that warranted further examination, even if it was unrelated to this problem, I would have pointed it out and provided steps to troubleshoot it.

ash...housewares said:
RAM: 8GB (2x 4GB)

Your RAM is on the low side so be aware that if you try to run a VM or several, you might run into excessive SWAP file usage. Ideally you want no SWAP usage ever.

ash...housewares said:
I ran zpool clear Backup which cleared the error.

Glad that worked. I didn't think rebooting would fix it, but I always prefer to try the simple things first.

ash...housewares · Apr 26, 2021

Thank you all so much for your replies and assistance. I'll set up some scheduled scrubs and S.M.A.R.T. scans as suggested as well as work my way through that Hard Drive Troubleshooting Guide for better education going forward.

Thanks again for all the help!
