Degraded pool, how to restore?

lela_tabathy · May 14, 2021

A recent power outage at my house (I know, UPS is already on order ...) resulted in my pool used for my Jails becoming degraded. It gives me the following error:

Code:

Pool Jailhouse state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected. The following devices are not healthy: Disk 2967587216747608934 is DEGRADED

zpool status shows all kinds of corrupt files and recommends the following actions:

Code:

Restore the file in question if possible. Otherwise restore the entire pool from backup

How do I do that exactly? I have a backup of the entire pool on an external drive that is mounted on the system, with snapshots for the past two weeks.
Do I need to destroy the pool, then recreate it and then use zfs send to copy over the datasets from the backup drive? Is that the correct procedure? The statement of the disk being degraded confused me, does this mean the drive (nvme ssd) itself is broken?

Thanks for your help.

Patrick M. Hausen · May 14, 2021

lela_tabathy said:
Do I need to destroy the pool, then recreate it and then use zfs send to copy over the datasets from the backup drive? Is that the correct procedure?

For restoring an entire pool - yes.
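For illustration, a rough sketch of what that full-restore sequence could look like on the command line. Only the pool name `Jailhouse` comes from this thread; the backup dataset `backup/Jailhouse`, the snapshot name `latest` and the device `nvd1` are placeholders. The function only prints the commands unless `DRY_RUN=0` is set, because `zpool destroy` cannot be undone. On TrueNAS it is probably safer to create the new pool through the web UI and use the command line only for the send/receive step.

```shell
#!/bin/sh
# Sketch only: prints each command; set DRY_RUN=0 to actually run them.
POOL="Jailhouse"                    # pool name from this thread
SOURCE="backup/Jailhouse@latest"    # hypothetical backup dataset and snapshot
DEVICE="nvd1"                       # hypothetical device for the new pool

full_restore() {
    if [ "${DRY_RUN:-1}" != "0" ]; then
        echo "+ zpool destroy $POOL"
        echo "+ zpool create $POOL $DEVICE"
        echo "+ zfs send -R $SOURCE | zfs receive -F $POOL"
        return 0
    fi
    # Irreversible: everything on the old pool is gone after this.
    zpool destroy "$POOL" &&
    zpool create "$POOL" "$DEVICE" &&
    # -R sends the dataset with its children and snapshots;
    # -F lets the receive overwrite the freshly created, empty pool root.
    zfs send -R "$SOURCE" | zfs receive -F "$POOL"
}

full_restore
```

Run as is, this prints the three commands so they can be checked against the real dataset and device names first.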

lela_tabathy said:
The statement of the disk being degraded confused me, does this mean the drive (nvme ssd) itself is broken?

Probably. smartctl -a <device> should tell you more about the disk.

If the disk is (probably) fine and only certain files were damaged beyond repair by the power outage, you could:

- scrub the pool
- delete the damaged files (zpool status -v will tell you which)
- clear the pool status
- restore just those files from one of the backup snapshots, or
- roll back your pool/datasets to a snapshot from before the outage, if there still are any
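The steps above could be sketched roughly as follows. The pool name `Jailhouse` is from this thread; the dataset `Jailhouse/iocage` and the snapshot name are made up for the example and need to be replaced with real ones (`zfs list -t snapshot` shows what exists). The commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: prints each command; set DRY_RUN=0 to actually run them.
POOL="Jailhouse"                     # pool name from this thread
DATASET="Jailhouse/iocage"           # hypothetical dataset name
SNAPSHOT="auto-2021-05-13_00-00"     # hypothetical snapshot from before the outage

run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "+ $*"
    fi
}

# Option 1: repair in place, then restore individual files from backup.
repair_in_place() {
    run zpool scrub "$POOL"        # starts a scrub; it runs in the background
    run zpool status -v "$POOL"    # -v lists the files with permanent errors
    # ...delete the listed files by hand, then reset the error state:
    run zpool clear "$POOL"
}

# Option 2: roll the dataset back to a snapshot taken before the outage.
rollback_dataset() {
    # -r also destroys any snapshots newer than the target
    run zfs rollback -r "$DATASET@$SNAPSHOT"
}

repair_in_place
```

When running this for real, wait for the scrub to finish (check with `zpool status`) before trusting the file list, and call `rollback_dataset` instead if rolling back is the easier route.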

lela_tabathy · May 14, 2021

Thanks, Patrick. The result of the drive's SMART check is below. Looks OK to me?

There are a lot of corrupt files listed in the status report, so I think it would actually be easier to just roll back entirely. I won't suffer any data loss; this is just a pool containing jails.

Code:

Use smartctl -h to get a usage summary

root@truenas[~]# smartctl -a /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       TS512GMTE110S
Serial Number:                      G310940044
Firmware Version:                   T0609B0L
PCI Vendor/Subsystem ID:            0x126f
IEEE OUI Identifier:                0x000000
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri May 14 21:21:22 2021 CEST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0015):     Comp DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    2,342,525 [1.19 TB]
Data Units Written:                 5,813,656 [2.97 TB]
Host Read Commands:                 24,478,753
Host Write Commands:                115,243,806
Controller Busy Time:               585
Power Cycles:                       22
Power On Hours:                     2,582
Unsafe Shutdowns:                   12
Media and Data Integrity Errors:    2,327
Error Information Log Entries:      2,327
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Patrick M. Hausen · May 14, 2021

lela_tabathy said:
Code:
Media and Data Integrity Errors:    2,327
Error Information Log Entries:      2,327

That does not look good. Those entries should be zero or very close to zero. Compare with one of my SSDs - or your other drive.

Code:

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    7,738,103 [3.96 TB]
Data Units Written:                 99,629,206 [51.0 TB]
Host Read Commands:                 139,467,735
Host Write Commands:                1,828,236,715
Controller Busy Time:               30,949
Power Cycles:                       11
Power On Hours:                     6,430
Unsafe Shutdowns:                   7
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               54 Celsius

51 TB written, 0 errors. I guess your drive is done, but it would be great if someone with more knowledge about this kind of hardware could confirm that. I wonder how a single power outage could have caused that; it probably didn't. I have had the usual couple of hard resets, too ...

HTH,
Patrick
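To make the comparison above repeatable, here is a small sketch that pulls the two error counters out of `smartctl -a` output and fails if either is non-zero. The function name `check_nvme_errors` is made up for this example, and it assumes the NVMe field labels are printed exactly as in the output quoted in this thread.

```shell
#!/bin/sh
# Reads `smartctl -a` output on stdin, prints the two NVMe error counters,
# and returns a non-zero exit status if either counter is above zero.
check_nvme_errors() {
    awk -F: '
        /^Media and Data Integrity Errors:/ || /^Error Information Log Entries:/ {
            value = $2
            gsub(/[ ,]/, "", value)     # strip padding and thousands separators
            printf "%s: %d\n", $1, value
            if (value + 0 > 0) bad = 1
        }
        END { exit bad }
    '
}
```

Usage would be something like `smartctl -a /dev/nvme1 | check_nvme_errors || echo "drive reports errors"`. Treating zero as the only good value follows the rule of thumb given above and is not a vendor specification.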

lela_tabathy · May 14, 2021

Mhm, yeah, it is strange. I did wonder: at the same time, I had an issue where the SSL certificate on a remote proxy I'm running on that drive didn't renew automatically, and it left the server vulnerable for a few days until I noticed it. I don't know if it's likely that someone got in? I lack the knowledge to really assess that, though. I will get a new drive just in case.

Patrick M. Hausen · May 14, 2021

lela_tabathy said:
I don't know if it's likely that someone got in?

Because of an expired cert? If that was the only issue with your software, probably not. And even if somebody got in - those are hardware failures.

lela_tabathy · May 14, 2021

Yeah, I guess it doesn't make sense. As I said, I don't know too much about that stuff. But yea, I'll try the process you laid out just to see what happens and will get a new drive either way. Thank you!

lela_tabathy · May 14, 2021

One more question: most of the affected files seem to be in the directory / dataset "Jailhouse/.system/".
I cannot access that directory through the shell. Can you help me out on how to get in there so I can remove the files?

Patrick M. Hausen · May 14, 2021

Type mount <ENTER> and you get a nice map of datasets and corresponding directories.
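If the `mount` listing is long, a small filter can pick out the directory for one dataset. This is only a sketch: the helper name `mountpoint_of` is made up, and it assumes the FreeBSD output format `<dataset> on <directory> (<options>)` with no spaces in the directory name.

```shell
#!/bin/sh
# Reads `mount` output on stdin and prints the directory where the
# dataset named in $1 is mounted (prints nothing if it is not mounted).
mountpoint_of() {
    awk -v ds="$1" '$1 == ds && $2 == "on" { print $3 }'
}
```

Usage would be `mount | mountpoint_of Jailhouse/.system`, followed by `cd` into whatever directory it prints. `zfs list -o name,mountpoint` gives a similar dataset-to-directory map.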

lela_tabathy · May 14, 2021

Ah, thanks.
