Degraded pool, how to restore?

lela_tabathy

Dabbler
Joined
Nov 15, 2020
Messages
12
A recent power outage at my house (I know, a UPS is already on order ...) left the pool I use for my jails degraded. The system gives me the following error:

Code:
Pool Jailhouse state is DEGRADED: One or more devices has experienced an error resulting in data corruption. Applications may be affected. The following devices are not healthy: Disk 2967587216747608934 is DEGRADED


zpool status shows all kinds of corrupt files and recommends the following actions:

Code:
Restore the file in question if possible. Otherwise restore the entire pool from backup


How do I do that exactly? I have a backup of the entire pool on an external drive that is mounted on the system, with snapshots for the past two weeks.
Do I need to destroy the pool, then recreate it and then use zfs send to copy over the datasets from the backup drive? Is that the correct procedure? The statement of the disk being degraded confused me, does this mean the drive (nvme ssd) itself is broken?

Thanks for your help.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Do I need to destroy the pool, then recreate it and then use zfs send to copy over the datasets from the backup drive? Is that the correct procedure?
For restoring an entire pool - yes.
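A rough sketch of that whole-pool restore. It only prints the commands, it does not run them. The pool name comes from your alert; the backup dataset, snapshot name, and disk name are placeholders I made up, so substitute your own. On TrueNAS you would usually create the pool through the web UI rather than with zpool create.

```shell
#!/bin/sh
# Sketch only: prints a restore plan, nothing is executed.
POOL="Jailhouse"            # pool name from the alert
BACKUP="backup/Jailhouse"   # hypothetical: dataset on the external backup drive
SNAP="auto-2021-05-13"      # hypothetical: last snapshot taken before the outage
DISK="nvd1"                 # hypothetical: device the new pool will live on

restore_plan() {
    echo "zpool destroy $POOL"
    echo "zpool create $POOL $DISK"
    # -R sends the dataset with all descendants and their snapshots;
    # -F lets the receiving side overwrite the freshly created, empty pool.
    echo "zfs send -R $BACKUP@$SNAP | zfs recv -F $POOL"
}

restore_plan
```

Read the printed plan carefully before typing any of it: zpool destroy is not reversible, so check first that the backup snapshot really exists.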

The statement of the disk being degraded confused me, does this mean the drive (nvme ssd) itself is broken?
Probably. smartctl -a <device> should tell you more about the disk.

If the disk itself is fine and only certain files are damaged beyond repair due to the power outage, you could
  • scrub the pool
  • delete the damaged files (zpool status -v will tell you which)
  • clear the pool status
  • restore just those files from one of the backup snapshots, or
  • roll back your pool/datasets to a snapshot from before the outage, if there still are any
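The steps above could look roughly like this on the command line. It is a sketch, not a tested procedure: the dataset and snapshot names are made up, and the helper called run only prints each command unless you set DRY_RUN=0.

```shell
#!/bin/sh
POOL="Jailhouse"             # pool name from the alert
DATASET="Jailhouse/iocage"   # hypothetical: dataset holding the damaged files
SNAP="auto-2021-05-13"       # hypothetical: snapshot taken before the outage

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run zpool scrub "$POOL"        # re-read and verify every block in the pool
run zpool status -v "$POOL"    # -v lists the files with permanent errors
# ... delete the listed files, or copy them back from a backup snapshot ...
run zpool clear "$POOL"        # reset the error counters and the pool state
# Alternative to restoring single files: roll the dataset back.
# -r also destroys every snapshot newer than the one named here.
run zfs rollback -r "$DATASET@$SNAP"
```

Wait for the scrub to finish (zpool status shows the progress) before relying on the list of damaged files.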
 

lela_tabathy

Thanks, Patrick. Below is the result of the drive's SMART test. Looks OK to me?

There are a lot of corrupt files listed in the status report, so I think it would actually be easier to just roll back entirely. I won't suffer any data loss; this is just a pool containing jails.

Code:
Use smartctl -h to get a usage summary

root@truenas[~]# smartctl -a /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       TS512GMTE110S
Serial Number:                      G310940044
Firmware Version:                   T0609B0L
PCI Vendor/Subsystem ID:            0x126f
IEEE OUI Identifier:                0x000000
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri May 14 21:21:22 2021 CEST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x0015):     Comp DS_Mngmt Sav/Sel_Feat
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    2,342,525 [1.19 TB]
Data Units Written:                 5,813,656 [2.97 TB]
Host Read Commands:                 24,478,753
Host Write Commands:                115,243,806
Controller Busy Time:               585
Power Cycles:                       22
Power On Hours:                     2,582
Unsafe Shutdowns:                   12
Media and Data Integrity Errors:    2,327
Error Information Log Entries:      2,327
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
 

Patrick M. Hausen

Code:
Media and Data Integrity Errors: 2,327
Error Information Log Entries: 2,327
That does not look good. Those entries should be zero or very close to zero. Compare with one of my SSDs - or your other drive.
Code:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        44 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    4%
Data Units Read:                    7,738,103 [3.96 TB]
Data Units Written:                 99,629,206 [51.0 TB]
Host Read Commands:                 139,467,735
Host Write Commands:                1,828,236,715
Controller Busy Time:               30,949
Power Cycles:                       11
Power On Hours:                     6,430
Unsafe Shutdowns:                   7
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               44 Celsius
Temperature Sensor 2:               54 Celsius


Mine has 51 TB written and 0 errors. I guess your drive is done. But it would be great if someone with more knowledge about this kind of hardware could confirm that. I wonder how a single power outage could have caused that - probably it didn't. I have had the usual couple of hard resets, too ...
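If you want to check several drives for just those two counters, a small filter like this might help. The function name nonzero_errors is my own invention; it reads smartctl -a output on standard input and prints an error counter only when it is above zero.

```shell
#!/bin/sh
# Print the NVMe error counters from `smartctl -a` output that are not zero.
nonzero_errors() {
    awk -F':' '
        /^(Media and Data Integrity Errors|Error Information Log Entries):/ {
            value = $2
            gsub(/[ \t,]/, "", value)   # drop padding and thousands separators
            if (value + 0 > 0) print $1 ": " value
        }'
}

# On a live system:  smartctl -a /dev/nvme1 | nonzero_errors
# Here the filter is fed three lines copied from the report above.
printf '%s\n' \
    'Unsafe Shutdowns:                   12' \
    'Media and Data Integrity Errors:    2,327' \
    'Error Information Log Entries:      2,327' | nonzero_errors
```

A healthy drive produces no output at all, which makes it easy to run over every device in the box.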

HTH,
Patrick
 

lela_tabathy

Mhm, yeah, it is strange. I did wonder: at the same time, I had an issue where the SSL certificate on a remote proxy I'm running on that drive didn't renew automatically, and it left the server vulnerable for a few days until I noticed it. I don't know if it's likely that someone got in? I lack the knowledge to really assess that, though. I will get a new drive just in case.
 

Patrick M. Hausen

I don't know if it's likely that someone got in?
Because of an expired cert? If that was the only issue with your software, probably not. And even if somebody got in - those are hardware failures.
 

lela_tabathy

Yeah, I guess it doesn't make sense. As I said, I don't know too much about that stuff. But yeah, I'll try the process you laid out just to see what happens, and I will get a new drive either way. Thank you!
 

lela_tabathy

One more question: most of the affected files seem to be in the directory / dataset "Jailhouse/.system/".
I cannot access that directory through the shell. Can you help me out on how to get in there so I can remove the files?
 

Patrick M. Hausen

Type mount <ENTER> and you get a nice map of datasets and corresponding directories.
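To pick a single dataset out of that map, a small filter along these lines might do. The function name mountpoint_of is made up for this example, and it assumes the mount point contains no spaces.

```shell
#!/bin/sh
# Print the directory a dataset is mounted on.
# $1 = dataset name; reads `mount` output ("<dataset> on <dir> (...)") on stdin.
mountpoint_of() {
    awk -v ds="$1" '$1 == ds && $2 == "on" { print $3 }'
}

# Usage on the affected system (dataset name from the question above):
#   mount | mountpoint_of "Jailhouse/.system"
```

zfs list -r -o name,mountpoint Jailhouse should give the same information straight from ZFS.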
 