Pool is degraded, doesn't tell me why

emsicz

Explorer
Joined
Aug 12, 2021
Messages
78
I'm new to TrueNAS. I did use the search box and didn't find the answer. My TrueNAS build has 5x6TB WD RED EFRX series drives in the affected pool (pool 1). This machine is also running another pool (pool 2) with 4x3TB WD RED EFRX series drives, which is not affected. I have set up the TrueNAS pools and started transferring some data onto them. This has been running for the last 48 hours. Today I was toying around with the Nextcloud plugin and noticed by accident that pool 1 is degraded.
  • TrueNAS didn't give me any notification or indication that there's an issue. I find this really weird. Degraded pool seems very serious to me, I don't understand why the UI isn't giving me any visible clues that I need to drop everything and focus on keeping the data safe. The dashboard screen shows a little label that is easily missed, but at least it's showing something. However, dashboard isn't what I keep opened in my browser. I'm looking at plugins page and nothing in here is telling me "oh by the way your pool is degraded".
  • When I go into Storage / Pools page, pool 1 is showing yellow triangle with label "DEGRADED". No explanation as to why.
  • If I click on the little wheel icon and select "Status", it gives me table of drives with one of them having Status of "FAULTED".
  • The table on this page is weird. It's showing all zeros in all columns, except for the affected drive, which has a "Write" value of 112. No idea what this means.
  • Three dot menu for the affected drive is giving me 4 options: Edit, Offline, Online, Replace. I can keep clicking "Online" and nothing changes. It spins the activity indicator for a while and then doesn't even give me a message whether the operation has succeeded or not. I don't understand why these options are all there or what they do, really. Other options I haven't yet tried.
I haven't found any clear documentation on how to proceed in a situation like this. I'm using a RAIDZ2 configuration, so the pool should still have redundancy and should be fine, but there's nothing in the UI telling me as much. Again, I don't find this UI very intuitive or functional.
  1. Can I find out what is wrong with the FAULTED drive, that is see why it was faulted?
  2. It would appear to me that the course of action here is to select the "Replace" option here and replace the faulted drive. Now, do I need to click "Offline" first or why is that option even there and what does it do?
 

emsicz

2021-08-12_195610.png

Here is the mentioned pool status table. I don't really get what I'm looking at here. The pool has 1.5TB of data on it, so I don't understand why Write column is showing zeros.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Write column is showing zeros
Zero write errors. Or 112 write errors for ada2.

Can I find out what is wrong with the FAULTED drive, that is see why it was faulted?
Write errors, as per above.

It would appear to me that the course of action here is to select the "Replace" option here and replace the faulted drive. Now, do I need to click "Offline" first or why is that option even there and what does it do?
As per the manual, you'll want to offline the disk first. Unless you have a spare bay, in which case, go ahead with the replacement before offlining the bad disk.
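
For anyone who prefers the shell, the same table is what `zpool status` prints, and the READ/WRITE/CKSUM columns are error counters, not I/O volume. Below is a rough Python sketch that picks out the rows with non-zero counters. The sample text is illustrative only: the layout follows the usual `zpool status` format, the `ada2` / 112 values come from this thread, and every other name is made up. Large counters can be abbreviated (e.g. `1.2K`), which this simple pattern does not handle.

```python
import re

# Illustrative output only; pool and device names are assumptions for this sketch.
SAMPLE = """\
  pool: pool1
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        pool1       DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    FAULTED      0   112     0  too many errors
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0

errors: No known data errors
"""

# A device row: indented name, upper-case state, then three integer counters.
ROW = re.compile(r"^\s+(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)")


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with a non-zero counter."""
    rows = []
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines and the column titles do not match
        name, state = match.group(1), match.group(2)
        read, write, cksum = (int(match.group(i)) for i in (3, 4, 5))
        if read or write or cksum:
            rows.append((name, state, read, write, cksum))
    return rows


print(devices_with_errors(SAMPLE))
```

On a real system the text would come from running `zpool status` in the TrueNAS shell rather than from a hard-coded string.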
 

Hellione

Explorer
Joined
Jan 23, 2021
Messages
55
-TrueNAS didn't give me any notification
#Did you set up e-mail notification? If not, you won't get a notification. If yes, check your config.
-When I go into Storage / Pools page, pool 1 is showing yellow triangle with label "DEGRADED". No explanation as to why.
If I click on the little wheel icon and select "Status", it gives me table of drives with one of them having Status of "FAULTED".
# "of them having Status of "FAULTED"" is the explanation for "DEGRADED" pool, what more info do you want/expect?
-Can I find out what is wrong with the FAULTED drive, that is see why it was faulted?
#Yes, write errors, but it is not necessary to know what exactly. You know it is almost dead, just change it.
@eric already answered how to replace it.
 

emsicz

I have opted for the "Offline" option. The activity indicator rolled for a bit and disappeared. No message whether it was successful or not, or whether anything is happening. The disk still says "Degraded" and it appears like nothing is happening. If offline means what I think it means (move data off, remove from pool, resize everything so that the pool is fully operational with 4 drives instead of 5) then I guess this will take several hours. I can't find any way to see the progress except to visit the disk page every once in a while to see if the status has changed from Degraded to Offline, I guess.
 

emsicz

# "of them having Status of "FAULTED"" is the explanation for "DEGRADED" pool, what more info do you want/expect?

I expected a big massive message written all over the page telling me the NAS has serious issue. Like every other NAS UI I've ever seen or any other system that works with disk arrays.

-Can I find out what is wrong with the FAULTED drive, that is see why it was faulted?
#Yes, write errors, but it is not necessary to know what exactly. You know it is almost dead, just change it.

I would kindly request that I decide what is necessary to know. Most of the time when I get write errors on my other towers/servers, it's the SATA cable. This is actually visible in SMART as Interface CRC errors, which can sometimes be presented as write issues. The SMART thing seems broken on TrueNAS. When I go into Storage / Disks, I select the affected disk and click "SMART Test Results", it tells me "No test results were found." I don't know how to retrieve the SMART data on TrueNAS. On Windows I usually run HDTune to view SMART data. I'll do some quick search on whether there's a way to see this in TrueNAS.
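
For reference, the raw SMART attributes can be read from the shell with `smartctl -A /dev/ada2`, independent of whether any SMART tests have been scheduled. The sketch below parses that attribute table in Python and applies a rough cable-versus-disk heuristic. The sample values are invented for illustration, and the heuristic (CRC errors present, no reallocated or pending sectors) is only an assumption about what usually points at cabling, not a diagnosis.

```python
# Invented sample in the usual `smartctl -A` column layout; values are illustrative.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       37
"""


def smart_raw_values(text):
    """Map attribute name -> raw value (as a string) for each attribute row."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            values[fields[1]] = fields[9]
    return values


def looks_like_cabling(raw):
    """Heuristic (an assumption): CRC errors without sector problems hint at the link."""
    crc = int(raw.get("UDMA_CRC_Error_Count", "0"))
    realloc = int(raw.get("Reallocated_Sector_Ct", "0"))
    pending = int(raw.get("Current_Pending_Sector", "0"))
    return crc > 0 and realloc == 0 and pending == 0


raw = smart_raw_values(SAMPLE)
print(raw["UDMA_CRC_Error_Count"], looks_like_cabling(raw))
```

Attribute names vary somewhat between drive vendors, so the three names used here may need adjusting for a given disk.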
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I expected a big massive message written all over the page telling me the NAS has serious issue. Like every other NAS UI I've ever seen or any other system that works with disk arrays.

Except this isn't really a serious issue. It's a common thing for hard drives to go bad. The NAS lets you know, if you've configured notifications. You haven't lost data or even redundancy. Systems with hundreds of disks develop new problem disks every few weeks, and it isn't an earth-shattering crisis when it happens. ZFS expects this to happen, and is very careful to make sure it is handled correctly.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
When I go into Storage / Disks, I select the affected disk and click "SMART Test Results", it tells me "No test results were found." I don't know how to retrieve the SMART data on TrueNAS.
Did you turn on SMART tests in Services and then for each individual HDD (Storage>Disks)?
 

emsicz

I have opted for the "Offline" option. The activity indicator rolled for a bit and disappeared. No message whether it was successful or not, or whether anything is happening. The disk still says "Degraded" and it appears like nothing is happening. If offline means what I think it means (move data off, remove from pool, resize everything so that the pool is fully operational with 4 drives instead of 5) then I guess this will take several hours. I can't find any way to see the progress except to visit the disk page every once in a while to see if the status has changed from Degraded to Offline, I guess.

After almost 2 days, the disk still says Faulted, and if I click the "Offline" option in the three dot menu, the activity indicator spins for a bit and then nothing happens. Not sure how to go about putting the disk offline.
 

emsicz

Except this isn't really a serious issue. It's a common thing for hard drives to go bad. The NAS lets you know, if you've configured notifications. You haven't lost data or even redundancy. Systems with hundreds of disks develop new problem disks every few weeks, and it isn't an earth-shattering crisis when it happens. ZFS expects this to happen, and is very careful to make sure it is handled correctly.

Thanks for the explanation, I read more about ZFS and I think the answer is accurate. I'm not sure if my installation of TrueNAS uses ZFS, because this information doesn't appear to be present when I browse through Storage tabs and views. Also, I noticed TrueNAS Core UI actually has a top bar with bell icon which does contain notification about failed disk and degraded pool. I just didn't see the indication on a large screen, but at least it's there and I know where to look. Thanks.
 

jgreco

Of course the answer is accurate, I wrote it. ;-) :tongue: ;-)

Every installation of TrueNAS uses ZFS. It has been years since FreeNAS dropped support for UFS.

TrueNAS is certainly going to let you know that there's a problem. Ideally you want to have e-mail notifications enabled, because it is super-inconvenient to have to log into a webUI to find out that a disk is having problems. Dedicated NAS units sometimes have a beeper or sounder that alerts anyone in the area of a problem, but there is no clear standard for this on Intel architecture servers, so in my opinion e-mail or proactive monitoring is necessary.
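
As one possible shape for the "proactive monitoring" part, here is a minimal Python sketch built around `zpool status -x`, which prints `all pools are healthy` when nothing needs attention. The function names are hypothetical, and how the result gets delivered (e-mail, chat hook, etc.) is deliberately left out. Anything other than the healthy message, including a system with no pools at all, is treated as worth a look.

```python
import subprocess


def needs_attention(status_x_output):
    """True unless `zpool status -x` reported that every pool is healthy."""
    return status_x_output.strip() != "all pools are healthy"


def check_pools():
    """Run `zpool status -x` and return (needs_attention, raw_output).

    Assumes the `zpool` binary is on PATH, as it is in the TrueNAS shell.
    """
    result = subprocess.run(
        ["zpool", "status", "-x"],
        capture_output=True,
        text=True,
        check=False,
    )
    return needs_attention(result.stdout), result.stdout
```

Calling `check_pools()` from a scheduled job and forwarding the raw output whenever the first element is true would be one way to use it; the built-in TrueNAS alert e-mails cover the same ground without any scripting.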

In general, TrueNAS expects that you will design a pool and your vdevs in a manner consistent with your expectations for data protection, but it is important to recognize that there's an underlying mindset in ZFS that is oriented towards a system with at least a dozen disks. I generally build a RAIDZ system with a dozen disks as an 11-disk RAIDZ3 with a warm spare disk, which means that when a disk fails, there is a slight drop in available redundancy for the amount of time it takes for the spare to be resilvered into place, but even during that time, there is still double-redundancy available, which aligns with the idea that a single disk failure should not constitute a crisis or compromise redundancy.

Many smaller two or four drive NAS units have a much more significant issue when there's a failure; two drive units are typically mirrored, and four drive units are often RAID5 or RAID10, both of which lose redundancy when there is a disk failure. In such cases, it is definitely important to be made immediately aware when there's a failure. You don't get the red flashing light of doom with FreeNAS or the annoy-o-beeper, and as you note, the in-UI notification of problems is not super-in-your-face about it.
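
The arithmetic behind that comparison is simple enough to sketch in Python. The table below only counts parity (or mirror copies) per vdev; the layout labels are shorthand chosen for this example, and RAID10 is left out because its tolerance depends on which disks fail.

```python
# Disk failures each layout tolerates without data loss (per vdev / array).
TOLERATED_FAILURES = {
    "mirror-2way": 1,
    "raid5": 1,
    "raidz1": 1,
    "raidz2": 2,
    "raidz3": 3,
}


def remaining_redundancy(layout, failed_disks):
    """Further failures the layout can absorb; a negative result means data loss."""
    return TOLERATED_FAILURES[layout] - failed_disks


# RAIDZ3 with one failed disk, before the spare has resilvered: still double redundancy.
print(remaining_redundancy("raidz3", 1))
# The 5-disk RAIDZ2 pool from this thread with one faulted disk: one failure to spare.
print(remaining_redundancy("raidz2", 1))
# A four-drive RAID5 unit with one failed disk: nothing left.
print(remaining_redundancy("raid5", 1))
```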

So, just things to think about.
 

emsicz

I restarted the thing and affected disk started showing as "Offline". When I tried to turn it "Online" out of curiosity, it would show "Faulted". If I then again click the option to send it "Offline", it would show "Offline" state immediately. I have shut the PC down and replaced cable for the affected disk. Then I started the PC up, waited for TrueNAS to boot and went into POOL1 status page. The disk was shown as "Offline". I clicked the three dot menu and selected "Online" option. The disk immediately went into "Online" state and dashboard was showing all green. I watched it for a moment and I noticed there was a bell notification after a moment:

2021-08-14_142103.png


I wasn't sure what was happening and thought maybe during replacing the cable I caused issue with another disk, so I went into Storage view:

2021-08-14_142150.png


I was wandering about for a moment there, because nothing would tell me why the pool is unhealthy or what that means (lost redundancy? lost data? For example, unhealthy state in Storage Spaces for Windows means completely lost array with data being unrecoverable, MSDN page for this state actually says the data needs to be recovered from a backup and there's no recovery of the pool possible). Eventually I noticed on pool status page, the UI was telling me a drive was being resilvered:

2021-08-14_142829.png


Somehow I was able to get into this view, but I don't remember how I did that:

2021-08-14_142130.png


I'll wait for TrueNAS to finish its thing and see if data was lost or not. Please feel free to let me know if there was anything wrong with my approach.
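
For watching the resilver without clicking through the UI, the `scan:` section of `zpool status` carries the progress. A small Python sketch that extracts the percentage is below. The sample text is illustrative: it follows the layout recent OpenZFS versions print, but the timestamp and figures are invented, and older versions word these lines slightly differently.

```python
import re

# Invented sample of the scan section; figures and timestamp are illustrative only.
SAMPLE_SCAN = """\
  scan: resilver in progress since Sat Aug 14 14:20:11 2021
        1.02T scanned at 1.10G/s, 310G issued at 334M/s, 1.50T total
        61.9G resilvered, 20.18% done, 01:02:33 to go
"""


def resilver_progress(status_text):
    """Return the resilver completion percentage, or None if no resilver is running."""
    if "resilver in progress" not in status_text:
        return None
    match = re.search(r"([\d.]+)% done", status_text)
    return float(match.group(1)) if match else None


print(resilver_progress(SAMPLE_SCAN))
```

Once the resilver finishes, the same section switches to a summary line with an error count, and the function returns `None`.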
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
I restarted the thing and affected disk started showing as "Offline". When I tried to turn it "Online" out of curiosity, it would show "Faulted". If I then again click the option to send it "Offline", it would show "Offline" state immediately. I have shut the PC down and replaced cable for the affected disk.

I'll wait for TrueNAS to finish its thing and see if data was lost or not. Please feel free to let me know if there was anything wrong with my approach.

You likely had a bad cable, or a cable under stress. I had a similar problem with my "gamer case", and a SATA cable that had to make a 90 deg bend to get the side cover on. Switching to pre-formed 90 deg bent ends solved the problem for me.

From your screenshots, your pool is RAIDZ2, and should be able to handle losing two drives simultaneously. As others have mentioned, replacing a single drive in RAIDZ2 is really a routine maintenance task. The risk to your data here is low, and your NAS should be essentially fully operational even while the resilvering task is running. Make sure your SMART tests, Scrubs, and notifications are set up correctly, and just use the thing. It will tell you if there's a problem. :cool:
 

emsicz

The bad cable is confirmed. During the replacement I must also have bumped another cable, which then failed during recovery. TrueNAS handled this with yet another notification; the data was still accessible, but had no redundancy at the time. I think this was sufficient: now that I know there is an on-screen notification in TrueNAS, I know where to look, and email notifications are another option.

I received some great help here, thanks.

Update: I have replaced all SATA cables shortly after this and I have been using TrueNAS since without issues. I find the system incredibly stable and dependable (and RAM-hungry, so RAM upgrade is scheduled).
 