Disk failed Issue

Mattia3rd · Sep 4, 2023

Hi
Yesterday I had a hard disk failure and I need help to resolve the situation correctly.

I have a pool with 4+1 (spare) disks.
The pool is now in a degraded state and this is the output:

Now my questions:
1. Which steps do I need to follow to put the pool back in a healthy state?
2. How can I identify the faulty disk? Is there a way to safely remove the SATA cables, one by one, to find the faulty ada4 disk?

Thank you

Redcoat · Sep 4, 2023

Mattia3rd said:
how can I identify the faulty disk?

You need to find ada4: Storage>Disks will list the drive serial numbers by drive name - note the ada4 serial number, shut off your server and open it up, find the bad drive by serial number and remove it.

Replace it with a new drive, hopefully one already tested/burned-in, from your shelf. Restart your server and allow the pool to resilver the new disk.

joeschmuck · Sep 4, 2023

How often do you run SMART tests on your drives, including SMART Long/Extended tests? Drive ada6 is looking like it could be in trouble as well. Just pointing out you may have other pending drive failures.

Mattia3rd · Sep 4, 2023

joeschmuck said:
How often do you run SMART tests on your drives, including SMART Long/Extended tests? Drive ada6 is looking like it could be in trouble as well. Just pointing out you may have other pending drive failures.

Hi,
this is my SMART test schedule:

To me the only faulty disk is ada4; there is no evidence of other disk issues. Am I missing something?
What I don't understand is why the pool is still "degraded" even though the spare disk is now online and active.
This is really strange.

Do you have any further advice?

Mattia3rd · Sep 4, 2023

Redcoat said:
You need to find ada4: Storage>Disks will list the drive serial numbers by drive name - note the ada4 serial number, shut off your server and open it up, find the bad drive by serial number and remove it.

Replace it with a new drive, hopefully one already tested/burned-in, from your shelf. Restart your server and allow the pool to resilver the new disk.

Thank you, I'll do it, but I still have one doubt: do I need to "detach" the disk?
The question now is: why is the pool still in degraded status?
I had the spare disk to avoid this situation...

Redcoat · Sep 4, 2023

Mattia3rd said:
Thank you, I'll do it, but I still have one doubt: do I need to "detach" the disk?
The question now is: why is the pool still in degraded status?
I had the spare disk to avoid this situation...

You do not need to detach ada4 - in faulted state it’s already detached. Glad to hear you have a prepared disk ready!

Mattia3rd · Sep 4, 2023

Redcoat said:
You do not need to detach ada4 - in faulted state it’s already detached. Glad to hear you have a prepared disk ready!

Thank you for the confirmation!
Yes, I had a spare disk set up in there.

One last question: why is the pool still degraded even though the spare disk replaced the faulty one?
Is this normal?

joeschmuck · Sep 4, 2023

Mattia3rd said:
Am I missing something?

Oh yes you are: you are only running a SHORT test. This is a basic quick test to ensure the hardware is operating. It does perform a very small surface test, but that is more to confirm the drive is working; it does not scan the entire drive surface for defects. That is what a LONG test does.

My advice, and what I advise 99% of the time: run a daily SHORT test and a weekly LONG test. If you do not use your NAS at night (everyone in bed), then that is the best time to run the LONG test, whilst the SHORT tests only take a few minutes (typically less than 2 minutes), so those can be done anytime. My schedule is a daily SHORT test at 1 AM and a weekly LONG test each Thursday at 1:05 AM. This is easier than scheduling a SHORT test for every day except Thursday. If your drives run 100% of the time, there is no impact on drive life. If you sleep your drives most of the time, I'd possibly recommend a different schedule.

Mattia3rd said:
One last question: why is the pool still degraded even though the spare disk replaced the faulty one?
Is this normal?

Yes, it is normal until the resilvering is completed; after that your "Spare" should detach automatically.

If you still have a degraded system after, please post.

Mattia3rd · Sep 4, 2023

Thank you for the great details.

This is the current situation:

  pool: RaidZ01
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 36K in 00:00:01 with 0 errors on Sun Sep  3 04:54:47 2023
config:

        NAME                                              STATE     READ WRITE CKSUM
        RaidZ01                                           DEGRADED     0     0     0
          raidz1-0                                        DEGRADED     0     0     0
            gptid/20c84800-9a4d-11e7-acd7-001b78595016    ONLINE       0     0    32
            spare-1                                       DEGRADED     0     0    32
              gptid/220ede4f-9a4d-11e7-acd7-001b78595016  FAULTED      1     2     0  too many errors
              gptid/c9f54388-cc40-11e7-bb41-001b78595016  ONLINE       0     0     0
            gptid/23389c39-9a4d-11e7-acd7-001b78595016    ONLINE       0     0    32
            gptid/24568129-9a4d-11e7-acd7-001b78595016    ONLINE       3   216     0
        cache
          gptid/a8ecd5e8-73f6-11e9-aab3-001b78595016      ONLINE       0     0     0
        spares
          gptid/c9f54388-cc40-11e7-bb41-001b78595016      INUSE     currently in use

errors: No known data errors
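
For readers following along, the error counters in the status output above can be pulled out with a short script. This is only a sketch in Python, assuming the five-column layout shown above (NAME STATE READ WRITE CKSUM). The function name `devices_with_errors` is made up for illustration, and the sample text is the config table from the output above without the cache and spares sections.

```python
# Sketch: list devices with non-zero READ/WRITE/CKSUM counters in
# `zpool status` text. Assumes plain integer counters; values that zpool
# abbreviates (for example with a K suffix) are not handled here.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
RaidZ01 DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
gptid/20c84800-9a4d-11e7-acd7-001b78595016 ONLINE 0 0 32
spare-1 DEGRADED 0 0 32
gptid/220ede4f-9a4d-11e7-acd7-001b78595016 FAULTED 1 2 0 too many errors
gptid/c9f54388-cc40-11e7-bb41-001b78595016 ONLINE 0 0 0
gptid/23389c39-9a4d-11e7-acd7-001b78595016 ONLINE 0 0 32
gptid/24568129-9a4d-11e7-acd7-001b78595016 ONLINE 3 216 0
"""


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows with any error."""
    rows = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows have a name, a state and three numeric counters.
        if len(fields) < 5 or not all(f.isdigit() for f in fields[2:5]):
            continue
        name, state = fields[0], fields[1]
        read, write, cksum = (int(f) for f in fields[2:5])
        if read or write or cksum:
            rows.append((name, state, read, write, cksum))
    return rows


for row in devices_with_errors(SAMPLE):
    print(row)
```

For this output it prints five rows: the faulted disk, the spare-1 group, and three disks that are still ONLINE but show checksum or read/write errors.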

joeschmuck · Sep 4, 2023

First of all, do what you have been told, replace the failed drive.

Once that is complete and resilvering is done, I'd recommend you read your SMART data. If all looks good, run a SMART LONG test on each drive.

Use the command smartctl -a /dev/ada2 to generate the data report, and smartctl -t long /dev/ada2 to run the SMART Long test on drive ada2. You can run SMART tests on all the drives at the same time; just repeat the test command for each drive. When you enter the command you should get a response stating how long the test takes to complete, in minutes. Wait the specified time for each drive and then run that first command again. Post the output of the results using Code (look for the three dots to the right of the insert picture icon, select Code, then paste the data). Do this for all your drives. I assume ada0 and ada1 are a boot mirror; you might as well do those too.
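
As a rough illustration of "repeat the test command for each drive", the Python sketch below only builds and prints the two smartctl commands per drive; it does not run anything. The drive list is a placeholder (an assumption, not taken from this system), and the helper name `smart_commands` is made up.

```python
import shlex

# Placeholder drive list; substitute the names shown under Storage > Disks.
DRIVES = ["ada2", "ada3"]


def smart_commands(drives):
    """Return (start_long_test, read_report) command pairs, one per drive."""
    pairs = []
    for drive in drives:
        device = f"/dev/{drive}"
        start = ["smartctl", "-t", "long", device]  # start the LONG test
        report = ["smartctl", "-a", device]         # read the full report
        pairs.append((start, report))
    return pairs


for start, report in smart_commands(DRIVES):
    # Printed only; nothing is executed here. Run the commands by hand.
    print(shlex.join(start))
    print(shlex.join(report))
```

The report command is worth running once the time announced by the test command has passed for that drive.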

BUT, only take on what you feel comfortable with. Don't do too many things at once. Let the drive resilver before doing anything else.

Once you have posted the SMART data for the drives, someone will be able to give you clear information on whether a drive is suspected of failing, or maybe has already failed and you just haven't noticed it yet. One sure sign of failure is a SMART test that does not complete. I see it a lot; it happens even to me.

I want to tell you more but I do not want to overwhelm you with information and cause you to goof and risk your data.

Best of luck to you.

Mattia3rd · Sep 6, 2023

Thank you again.

Now I'm trying to put my data in a safe place.
I bought 2 more disks and created a new pool, and I am copying all my data to this new pool with a "cp" command.
Still working on it.

Heracles · Sep 6, 2023

joeschmuck said:
schedule a Weekly LONG test each Thursday at 1:05 AM

Hey @joeschmuck,

My two cents here: I recommend people -NEVER- schedule anything between 01:00 AM and 03:00 AM. The reason is that when the time switches between standard and daylight saving time, the change happens at 02:00 AM. Either the clock goes back to 01:00 and whatever was scheduled between 01:00 and 02:00 runs twice, or it jumps to 03:00 AM and whatever was scheduled between 02:00 and 03:00 is skipped.
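
The rule above can be written down as a tiny check. This Python sketch assumes the clock change happens at 02:00 local time as described; some regions switch at a different local time, so the window would need adjusting there. The name `dst_risk` is made up for illustration.

```python
from datetime import time


def dst_risk(start):
    """Classify a daily local start time against a 02:00 clock change."""
    if start.hour == 1:
        # 01:00-01:59 happens twice on the night the clock goes back.
        return "may run twice"
    if start.hour == 2:
        # 02:00-02:59 does not exist on the night the clock jumps to 03:00.
        return "may be skipped"
    return "unaffected"


print(dst_risk(time(1, 5)))   # the 1:05 AM LONG test mentioned earlier
print(dst_risk(time(2, 30)))
print(dst_risk(time(3, 15)))
```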

In the same way, I never schedule anything for the 29th, 30th or 31st of a month, because it will not be executed every month.
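
A quick way to see the effect, sketched in Python with the standard `calendar` module; the helper name `months_with_day` is made up, and 2023 is used only because it is the year of this thread.

```python
import calendar


def months_with_day(year, day):
    """Return the month numbers of `year` that contain the given day."""
    return [m for m in range(1, 13) if calendar.monthrange(year, m)[1] >= day]


for day in (28, 29, 30, 31):
    print(day, len(months_with_day(2023, day)))
```

For 2023 a job on the 28th runs in all 12 months, a job on the 29th or 30th in 11, and a job on the 31st in only 7.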

joeschmuck · Sep 6, 2023

@Heracles Those are some good points. However, if there is a time change, I would expect every clock to change, so the process stays the same. At work we set the time to UTC on everything and that was it. If you had to look at a log date/time, you had to subtract 4 or 5 hours to get the local time. And I understand the end-of-month thing. I prefer to use a day that exists in every month, 28 or below, or the first, second, or third Sunday of a month, for example. People just need to be smart about how they schedule things. I would hate to try and coordinate something in a data center. Yikes!

Mattia3rd · Sep 12, 2023

Hi all,
I'm back. I finally copied all my data to a new pool, which is made of 1 disk.
Now I'd like to add a second disk (same type and size), and I'd like to have a RAID10 configuration. Could you kindly tell me how to achieve that?
I want to add it as a spare. Is this the correct way to achieve that configuration?

Heracles · Sep 12, 2023

The first step is to turn your 1-drive vDev into a mirror. That is done by adding a second drive to that vDev.
Once that is done, you add more drives to the system, create a new mirror vDev, and add it to your existing pool.

And there you are: a series of mirrors striped in a single pool.
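
In plain ZFS terms the two steps above correspond to `zpool attach` and `zpool add ... mirror`. The Python sketch below only builds and prints those commands so the order of arguments is clear. The pool and disk names are placeholders, not taken from this system, and on TrueNAS the web UI is the usual way to do this.

```python
import shlex


def attach_command(pool, existing_disk, new_disk):
    """Step 1: attach a disk to a single-disk vdev, turning it into a mirror."""
    return ["zpool", "attach", pool, existing_disk, new_disk]


def add_mirror_command(pool, disk_a, disk_b):
    """Step 2: add a further two-disk mirror vdev to the same pool."""
    return ["zpool", "add", pool, "mirror", disk_a, disk_b]


# Placeholder pool and device names, for illustration only.
print(shlex.join(attach_command("newpool", "ada1", "ada2")))
print(shlex.join(add_mirror_command("newpool", "ada3", "ada5")))
```

Note the difference: `attach` names the disk that is already in the pool plus the new one, while `add` without the `mirror` keyword would stripe a lone disk into the pool with no redundancy.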

Mattia3rd · Sep 13, 2023

Heracles said:
The first step is to turn your 1-drive vDev into a mirror. That is done by adding a second drive to that vDev.
Once that is done, you add more drives to the system, create a new mirror vDev, and add it to your existing pool.

And there you are: a series of mirrors striped in a single pool.

Thanks for the advice, but it would be better to have an illustrated guide, just to be sure.

I had the chance to figure it out by myself, and the point is:

After a vdev is created, more drives cannot be added to that vdev

So here are the points:
1. I created a pool with 1 disk. I cannot convert it to RAID10; I can add a second disk, but only as a spare (for redundancy).
2. To get a RAID10 pool I have to create the pool with at least 2 disks and the "mirror" option. Then the pool is RAID10 and new disks can be added, 2 at a time, to increase its size.

So, this is just to clarify.

The last thing I have to configure is the extended smartctl check. I'm working on it and I'll let you know :)

NugentS · Sep 13, 2023

@Mattia3rd
As @Heracles says - add the new disk to the first, in the same vdev. Not as a spare. I believe "extend" is the correct term. Make sure NOT to add the second disk as a stripe. Adding the second disk as a spare would be a complete waste of time. If you extend the single-disk vdev, it turns into a mirror (similar to RAID1; adding further mirror vdevs later gives the RAID10-like layout).

"After a vdev is created, more drives cannot be added to that vdev"
Correct for RAIDZ, not correct for mirrors
