Best method for checking and replacing a faulty drive that's within warranty

Joined
Dec 26, 2021
Messages
20
Hey folks. I have an 8x 6TB RAIDZ2 pool that has been great for over a year now. All 8 of my drives are still under warranty. Earlier today, my twice-monthly scrub showed that one drive suffered 138 read errors. The last SMART test that ran on the drive was a short test on January 1st, and it reported a read failure:

# 1 Short offline Completed: read failure 90% 27132 1528286552

However, the top of the SMART data section says this:

SMART overall-health self-assessment test result: PASSED
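For anyone who wants to script this check: the self-test log line above follows the column layout that `smartctl -l selftest` prints (Num, Test_Description, Status, Remaining, LifeTime(hours), LBA_of_first_error). Below is a minimal Python sketch that splits such a line into fields. The `SelfTestEntry` name and the regular expression are assumptions made up for illustration, not part of smartmontools, and the pattern only covers the common test descriptions.

```python
import re
from typing import NamedTuple, Optional


class SelfTestEntry(NamedTuple):
    """One row of the SMART self-test log (hypothetical helper type)."""
    num: int
    description: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    lba_of_first_error: Optional[int]


# Assumed pattern: covers the usual "<kind> offline|captive" descriptions only.
_LINE = re.compile(
    r"^#\s*(\d+)\s+"
    r"((?:Short|Extended|Conveyance|Selective)\s+(?:offline|captive))\s+"
    r"(.+?)\s+"      # status text, e.g. "Completed: read failure"
    r"(\d+)%\s+"     # percentage of the test that was still remaining
    r"(\d+)\s+"      # power-on hours when the test ran
    r"(\S+)\s*$"     # LBA of first error, or "-" when there was none
)


def parse_selftest_line(line: str) -> SelfTestEntry:
    """Split one self-test log row into typed fields."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised self-test log line: {line!r}")
    num, desc, status, remaining, hours, lba = match.groups()
    return SelfTestEntry(
        num=int(num),
        description=" ".join(desc.split()),
        status=status,
        remaining_pct=int(remaining),
        lifetime_hours=int(hours),
        lba_of_first_error=int(lba) if lba.isdigit() else None,
    )


entry = parse_selftest_line(
    "# 1 Short offline Completed: read failure 90% 27132 1528286552"
)
# A status containing "failure" is worth acting on even when the
# overall-health line still says PASSED.
needs_attention = "failure" in entry.status.lower()
```

As far as I understand it, the overall-health result only reflects whether the drive considers its own attribute thresholds exceeded, so a PASSED there can coexist with a failed self-test like this one.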

What is the best thing to do at this point? Should I immediately run an extended SMART test? Replace the drive?

And, when it comes to replacing a drive, when said drive is still under warranty, what is the best, safest, and most cost efficient way to replace it with a new one? I have a few ideas:
  • Submit an RMA and wait for a replacement drive. My fears with this method are that my pool will go for an extended period of time with a missing drive. The last time I sent a drive for RMA, it took over a month for the manufacturer to finally ship the replacement. I would imagine that this is the cheapest but also the least safe method.
  • Same as the above, but buy a replacement drive right away for quick insertion into the pool, then keep the RMA drive as a spare once it arrives.
  • Keep a spare drive on hand and swap it in as soon as a drive starts to show faults. Then use the RMA drive as the new spare once it arrives. The only thing I don't like about this method is that the spare's warranty runs down while it sits unused. If money wasn't a concern, this would be my go-to method.
  • Same as the above, but instead of waiting for the RMA replacement to arrive and act as the new spare, I'd purchase a new spare immediately after putting the old spare into the pool. Essentially, I'd always have one spare on hand, with a gap of a few days at most.
I'm just curious as to what everyone's suggestions are. I have a cloud backup so I'm not worried about losing important data. I'm just trying to avoid a situation in which I'd need to download 30 terabytes of backups from the cloud.
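To put a rough number on that last point, here is a small sketch of how long a 30 TB restore would take at a sustained download rate. The link speeds are hypothetical placeholders, and a real cloud restore is usually slower than line rate, so treat the results as a lower bound.

```python
def download_days(size_tb: float, link_mbit_per_s: float) -> float:
    """Days needed to download size_tb (decimal terabytes) at a sustained rate."""
    bits = size_tb * 1e12 * 8                  # terabytes -> bits
    seconds = bits / (link_mbit_per_s * 1e6)   # Mbit/s -> bit/s
    return seconds / 86_400                    # seconds -> days


if __name__ == "__main__":
    # Assumed link speeds, for illustration only.
    for mbit in (100, 1000):
        print(f"{mbit} Mbit/s: {download_days(30, mbit):.1f} days")
```

At an assumed 100 Mbit/s this comes out to roughly four weeks of continuous downloading, which is why a local resilver is so much more attractive than a full restore.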
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Do a long smart test now. And please post the results in [CODE][/CODE].
Going for the RMA is a must, with RAIDZ2 you are safe enough to wait a month imho.

But if you are paranoid and have Amazon Prime, you can just buy a replacement and use it for 3 weeks, then you can send it back to Amazon and get a refund (Prime allows you to send back any item for whatever reason within one month from the purchase, at least in Europe).

Edit: also, take a look at the multi_report script in my signature: it makes keeping track of your drives' health easy!
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Last time I did returns with WD and with Seagate, they'd let you do an advance RMA for a nominal fee (not more than US$20)--they'd ship you the replacement drive, so you could burn it in, make sure it was OK, and resilver it into your pool, then ship back the old one. Worth looking into, IMO.
 
Joined
Dec 26, 2021
Messages
20
Do a long smart test now. And please post the results in [CODE][/CODE].
Going for the RMA is a must, with RAIDZ2 you are safe enough to wait a month imho.

But if you are paranoid and have Amazon Prime, you can just buy a replacement and use it for 3 weeks, then you can send it back to Amazon and get a refund (Prime allows you to send back any item for whatever reason within one month from the purchase, at least in Europe).

Edit: also, take a look at the multi_report script in my signature: it makes keeping track of your drives' health easy!
That script is awesome! I was actually looking into something just a few weeks ago that would send me email updates about SMART data. You saved me there! As for the long smart test, I would follow through with your advice but I actually have 8 new hard drives arriving tomorrow so that I can do an upgrade. This failing drive decided to fail at the most convenient of times (I purchased the new drives last week). Should I take your advice and apply it as wise common practice to always run a long SMART test in the event of both a scrub and short test reporting read/write errors?

Last time I did returns with WD and with Seagate, they'd let you do an advance RMA for a nominal fee (not more than US$20)--they'd ship you the replacement drive, so you could burn it in, make sure it was OK, and resilver it into your pool, then ship back the old one. Worth looking into, IMO.
That is very much worth looking into. It seems like such an option should be more commonplace. Thank you for your insight!
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Should I take your advice and apply it as wise common practice to always run a long SMART test in the event of both a scrub and short test reporting read/write errors?
If you suspect your drives might have an issue, always run a long SMART test; the short one basically checks just the electronics.
Also, make sure to schedule frequent smart tests: personally, I do one short test every day and one long test every week.

That script is awesome!
I agree!
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
  • Keep a spare drive on hand and swap it in as soon as a drive starts to show faults. Then use the RMA drive as the new spare once it arrives. The only thing I don't like about this method is that the spare's warranty runs down while it sits unused. If money wasn't a concern, this would be my go-to method.
That is what I have been doing for years. It especially allowed me to burn-in the drive in advance. Yes, I have about 300 Euros lying around. But if I have enough money to go for RAIDZ2, because I value my data, then an additional drive should be ok from a budget point of view.

At the end of the day it is a personal decision how big a problem a data loss would be for you.
 
Joined
Dec 26, 2021
Messages
20
Also, make sure to schedule frequent smart tests: personally, I do one short test every day and one long test every week.
Do you not worry about grinding your drives’ health away with such frequent tests? I read somewhere that someone recommended three short SMART tests per month, one long test per month, and two scrubs per month, and that’s what I’ve been doing ever since I’ve had my server. I saw in your signature that your setup has 2x 3TB drives, so I would imagine that such frequent long tests don’t take very long. However, the last time a long test ran on my 6 TB drives, the faulty one took over 16 hours to finish. On top of that, my new drives that should be arriving today are 20 TB drives. Unless everyone strongly suggests otherwise, I’ll stick with my current schedule, mainly due to the time requirement and added stress imposed on the drives.
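A back-of-the-envelope sketch of why long tests take so much longer on bigger drives: a long test reads every sector, so its duration is roughly capacity divided by average sequential read speed. The throughput figures below are assumed placeholders, not measurements from these drives. The drive's own estimate, which `smartctl -c` reports as the recommended polling time for the extended self-test, is the better number to trust.

```python
def full_read_hours(capacity_tb: float, avg_mb_per_s: float) -> float:
    """Hours needed to read a whole drive once at an average sequential rate."""
    total_bytes = capacity_tb * 1e12            # decimal terabytes -> bytes
    seconds = total_bytes / (avg_mb_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed average read speeds, for illustration only.
    print(f"6 TB  @ 150 MB/s: {full_read_hours(6, 150):.1f} h")
    print(f"20 TB @ 200 MB/s: {full_read_hours(20, 200):.1f} h")
```

Under these assumptions a healthy 6 TB drive needs around 11 hours and a 20 TB drive more than a day, so a drive that retries failing sectors can easily take longer than the estimate.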

That is what I have been doing for years. It especially allowed me to burn-in the drive in advance. Yes, I have about 300 Euros lying around. But if I have enough money to go for RAIDZ2, because I value my data, then an additional drive should be ok from a budget point of view.

At the end of the day it is a personal decision how big a problem a data loss would be for you.
This method makes the most sense to me. I’ll probably start doing this if the manufacturer of my new drives is unable to do an advance RMA.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
That is what I have been doing for years. It especially allowed me to burn-in the drive in advance.
I agree - I have two servers, 12 drives - two burned-in drives on the shelf for instant replacement capability - if one is used, immediately order a replacement and burn it in on arrival, registering it for warranty once burn-in is completed successfully. Drives must be considered expendable, my data is not so.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Do you not worry about grinding your drives’ health away with such frequent tests?
I'm not @Davvo, but I have the same SMART test schedule, and no, I don't worry about that at all; that's what they're made for. And since some of my drives have 70k hours in service with no errors, I think my lack of concern in this regard is borne out.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Do you not worry about grinding your drives’ health away with such frequent tests? I read somewhere that someone recommended three short SMART tests per month, one long test per month, and two scrubs per month, and that’s what I’ve been doing ever since I’ve had my server. I saw in your signature that your setup has 2x 3TB drives, so I would imagine that such frequent long tests don’t take very long. However, the last time a long test ran on my 6 TB drives, the faulty one took over 16 hours to finish. On top of that, my new drives that should be arriving today are 20 TB drives. Unless everyone strongly suggests otherwise, I’ll stick with my current schedule, mainly due to the time requirement and added stress imposed on the drives.
In a short SMART test the controller tests its own electronics (like the internal cache, the read/write circuits or the head electronics) and its mechanical properties, and does a quick read test of a small area of the platter(s).
In a long test the read check is executed on all the sectors of the platters, in addition to every check that's done in the short test.
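For reference, both test types can be started from the shell with `smartctl -t short` or `smartctl -t long`, and the results read back with `smartctl -l selftest`. A minimal Python sketch that builds those commands is below; the device path `/dev/ada0` and the helper names are hypothetical, and the `run` helper has to be called as root on the NAS to actually do anything.

```python
import subprocess
from typing import List

VALID_TESTS = ("short", "long")


def selftest_command(device: str, test: str) -> List[str]:
    """Build the smartctl command that starts a short or long self-test."""
    if test not in VALID_TESTS:
        raise ValueError(f"test must be one of {VALID_TESTS}, got {test!r}")
    return ["smartctl", "-t", test, device]


def selftest_log_command(device: str) -> List[str]:
    """Build the smartctl command that prints the self-test log."""
    return ["smartctl", "-l", "selftest", device]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout.

    smartctl encodes status bits in its exit code, so a non-zero
    exit status is not treated as an error here.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    device = "/dev/ada0"  # hypothetical device path, adjust for your system
    # Only print the commands; call run(...) yourself to execute them.
    print(" ".join(selftest_command(device, "long")))
    print(" ".join(selftest_log_command(device)))
```

The test runs inside the drive in the background, so the first command returns immediately and the log has to be checked again once the test has finished.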

The long test is the really important one, which is why I have the script send its report the same day the long test runs. I could probably do without the short tests, but since they are really quick (under 2 minutes) I just run them every day, with the bonus that, since I have set up SMART alerts in the NAS, I should immediately receive an email alert when one of the short tests fails (I haven't had a chance to verify that this works).

I like to keep track of the health of my drives, and every week I scrutinize each parameter even if the SMART test says "trust me bro, it's all OK!". I am probably a bit too meticulous, but I want to know asap if something is wrong with my disks (I don't have any spare ready, hot or cold). I mean, I run scrubs every 14 days... which I admit could be a bit excessive.

My drives only have 3700 hours of power-on time, so I don't have any real first-hand data to support my point; I copied the schedule of @danb35 (and quite a few other people, I believe), which is backed by thousands of hours of power-on time.
 
Joined
Dec 26, 2021
Messages
20
I'm not @Davvo, but I have the same SMART test schedule, and no, I don't worry about that at all; that's what they're made for. And since some of my drives have 70k hours in service with no errors, I think my lack of concern in this regard is borne out.
That is a good point. Now, just out of curiosity, when running RAIDZ2, what risks do I run by not running the SMART tests as frequently? Does the added frequency just point out a failing drive earlier? If a drive fails on the first of the month but my long SMART test doesn't catch it until the 15th, what's the worst that could happen?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
That is a good point. Now, just out of curiosity, when running RAIDZ2, what risks do I run by not running the SMART tests as frequently? Does the added frequency just point out a failing drive earlier? If a drive fails on the first of the month but my long SMART test doesn't catch it until the 15th, what's the worst that could happen?
Worst thing that can happen is to suddenly see your pool not importable because 3 or more drives died.
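To make that concrete: a RAIDZ2 vdev keeps working with up to two failed drives and is lost when a third one fails. The longer a failure goes unnoticed, the longer the pool runs with reduced redundancy. Here is a small sketch of both ideas. The function names are made up for illustration, and the detection window simply assumes a failure is only noticed at the next scheduled long test, which ignores errors that ZFS itself reports during scrubs and normal reads.

```python
from typing import List, Optional

# Number of failed drives each RAIDZ level tolerates per vdev.
RAIDZ_PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def vdev_survives(level: str, failed_drives: int) -> bool:
    """True while the number of failed drives does not exceed the parity."""
    return failed_drives <= RAIDZ_PARITY[level]


def days_undetected(failure_day: int, test_days: List[int]) -> Optional[int]:
    """Days between a failure and the next scheduled test in the same month.

    Returns None when no test is scheduled on or after the failure day.
    """
    later = [day - failure_day for day in test_days if day >= failure_day]
    return min(later) if later else None


if __name__ == "__main__":
    print(vdev_survives("raidz2", 2))            # two failures are tolerated
    print(vdev_survives("raidz2", 3))            # a third failure is not
    print(days_undetected(1, [15]))              # monthly long test on the 15th
    print(days_undetected(1, [7, 14, 21, 28]))   # weekly long tests
```

With the example from the question (failure on the 1st, long test on the 15th) the drive goes 14 days undetected, versus 6 days with a weekly schedule; that gap is extra time in which further failures eat into the remaining parity.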
 
Joined
Dec 26, 2021
Messages
20
Worst thing that can happen is to suddenly see your pool not importable because 3 or more drives died.
Okay gotcha. Thank you everyone for your feedback! I will take everything into account to improve my preventative maintenance habits.
 