"Currently unreadable (pending) sectors." Fix Process Verification

Nalith · Aug 31, 2022

Hi All

I recently started receiving the following alerts for one of my disks:

-------------------------------------------------------------------------------------
TrueNAS @ truenas

Current alerts:

Device: /dev/sdc [SAT], 1 Currently unreadable (pending) sectors

-------------------------------------------------------------------------------------

Ask:

Based on my research, this seems to be a potential bad sector developing on my drive, and it will only be reallocated on an attempted write. I am looking at performing the following steps on my drive. Could someone please validate whether this is correct and/or whether there is a better way of addressing this issue?

1. Offline the disk in the GUI
2. sysctl kern.geom.debugflags=0x10 (To enable raw write mode to the disk)
3. dd if=/dev/sdc of=/dev/sdc bs=1m
4. sysctl kern.geom.debugflags=0x00
5. Online disk in the GUI
6. Run a scrub from the GUI

I have a hot spare. Would it make more sense to just swap this disk out with the hot spare and then swap it back when done? And if so, if I swap an 8TB disk into the existing 4TB disk's place, would I be able to swap back to the 4TB (since my vdev would not have expanded)?

Below is some additional info of my configuration:
-------------------------------------------------------------------------------------

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.131+truenas] (local build)

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Gold
Device Model:     WDC WD4002FYYZ-01B7CB1
Serial Number:    K7J03P7T
LU WWN Device Id: 5 000cca 269dc3cd7
Firmware Version: 01.01M03
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug 31 14:23:23 2022 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 113) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 571) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       108
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3493
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       15
  9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       30819
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       256
192 Power-Off_Retract_Count 0x0032   094   094   000    Old_age   Always       -       7586
193 Load_Cycle_Count        0x0012   094   094   000    Old_age   Always       -       7586
194 Temperature_Celsius     0x0002   133   133   000    Old_age   Always       -       45 (Min/Max 14/55)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     30805         -
# 2  Short offline       Completed without error       00%     30782         -
# 3  Short offline       Completed without error       00%     30758         -
# 4  Short offline       Completed without error       00%     30734         -
# 5  Short offline       Completed without error       00%     30710         -
# 6  Short offline       Completed without error       00%     30686         -
# 7  Short offline       Completed without error       00%     30662         -
# 8  Short offline       Completed without error       00%     30638         -
# 9  Short offline       Completed without error       00%     30614         -
#10  Extended offline    Completed without error       00%     30603         -
#11  Short offline       Completed without error       00%     30590         -
#12  Short offline       Completed without error       00%     30566         -
#13  Short offline       Completed without error       00%     30542         -
#14  Short offline       Completed without error       00%     30518         -
#15  Short offline       Completed without error       00%     30494         -
#16  Short offline       Completed without error       00%     30470         -
#17  Short offline       Completed without error       00%     30446         -
#18  Short offline       Completed without error       00%     30422         -
#19  Short offline       Completed without error       00%     30398         -
#20  Short offline       Completed without error       00%     30374         -
#21  Short offline       Completed without error       00%     30350         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):

-------------------------------------------------------------------------------------

  pool: Pool-01
 state: ONLINE
  scan: resilvered 13.4M in 00:00:02 with 0 errors on Wed Aug 31 13:37:43 2022
config:

    NAME                                      STATE     READ WRITE CKSUM
    Pool-01                                   ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        b839e106-e667-447e-885a-8f3775bd88ed  ONLINE       0     0     0
        1623e5ee-f85c-4476-bf91-0450c8f7064b  ONLINE       0     0     0
        301af189-a436-4119-90ea-e6d07933c9fe  ONLINE       0     0     0
        a045e608-5d1d-4c90-9128-6fb0dbb0518d  ONLINE       0     0     0
        6031d723-49bc-4ec7-bf24-88330c616d70  ONLINE       0     0     0
      raidz1-1                                ONLINE       0     0     0
        cdd682e4-34f8-48de-b5da-db9829a5474f  ONLINE       0     0     0
        f3adac58-f47b-4ec3-ad39-0651fa9cfadf  ONLINE       0     0     0
        8239d2e9-8895-4331-aecb-20cdfbc29190  ONLINE       0     0     0
        d3298cc5-e30d-40c8-8166-ec23e21b24fc  ONLINE       0     0     0
        043fa47b-fb66-45af-9f72-ed79287a2237  ONLINE       0     0     0
      raidz1-2                                ONLINE       0     0     0
        1843c3b6-6afd-4a20-bf0d-11b092c780f0  ONLINE       0     0     0
        326f2178-3d80-46e2-979e-5c665b56e493  ONLINE       0     0     0
        c35614f3-d614-4ff4-b3b0-8d447ae1395b  ONLINE       0     0     0
        66aff8ba-24f3-44ab-bebd-9521e8db8a6a  ONLINE       0     0     0
        999426ca-24c9-4179-9645-a2b1a66dee06  ONLINE       0     0     0
    cache
      464b2a29-9fa3-44a1-8560-8719e08d7360    ONLINE       0     0     0
    spares
      b09d9cb1-4b7d-45c6-b4e8-ca029b4b2920    AVAIL

errors: No known data errors

-------------------------------------------------------------------------------------

Thanks in advance!

souporman · Aug 31, 2022

Hey brother, I've been in your shoes many times before, personally and professionally. Yes, those commands you Googled are technically "the way to make that error message go away", but that doesn't mean they fix anything. Your disk has a number of spare blocks set aside for reallocation in the event of a bad sector like this. How many are there? I dunno. More than 1 and less than a bajillion. It doesn't really matter. Your disk is telling you that it's got a problem. If this is data I care about, I replace the disk, send this one off for an RMA and lose no sleep. If your existing disk has no warranty, and you don't really want to replace it for some reason, then keep a close eye on the usual suspect SMART attributes: "Current_Pending_Sector", "Offline_Uncorrectable", etc. If they start going UP, then that disk is dying even faster.
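
For anyone who wants to automate "keeping a close eye" on those attributes, here is a minimal Python sketch. It only parses text that looks like the attribute table pasted earlier in this thread; the SAMPLE string is copied from that output, while the WATCHED set and the function names are my own illustrative choices, not anything TrueNAS or smartctl provides.

```python
# Attribute IDs worth watching, taken from the smartctl table in this thread.
WATCHED = {5, 197, 198}

# A few rows copied from the attribute table posted above.
SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
194 Temperature_Celsius 0x0002 133 133 000 Old_age Always - 45 (Min/Max 14/55)
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
"""


def watched_raw_values(smart_text):
    """Return {attribute_name: raw_value} for the watched IDs found in the text."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED and fields[9].isdigit():
            found[fields[1]] = int(fields[9])
    return found


def nonzero(values):
    """Keep only the attributes whose raw value is above zero."""
    return {name: raw for name, raw in values.items() if raw > 0}


if __name__ == "__main__":
    values = watched_raw_values(SAMPLE)
    print("watched:", values)
    print("needs attention:", nonzero(values))
```

Comparing the result against yesterday's values, rather than just against zero, would show whether the counters are actually climbing.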

I didn't really feel like trying to figure out what your pool layout is and such (you gotta work on that formatting) so I can't comment on anything you pasted below your questions, but as for your question about replacing the 4TB disk with an existing 8TB disk... sure. Go nuts. You won't see the extra 4TB of space until you replace all your disks in that VDEV with 8TB, but it'll work fine, and maybe you never replace the remaining disks with 8TB disks and it would still be fine.
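
To put rough numbers on that, here is a small Python sketch of the usual rule of thumb: a RAIDZ vdev treats every member as if it were the size of its smallest disk. The function name is made up for illustration, and the result ignores padding, metadata and the TB/TiB difference, so treat it as a ballpark only.

```python
def raidz_usable_tb(disk_sizes_tb, parity=1):
    """Rough usable capacity of one RAIDZ vdev, limited by its smallest disk."""
    if len(disk_sizes_tb) <= parity:
        raise ValueError("need more disks than parity devices")
    return (len(disk_sizes_tb) - parity) * min(disk_sizes_tb)


print(raidz_usable_tb([4, 4, 4, 4, 4]))  # five 4TB disks -> 16
print(raidz_usable_tb([8, 4, 4, 4, 4]))  # one disk swapped for 8TB -> still 16
print(raidz_usable_tb([8, 8, 8, 8, 8]))  # every disk replaced with 8TB -> 32
```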

If I were you, I'd just go grab a 4TB disk for like $60, or whatever, and replace your dying disk with it. But then again, I've got kids and a job and I'm pretty busy. I don't really want to have to add "make sure you check the SMART attributes to be sure the counters aren't going up" to my list of things I do every day. Heck, I have that data e-mailed to me daily as it is and I forget to look at it more than once a week or so.

EDIT: I lied and took a cursory glance at your pool structure. You are in danger, brother. I hope you have the capability to back up your data elsewhere. When you replace this disk, you are going to need to resilver your pool. Depending on how full your pool is, this could take a long time. While this is happening, if any of the other 4 disks in that VDEV suddenly develops an Uncorrectable Error like this one did, instead of getting a little message about it, you'll probably be looking at a complete loss of your entire data pool. You want to know something worse than that? The Uncorrectable Read Error (URE) rate of the disks you're using (I'm assuming they are regular commodity SATA disks?) is about 1 in every 12TB of data read. This is really dumbing down how UREs work, but the TL;DR of the situation is that there's a pretty good chance your pool will run into another URE before it completes a resilver when you replace that disk. I would back up the pool now, then try to replace the disk and cross your fingers for the next few days. Then, I would figure out a way to either rebuild this pool as RAIDz2 (or mirrors) or get a new pool.
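
A back-of-the-envelope Python sketch of that URE arithmetic, under loud assumptions: it treats the spec-sheet figure as a real, independent per-bit error rate (real drives don't behave that neatly), it uses the "1 in every 12TB" figure, which matches the common 1-per-10^14-bits spec, and it assumes the resilver reads all four surviving 4TB disks in full, which is a worst case since ZFS only reads allocated data. The function name is hypothetical.

```python
import math


def p_at_least_one_ure(tb_read, bits_per_error):
    """Chance of at least one URE while reading tb_read terabytes (Poisson model)."""
    bits_read = tb_read * 1e12 * 8
    expected_errors = bits_read / bits_per_error
    # 1 - exp(-x), written with expm1 for accuracy when x is small.
    return -math.expm1(-expected_errors)


if __name__ == "__main__":
    # Four surviving 4TB disks read end to end = 16 TB (worst-case assumption).
    print(f"1 per 1e14 bits: {p_at_least_one_ure(16, 1e14):.0%}")
    # Same read against a hypothetical drive specced at 1 per 1e15 bits.
    print(f"1 per 1e15 bits: {p_at_least_one_ure(16, 1e15):.0%}")
```

The point is less the exact percentage and more how quickly the odds move with the amount of data read and the spec of the drive.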

Nalith · Sep 1, 2022

Thanks for the detailed reply, I appreciate you taking the time to look through the info provided (apologies for the awful formatting). I am considering moving my setup to 3 x 6-disk RAIDZ2 VDEVs, but that is going to require me to get some additional disks and possibly move to a proper external DAS (I'm considering 2 x DELL MD1200 with a supported HBA card). You are correct, all my disks are SATA.

joeschmuck · Sep 1, 2022

My recommendations:
1) Run a Scrub on your pool to verify you have no data corruption.
2) Run a SMART Extended/Long test on the suspect drive. I suspect it will not complete. If it fails to complete, it's time to change the drive.
3) Monitor the ID 5 value. When it changes to something other than zero, it's time to replace the drive. Right now you have "Pending" errors, which just means there was an error reading a sector but, after a few attempts, the drive read the data. If this happens many more times, the drive will map the sector out and increment the ID 5 value.
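
For those who prefer the shell over the GUI, the three recommendations above map roughly onto the commands built in this Python sketch. It assumes zpool and smartctl are on the PATH (smartctl 7.2 shows up in the output earlier in the thread), takes the pool name and device from the original post, and only prints the commands rather than running them; the function name is made up for illustration.

```python
import shlex

POOL = "Pool-01"     # pool name from the zpool status output in this thread
DEVICE = "/dev/sdc"  # device named in the alert in this thread


def recommended_commands(pool=POOL, device=DEVICE):
    """Return argv lists for the scrub, the long SMART test, and the attribute check."""
    return [
        ["zpool", "scrub", pool],            # 1) scrub the pool
        ["smartctl", "-t", "long", device],  # 2) start an extended self-test
        ["smartctl", "-A", device],          # 3) print the attribute table (watch ID 5)
    ]


if __name__ == "__main__":
    # Print only; run them yourself (as root) once you are happy with them.
    for argv in recommended_commands():
        print(shlex.join(argv))
```

Device names can change between reboots, so it is worth confirming that /dev/sdc is still the drive with the matching serial number before starting a test.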

The steps you posted above may be a correct way to rewrite the entire hard drive once, but that just refreshes the data. Look at my Hard Drive Troubleshooting Guide for more details on how to try to force mapping out a sector for a range. (link below) But that will not "fix" the problem. Odds are you have some kind of surface damage. On rare occasions, if the media flakes off, it could be limited to that one occurrence, but more often the media damage continues.

As for popping in an 8TB drive temporarily, well, I have never tried that. It sounds like it might work fine as long as the pool cannot expand. If you give it a try, let us know how it works out.

As for the spare in the system, I would use that drive, that is what it's for. I also noticed you have a cache. Makes me wonder why. Most people put it in because they feel it replaces RAM, but it certainly does not and often slows down the system if you don't have enough RAM to properly support it. That was just an observation.

Nalith · Sep 11, 2022

Thanks for the help.
