Impending disk failure: elements in the grown defect list

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Hi All,

I have some Seagate drives that have elements in the grown defect list.

The following tests show no problems:
  1. SMART short test
  2. SMART long test
  3. scrub
I am fairly certain a disk with any sectors in the grown defect list is bad. I assume I cannot see the factory defects with any SMART tool.

Is there a way TrueNAS can alert the administrator of impending disk failure?

Thanks,
Joe
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
If your question is whether a 100% reliable way exists to predict a disk failure within the next X hours, the answer is no. You do not provide any real information on what exactly your system reports. But in general any growing number of "defect" sectors, repeated reads, etc. is a heuristic for an upcoming death of the disk. So in that way you have your alert right now.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
I have no impending disk failure warnings from the SMART long or short test results. Here is an example of a disk I would pitch, with 3 grown defect list entries. TrueNAS has not told me to pitch this disk:
Code:
root@store4[~]# smartctl -x /dev/da36
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST33000650SS
Revision:             0004
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005670ae43
Serial number:        Z297DFxxx
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Oct 12 13:58:50 2022 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        68 C

Manufactured in week 16 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  187
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  189
Elements in grown defect list: 3

Vendor (Seagate Cache) information
  Blocks sent to initiator = 417485354
  Blocks received from initiator = 1295119611
  Blocks read from cache and sent to initiator = 539773485
  Number of read and write commands whose size <= segment size = 401902274
  Number of read and write commands whose size > segment size = 11356

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 64117.93
  number of minutes until next internal SMART test = 5

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1615530308        0         0  1615530308          0     385042.822           0
write:         0        0         0         0          0      23242.169           0

Non-medium error count:      121


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64092                 - [-   -    -]
# 2  Background short  Completed                   -   64080                 - [-   -    -]
# 3  Background short  Completed                   -   64068                 - [-   -    -]
# 4  Background short  Completed                   -   63995                 - [-   -    -]
# 5  Background short  Completed                   -   63971                 - [-   -    -]
# 6  Background long   Completed                   -   63966                 - [-   -    -]
# 7  Background short  Completed                   -   63922                 - [-   -    -]
# 8  Background short  Completed                   -   63898                 - [-   -    -]


Long (extended) Self-test duration: 27600 seconds [460.0 minutes]

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 64117:56 [3847076 minutes]
    Number of background scans performed: 894,  scan progress: 79.53%
    Number of background medium scans performed: 894

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 52407:47  0000000156a52d53  [1,18,7]   Recovered via rewrite in-place
   2 52407:47  0000000156d2b50f  [1,18,7]   Recovered via rewrite in-place
   3 52407:47  0000000156d2f7c3  [1,18,7]   Recovered via rewrite in-place
 671 52433:52  0000000158996b10  [1,18,7]   Recovered via rewrite in-place
 >>>> log truncated, fetched 16124 of 49172 available bytes

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005670ae41
    attached SAS address = 0x500143802633353f
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: power on
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005670ae42
    attached SAS address = 0x500143802633353d
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0

root@store4[~]#
 
Last edited by a moderator:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
This drive has 3 defects. This is not terrible. You also have several sectors that were corrected by rewriting the data in the same location, thus refreshing it as well. With over 64,000 hours (about 7.3 years) on the drive, it could be near the end of its life.

To your question of whether you can predict when a drive will fail: SMART was designed in an attempt to provide at least a 24-hour warning that a drive failure may be pending. You have received this warning, and the drive could fail at any time now. It's not the most reliable system, but it's what we have. The thing SMART cannot predict is motor problems or other mechanical failures. It has some values that try to predict that, but sudden failures do happen, for example when the spindle motor won't keep spinning, if it spins at all.

For myself, I would monitor the number of defects and, if they start going up, replace the drive. If the defect list goes above 5, I'd replace the drive.
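As a sketch of how that rule could be automated (the device name, the threshold, and the sample smartctl line are assumptions; in real use you would capture `smartctl -x /dev/daX` output):

```shell
#!/bin/sh
# Hypothetical check: warn when the grown defect list exceeds a threshold.
THRESHOLD=5

# In real use: SMART_OUT=$(smartctl -x /dev/da36)
# A sample line stands in here so the sketch is self-contained:
SMART_OUT='Elements in grown defect list: 3'

# Pull the number after the colon from the matching line.
defects=$(printf '%s\n' "$SMART_OUT" | awk -F': *' '/grown defect list/ {print $2}')

if [ "${defects:-0}" -gt "$THRESHOLD" ]; then
    echo "REPLACE: $defects entries in grown defect list"
else
    echo "OK: $defects entries in grown defect list"
fi
```
With the sample value of 3 and a threshold of 5, this prints the OK branch; run from cron, the REPLACE branch could instead send mail.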

The decision is yours. I fixed your posting to include CODE tags.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
This drive has 3 defects. This is not terrible. You also have several sectors that were corrected by rewriting the data in the same location, thus refreshing it as well. With over 64,000 hours (about 7.3 years) on the drive, it could be near the end of its life.

To your question of whether you can predict when a drive will fail: SMART was designed in an attempt to provide at least a 24-hour warning that a drive failure may be pending. You have received this warning, and the drive could fail at any time now. It's not the most reliable system, but it's what we have. The thing SMART cannot predict is motor problems or other mechanical failures. It has some values that try to predict that, but sudden failures do happen, for example when the spindle motor won't keep spinning, if it spins at all.

For myself, I would monitor the number of defects and, if they start going up, replace the drive. If the defect list goes above 5, I'd replace the drive.

The decision is yours. I fixed your posting to include CODE tags.

Hi Joe,

Thanks for fixing my CODE tag, I will eventually remember to do that every time. :smile:

I had to truncate that log, as the "rewrite in-place" recovery occurred 671 times. It looks like two things need to be charted with Nagios:
  1. Number of rewrites in place per day
  2. Number of elements in the grown defect list
If I see those two metrics continue to climb, I should replace the drive. This seems like a lot of work to do weekly to find drives that are about to fail.
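A sketch of how those two metrics could be pulled out for charting (the sample text stands in for real `smartctl -x` output, and the perfdata line is just one hypothetical Nagios-style format):

```shell
#!/bin/sh
# Sketch: extract the two metrics to chart from smartctl -x output.
# Sample text is used so the sketch runs as-is; in real use:
#   SMART_OUT=$(smartctl -x /dev/da36)
SMART_OUT='Elements in grown defect list: 3
   1 52407:47  0000000156a52d53  [1,18,7]   Recovered via rewrite in-place
   2 52407:47  0000000156d2b50f  [1,18,7]   Recovered via rewrite in-place'

# Grown defect count: the number after the colon on the matching line.
grown=$(printf '%s\n' "$SMART_OUT" | awk -F': *' '/Elements in grown defect list/ {print $2}')
# Rewrite-in-place count: number of matching lines in the background scan log.
rewrites=$(printf '%s\n' "$SMART_OUT" | grep -c 'Recovered via rewrite in-place')

# Nagios-style perfdata: chart these daily and alert when they climb.
echo "OK | grown_defects=$grown rewrites_in_place=$rewrites"
```
On the sample text this reports 3 grown defects and 2 rewrite-in-place events; against a real drive, the counts come from whatever portion of the log smartctl returns.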

Thanks,
Joe
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Hi Joe,
With respect to monitoring the drives each week being a chore, you could try out the script I've been updating in the Resources section, called Multi-Report. It generates an email and presents a nice little chart of all your data. It also has alarm features. BUT, and this is a big but, I have very limited SAS drive support. Give the script a try. If it doesn't report the drive data needed, then toss me a message via "Conversations". I will see if I can add proper decoding of your drive data to the script.

Take care,
Joe
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Hi Joe,
With respect to monitoring the drives each week being a chore, you could try out the script I've been updating in the Resources section, called Multi-Report. It generates an email and presents a nice little chart of all your data. It also has alarm features. BUT, and this is a big but, I have very limited SAS drive support. Give the script a try. If it doesn't report the drive data needed, then toss me a message via "Conversations". I will see if I can add proper decoding of your drive data to the script.

Take care,
Joe
I have downloaded the script and I get an error when running it:

Code:
store254# ./multi_report.sh
./multi_report.sh: line 250: .
pool_capacity=zfs       # Select zfs or zpool for Zpool Status Report - Pool Size and Free Space capacities. zfs is default.

# Ignore or Activate Alarms
ignoreUDMA=false        # Set to true to ignore all UltraDMA CRC Errors for the summary alarm (Email Header) only, errors will appear in the graphical chart.
ignoreSeekError=true    # Set to true to ignore all Seek Error Rate/Health errors.  Default is true.
ignoreReadError=true    # Set to true to ignore all Seek Error Rate/Health errors.  Default is true.
ignoreMultiZone=false   # Set to true to ignore all MultiZone Errors. Default is false.
disableWarranty=true    # Set to true: File name too long
Multi-Report v1.6d-2 dtd:2022-10-09 (TrueNAS Core 12.0-U8)
No Config File Exists
Checking for a valid email within the script...
Valid email within the script = , using script parameters...


 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
I'm going to go out on a limb and say that you likely incorrectly edited the script, maybe adding your email address or changing something else, and you didn't keep the exact format required (like quotation marks around a value). Rather than trying to figure out what went wrong, place a clean copy of the script on the pool, then run ./multi_report.sh -config and select "N" for a new configuration file. Answer the questions for your email address, and once done, a clean configuration file will be generated. Next, run the script normally (./multi_report.sh) and it should work fine without any errors.

Let me know if that fixes it. Also, you can customize the most-used parameters by using -config and selecting the Advanced option. Make sure you select Write the file, or none of your changes will take effect. When using the external configuration file, any script updates in the future will be a simple replacement of the script file, and no further configuration is required. I'm trying to make it simple.

I will be posting a new version soon. I've made some changes to sort the pool names and drive IDs, added another hard drive parameter for helium, and just little things like that. I need to finish cleaning up the test code I use to test it out, and run it on SCALE, before I can post it. So maybe in a few days.

Good Luck
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Well, I am filling one of my TrueNAS filers with dd data to 79% capacity, and I am seeing some disks take nosedives during the write process. The metric from smartctl -x that is rapidly incrementing is "Total errors corrected" on the read row. So even though these drives pass the SMART long and short tests and scrubs, they are hurting performance.


"total errors corrected" in a 30 minute span : 2,749,244

Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 585247624 1 0 585247625 1 2005017.224 0 write: 0 0 0 0 0 157628.574 0 verify: 2347400210 0 0 2347400210 1 575072.209 1
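The rate behind that number can be computed from two snapshots of the error counter log taken 30 minutes apart; a minimal sketch with stand-in values (the "after" value is the read row's "Total errors corrected" from the log here; the "before" value is an assumed earlier sample):

```shell
#!/bin/sh
# Sketch: delta of 'Total errors corrected' (read row) between two samples.
# In real use, both values would come from parsing smartctl -x output twice.
before=582498381   # assumed earlier sample
after=585247625    # read row, 'Total errors corrected', later sample

delta=$((after - before))
echo "corrected read errors in 30 minutes: $delta"   # prints 2749244
```
Charting this delta per interval, rather than the raw lifetime counter, is what makes a sudden spike like this visible.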

[attached screenshot: 1669760306239.png]
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
What command are you using to run the "dd"? Also, please provide the entire SMART output for the drive so we can see everything.

Thanks.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
What command are you using to run the "dd"? Also, please provide the entire SMART output for the drive so we can see everything.

Thanks.
It was the other member of the mirror (da20) that had more read errors (318 million) from 9:07 AM to 5:30 PM today.

Code:
dd if=/dev/random of=/mnt/vold/benchmark/dataRandFile.dd bs=1024k count=102400
for i in {100..620}; do cp -v /mnt/vold/benchmark/dataRandFile.dd "/mnt/vol53b/test/dataRandFile_$i.dd" ; done

smartctl -x results:
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST33000650SS
Revision:             RS17
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005580c8f7
Serial number:        Z294BXYZ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Nov 29 17:24:06 2022 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        68 C

Manufactured in week 39 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  104
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1544
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 418587194
  Blocks received from initiator = 1469781713
  Blocks read from cache and sent to initiator = 2406111958
  Number of read and write commands whose size <= segment size = 1913627728
  Number of read and write commands whose size > segment size = 527519

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 69398.90
  number of minutes until next internal SMART test = 55

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2812573004        0         0  2812573004          0     353438.300           0
write:           0        0         0           0          0     333095.254           0
verify: 4095301962        0         0  4095301962          0     576468.932           0

Non-medium error count:    73927

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32   65535                 - [-   -    -]
# 2  Background short  Completed                  32   65535                 - [-   -    -]
# 3  Background long   Completed                  32   65535                 - [-   -    -]
# 4  Background short  Completed                  32   65535                 - [-   -    -]
# 5  Background short  Completed                  32   65535                 - [-   -    -]
# 6  Background short  Completed                  32   65535                 - [-   -    -]
# 7  Background short  Completed                  32   65535                 - [-   -    -]
# 8  Background short  Completed                  32   65535                 - [-   -    -]
# 9  Background short  Completed                  32   65535                 - [-   -    -]
#10  Background short  Completed                  32   65535                 - [-   -    -]
#11  Background short  Completed                  32   65535                 - [-   -    -]
#12  Background short  Completed                  32   65535                 - [-   -    -]
#13  Background long   Completed                  32   65535                 - [-   -    -]
#14  Background short  Completed                  32   65535                 - [-   -    -]
#15  Background short  Completed                  32   65535                 - [-   -    -]
#16  Background short  Completed                  32   65535                 - [-   -    -]
#17  Background short  Completed                  32   65535                 - [-   -    -]
#18  Background long   Completed                  32   65535                 - [-   -    -]
#19  Background short  Completed                  32   65535                 - [-   -    -]
#20  Background short  Completed                  32   65535                 - [-   -    -]

Long (extended) Self-test duration: 27600 seconds [460.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 69398:54 [4163934 minutes]
    Number of background scans performed: 243,  scan progress: 0.00%
    Number of background medium scans performed: 243

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 57604:27  00000000200220a0  [1,18,7]   Recovered via rewrite in-place
   2 57604:30  0000000022b4a83f  [1,18,7]   Recovered via rewrite in-place
   3 57604:32  000000002491473f  [1,18,7]   Recovered via rewrite in-place
   4 57604:35  0000000026bac36d  [1,18,7]   Recovered via rewrite in-place
   5 57604:36  0000000027cc589a  [1,18,7]   Recovered via rewrite in-place
   6 57604:41  000000002bdc6d0c  [1,18,7]   Recovered via rewrite in-place
   7 57604:41  000000002c06952d  [1,18,7]   Recovered via rewrite in-place
   8 57604:41  000000002c5d9598  [1,18,7]   Recovered via rewrite in-place
   9 57604:42  000000002c5eb7f9  [1,18,7]   Recovered via rewrite in-place
  10 57604:42  000000002d14c6ea  [1,18,7]   Recovered via rewrite in-place
 ...
 663 43364:42  0000000004c21ad5  [1,18,7]   Recovered via rewrite in-place
 664 43365:04  00000000614206fd  [1,18,7]   Recovered via rewrite in-place
 665 43365:43  000000015c16006f  [1,18,7]   Recovered via rewrite in-place
 666 43367:44  0000000033e22af2  [1,18,7]   Recovered via rewrite in-place
 667 43369:45  0000000004b21b43  [1,18,7]   Recovered via rewrite in-place
 668 43373:55  00000000007a2580  [1,18,7]   Recovered via rewrite in-place
 669 43374:15  000000006cb60696  [1,18,7]   Recovered via rewrite in-place
 670 43379:07  000000000502061c  [1,18,7]   Recovered via rewrite in-place
 671 43379:19  00000000027a1a27  [1,18,7]   Recovered via rewrite in-place
 >>>> log truncated, fetched 16124 of 49172 available bytes

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005580c8f5
    attached SAS address = 0x5001438023692ebf
    attached phy identifier = 4
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: power on
    reason: power on
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005580c8f6
    attached SAS address = 0x5001438023692ebd
    attached phy identifier = 4
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Unfortunately, I've got nothing solid. It does not appear the drive is failing; maybe the read/write errors are from the drive's read/write buffers, where the drive reads more data anticipating the next piece will be requested, but then it's not.

The good indicators I see are: everything was correctable, and there are no grown defect list entries. So I think the drive is doing well. Run a single pass of badblocks (all of its test patterns) and see what the results are. If there are no errors after that, I would chalk the errors up to the read/write buffers.
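That suggestion can be sketched as a tiny helper that assembles the badblocks command line; everything here is an example (the device name is hypothetical, and -w is destructive, so it must only be run on a drive whose contents you can afford to lose):

```shell
#!/bin/sh
# Sketch: assemble a destructive badblocks run for a suspect drive.
# /dev/da36 is an example device name, not a recommendation.
# -w writes and verifies badblocks' test patterns (DESTROYS all data on
# the device), -s shows progress, -v reports any bad blocks found.
DEVICE=/dev/da36
CMD="badblocks -wsv $DEVICE"

# Print the command instead of running it, as a safety measure:
echo "Would run: $CMD"
# To actually run it (wipes the drive!): eval "$CMD"
```
Printing rather than executing is deliberate here: on a pool member, the drive must first be offlined and removed from the pool before a write-mode badblocks pass.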
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Unfortunately, I've got nothing solid. It does not appear the drive is failing; maybe the read/write errors are from the drive's read/write buffers, where the drive reads more data anticipating the next piece will be requested, but then it's not.

The good indicators I see are: everything was correctable, and there are no grown defect list entries. So I think the drive is doing well. Run a single pass of badblocks (all of its test patterns) and see what the results are. If there are no errors after that, I would chalk the errors up to the read/write buffers.
I will run badblocks on the two suspect drives, as well as fill them from the NVMe disk with that dd file. If I see a single disk slow down, it could be due to a weak area on the disk.

I am trying to read this old blog post explaining ZFS end-to-end checksums: https://hypercritical.co/fatbits/2005/12/09/zfs-data-integrity-explained
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
I'm going to go out on a limb and say that you likely incorrectly edited the script

I think I messed it up by trying to paste the contents into a new file with vi. I have copied the file to the TrueNAS filer, ran -config, and am now running it on one of my filers.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Version 1.6f is out, and there have been a lot of changes and features added, mainly for those folks who like to customize every alarm for every drive. Version 2.0 will hopefully be out by 25 February 2023; I have started making updates to it already. It is going to be a cleaned-up version and will incorporate any bug fixes. I do not plan to add any new features, but it's early, and if I get a reasonable request, well, I could add something else.

The only change a person should be making to the script is the name it's called; that is the safe move. If you do not want to use an external configuration file, then 'vi' will edit it, but just save it under the same file name. The trick is to ensure you do not alter the wrong data; the formatting of the script is very important for it to work at all. A simple addition or removal of a quotation mark will cause a complete failure.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Hi Joe,

I just ran 1.6f and did not see where the destination email address is set. I did run -config, and it looks like I need to hand-edit a .txt file to get the reports emailed to me.

Thanks,
Joe
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
If you ran -config, you should have selected the N)ew configuration file option and entered your email address. Then answer the remaining questions. I recommend doing the automatic compensation as well.
 