Impending disk failure: elements in the grown defect list

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Hi All,

I have some Seagate drives that have elements in the grown defect list.

The following tests show no problems:
  1. SMART short test
  2. SMART long test
  3. scrub
I am fairly certain a disk with any sectors in the grown defect list is bad. I assume I cannot see the factory defects with any SMART tool.

Is there a way TrueNAS can alert the administrator of impending disk failure?

Thanks,
Joe
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
If your question is whether a 100% reliable way exists to predict a disk failure within the next X hours, the answer is no. You do not provide any real information on what exactly your system reports. But in general any growing number of "defect" sectors, repeated reads, etc. is a heuristic for an upcoming death of the disk. So in that way you have your alert right now.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
I have no impending disk failure warnings from the SMART long or short test results. Here is an example of a disk I would pitch, with 3 grown defect list entries. TrueNAS has not told me to pitch this disk:
Code:
root@store4[~]# smartctl -x /dev/da36
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST33000650SS
Revision:             0004
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005670ae43
Serial number:        Z297DFxxx
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Wed Oct 12 13:58:50 2022 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        68 C

Manufactured in week 16 of year 2013
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  187
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  189
Elements in grown defect list: 3

Vendor (Seagate Cache) information
  Blocks sent to initiator = 417485354
  Blocks received from initiator = 1295119611
  Blocks read from cache and sent to initiator = 539773485
  Number of read and write commands whose size <= segment size = 401902274
  Number of read and write commands whose size > segment size = 11356

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 64117.93
  number of minutes until next internal SMART test = 5

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1615530308        0         0  1615530308          0     385042.822           0
write:         0        0         0         0          0      23242.169           0

Non-medium error count:      121


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   64092                 - [-   -    -]
# 2  Background short  Completed                   -   64080                 - [-   -    -]
# 3  Background short  Completed                   -   64068                 - [-   -    -]
# 4  Background short  Completed                   -   63995                 - [-   -    -]
# 5  Background short  Completed                   -   63971                 - [-   -    -]
# 6  Background long   Completed                   -   63966                 - [-   -    -]
# 7  Background short  Completed                   -   63922                 - [-   -    -]
# 8  Background short  Completed                   -   63898                 - [-   -    -]


Long (extended) Self-test duration: 27600 seconds [460.0 minutes]

Background scan results log
  Status: scan is active
    Accumulated power on time, hours:minutes 64117:56 [3847076 minutes]
    Number of background scans performed: 894,  scan progress: 79.53%
    Number of background medium scans performed: 894

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 52407:47  0000000156a52d53  [1,18,7]   Recovered via rewrite in-place
   2 52407:47  0000000156d2b50f  [1,18,7]   Recovered via rewrite in-place
   3 52407:47  0000000156d2f7c3  [1,18,7]   Recovered via rewrite in-place
 671 52433:52  0000000158996b10  [1,18,7]   Recovered via rewrite in-place
 >>>> log truncated, fetched 16124 of 49172 available bytes

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005670ae41
    attached SAS address = 0x500143802633353f
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: power on
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005670ae42
    attached SAS address = 0x500143802633353d
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0

root@store4[~]#
 
Last edited by a moderator:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
This drive has 3 defects. This is not terrible. You also have several sectors that were corrected by rewriting the data in the same location, thus refreshing it as well. With over 64,000 hours (about 7.3 years) on the drive, it could be near the end of its life.

To your question of whether you can predict when a drive will fail: SMART was designed in an attempt to provide at least a 24-hour warning that a drive failure may be pending. You have received this warning, and the drive could fail at any time now. It's not the most reliable system, but it's what we have. The thing SMART cannot predict is motor problems or other mechanical failures. It has some values that try to predict that, but sudden failures do happen, for example when the spindle motor won't keep spinning, if it spins at all.

For myself, I would monitor the number of defects and, if they start going up, replace the drive. If the defect list goes above 5, I'd replace the drive.
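As a sketch of how that rule could be automated (the device name, the threshold, and the sample smartctl line are assumptions; in real use you would capture `smartctl -x /dev/daX` output):

```shell
#!/bin/sh
# Hypothetical check: warn when the grown defect list exceeds a threshold.
THRESHOLD=5

# In real use: SMART_OUT=$(smartctl -x /dev/da36)
# A sample line stands in here so the sketch is self-contained:
SMART_OUT='Elements in grown defect list: 3'

# Pull the number after the colon from the matching line.
defects=$(printf '%s\n' "$SMART_OUT" | awk -F': *' '/grown defect list/ {print $2}')

if [ "${defects:-0}" -gt "$THRESHOLD" ]; then
    echo "REPLACE: $defects entries in grown defect list"
else
    echo "OK: $defects entries in grown defect list"
fi
```
With the sample value of 3 and a threshold of 5, this prints the OK branch; run from cron, the REPLACE branch could instead send mail.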

The decision is yours. I fixed your posting to include CODE tags.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
This drive has 3 defects. This is not terrible. You also have several sectors that were corrected by rewriting the data in the same location, thus refreshing it as well. With over 64,000 hours (about 7.3 years) on the drive, it could be near the end of its life.

To your question of whether you can predict when a drive will fail: SMART was designed in an attempt to provide at least a 24-hour warning that a drive failure may be pending. You have received this warning, and the drive could fail at any time now. It's not the most reliable system, but it's what we have. The thing SMART cannot predict is motor problems or other mechanical failures. It has some values that try to predict that, but sudden failures do happen, for example when the spindle motor won't keep spinning, if it spins at all.

For myself, I would monitor the number of defects and, if they start going up, replace the drive. If the defect list goes above 5, I'd replace the drive.

The decision is yours. I fixed your posting to include CODE tags.

Hi Joe,

Thanks for fixing my CODE tag, I will eventually remember to do that every time. :smile:

I had to truncate that log, as the "rewrite in-place" recovery occurred 671 times. It looks like two things need to be charted with Nagios:
  1. Number of rewrites in place per day
  2. Number of elements in the grown defect list
If I see those two metrics continue to climb, I should replace the drive. This seems like a lot of work to do weekly to find drives that are about to fail.
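A sketch of how those two metrics could be pulled out for charting (the sample text stands in for real `smartctl -x` output, and the perfdata line is just one hypothetical Nagios-style format):

```shell
#!/bin/sh
# Sketch: extract the two metrics to chart from smartctl -x output.
# Sample text is used so the sketch runs as-is; in real use:
#   SMART_OUT=$(smartctl -x /dev/da36)
SMART_OUT='Elements in grown defect list: 3
   1 52407:47  0000000156a52d53  [1,18,7]   Recovered via rewrite in-place
   2 52407:47  0000000156d2b50f  [1,18,7]   Recovered via rewrite in-place'

# Grown defect count: the number after the colon on the matching line.
grown=$(printf '%s\n' "$SMART_OUT" | awk -F': *' '/Elements in grown defect list/ {print $2}')
# Rewrite-in-place count: number of matching lines in the background scan log.
rewrites=$(printf '%s\n' "$SMART_OUT" | grep -c 'Recovered via rewrite in-place')

# Nagios-style perfdata: chart these daily and alert when they climb.
echo "OK | grown_defects=$grown rewrites_in_place=$rewrites"
```
On the sample text this reports 3 grown defects and 2 rewrite-in-place events; against a real drive, the counts come from whatever portion of the log smartctl returns.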

Thanks,
Joe
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Hi Joe,
With respect to monitoring the drives each week being a chore, you could try out the script I've been updating in the Resources section, called Multi-Report. It generates an email and presents a nice little chart of all your data. It also has alarm features. BUT, and this is a big but, I have very limited SAS drive support. Give the script a try. If it doesn't report the drive data needed, then toss me a message via "Conversations". I will see if I can add proper decoding of your drive data to the script.

Take care,
Joe
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Hi Joe,
With respect to monitoring the drives each week being a chore, you could try out the script I've been updating in the Resources section, called Multi-Report. It generates an email and presents a nice little chart of all your data. It also has alarm features. BUT, and this is a big but, I have very limited SAS drive support. Give the script a try. If it doesn't report the drive data needed, then toss me a message via "Conversations". I will see if I can add proper decoding of your drive data to the script.

Take care,
Joe
I have downloaded the script and I get an error when running it:

Code:
store254# ./multi_report.sh
./multi_report.sh: line 250: .
pool_capacity=zfs       # Select zfs or zpool for Zpool Status Report - Pool Size and Free Space capacities. zfs is default.

# Ignore or Activate Alarms
ignoreUDMA=false        # Set to true to ignore all UltraDMA CRC Errors for the summary alarm (Email Header) only, errors will appear in the graphical chart.
ignoreSeekError=true    # Set to true to ignore all Seek Error Rate/Health errors.  Default is true.
ignoreReadError=true    # Set to true to ignore all Seek Error Rate/Health errors.  Default is true.
ignoreMultiZone=false   # Set to true to ignore all MultiZone Errors. Default is false.
disableWarranty=true    # Set to true: File name too long
Multi-Report v1.6d-2 dtd:2022-10-09 (TrueNAS Core 12.0-U8)
No Config File Exists
Checking for a valid email within the script...
Valid email within the script = , using script parameters...


 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
I'm going to go out on a limb and say that you likely incorrectly edited the script, maybe adding your email address or changing something else, and you didn't keep the exact format required (like quotation marks around a value). Rather than trying to figure out what went wrong, place a clean copy of the script on the pool, then run ./multi_report.sh -config and select "N" for a new configuration file. Answer the questions for your email address, and once done, a clean configuration file will be generated. Next, run the script normally (./multi_report.sh) and it should work fine without any errors.

Let me know if that fixes it. Also, you can customize the most-used parameters by using -config and selecting the Advanced option. Make sure you select Write the file, or none of your changes will take effect. When using the external configuration file, any script updates in the future will be a simple replacement of the script file, and no further configuration is required. I'm trying to make it simple.

I will be posting a new version soon. I've made some changes to sort the pool names and drive IDs, added another hard drive parameter for helium, and just little things like that. I need to finish cleaning up the test code I use to test it out, and run it on SCALE, before I can post it. So maybe in a few days.

Good Luck
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Well, I am filling one of my TrueNAS filers with dd data to 79% capacity, and I am seeing some disks take nosedives during the write process. The metric from smartctl -x that is rapidly incrementing is "Total errors corrected" on the read row. So even though these drives pass the SMART long and short tests and scrubs, they are hurting performance.


"total errors corrected" in a 30 minute span : 2,749,244

Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 585247624 1 0 585247625 1 2005017.224 0 write: 0 0 0 0 0 157628.574 0 verify: 2347400210 0 0 2347400210 1 575072.209 1
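The rate behind that number can be computed from two snapshots of the error counter log taken 30 minutes apart; a minimal sketch with stand-in values (the "after" value is the read row's "Total errors corrected" from the log here; the "before" value is an assumed earlier sample):

```shell
#!/bin/sh
# Sketch: delta of 'Total errors corrected' (read row) between two samples.
# In real use, both values would come from parsing smartctl -x output twice.
before=582498381   # assumed earlier sample
after=585247625    # read row, 'Total errors corrected', later sample

delta=$((after - before))
echo "corrected read errors in 30 minutes: $delta"   # prints 2749244
```
Charting this delta per interval, rather than the raw lifetime counter, is what makes a sudden spike like this visible.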

[attached screenshot: 1669760306239.png]
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
What command are you using to run the "dd"? Also, please provide the entire SMART output for the drive so we can see everything.

Thanks.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
What command are you using to run the "dd"? Also, please provide the entire SMART output for the drive so we can see everything.

Thanks.
It was the other member of the mirror (da20) that had more read errors (318 million) from 9:07 AM to 5:30 PM today.

Code:
dd if=/dev/random of=/mnt/vold/benchmark/dataRandFile.dd bs=1024k count=102400
for i in {100..620}; do cp -v /mnt/vold/benchmark/dataRandFile.dd "/mnt/vol53b/test/dataRandFile_$i.dd" ; done

smartctl -x results:
Code:
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST33000650SS
Revision:             RS17
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5005580c8f7
Serial number:        Z294BXYZ
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Tue Nov 29 17:24:06 2022 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     39 C
Drive Trip Temperature:        68 C

Manufactured in week 39 of year 2012
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  104
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1544
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 418587194
  Blocks received from initiator = 1469781713
  Blocks read from cache and sent to initiator = 2406111958
  Number of read and write commands whose size <= segment size = 1913627728
  Number of read and write commands whose size > segment size = 527519

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 69398.90
  number of minutes until next internal SMART test = 55

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   2812573004        0         0  2812573004          0     353438.300           0
write:           0        0         0           0          0     333095.254           0
verify: 4095301962        0         0  4095301962          0     576468.932           0

Non-medium error count:    73927

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  32   65535                 - [-   -    -]
# 2  Background short  Completed                  32   65535                 - [-   -    -]
# 3  Background long   Completed                  32   65535                 - [-   -    -]
# 4  Background short  Completed                  32   65535                 - [-   -    -]
# 5  Background short  Completed                  32   65535                 - [-   -    -]
# 6  Background short  Completed                  32   65535                 - [-   -    -]
# 7  Background short  Completed                  32   65535                 - [-   -    -]
# 8  Background short  Completed                  32   65535                 - [-   -    -]
# 9  Background short  Completed                  32   65535                 - [-   -    -]
#10  Background short  Completed                  32   65535                 - [-   -    -]
#11  Background short  Completed                  32   65535                 - [-   -    -]
#12  Background short  Completed                  32   65535                 - [-   -    -]
#13  Background long   Completed                  32   65535                 - [-   -    -]
#14  Background short  Completed                  32   65535                 - [-   -    -]
#15  Background short  Completed                  32   65535                 - [-   -    -]
#16  Background short  Completed                  32   65535                 - [-   -    -]
#17  Background short  Completed                  32   65535                 - [-   -    -]
#18  Background long   Completed                  32   65535                 - [-   -    -]
#19  Background short  Completed                  32   65535                 - [-   -    -]
#20  Background short  Completed                  32   65535                 - [-   -    -]

Long (extended) Self-test duration: 27600 seconds [460.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 69398:54 [4163934 minutes]
    Number of background scans performed: 243,  scan progress: 0.00%
    Number of background medium scans performed: 243

   #  when        lba(hex)    [sk,asc,ascq]    reassign_status
   1 57604:27  00000000200220a0  [1,18,7]   Recovered via rewrite in-place
   2 57604:30  0000000022b4a83f  [1,18,7]   Recovered via rewrite in-place
   3 57604:32  000000002491473f  [1,18,7]   Recovered via rewrite in-place
   4 57604:35  0000000026bac36d  [1,18,7]   Recovered via rewrite in-place
   5 57604:36  0000000027cc589a  [1,18,7]   Recovered via rewrite in-place
   6 57604:41  000000002bdc6d0c  [1,18,7]   Recovered via rewrite in-place
   7 57604:41  000000002c06952d  [1,18,7]   Recovered via rewrite in-place
   8 57604:41  000000002c5d9598  [1,18,7]   Recovered via rewrite in-place
   9 57604:42  000000002c5eb7f9  [1,18,7]   Recovered via rewrite in-place
  10 57604:42  000000002d14c6ea  [1,18,7]   Recovered via rewrite in-place
 ...
 663 43364:42  0000000004c21ad5  [1,18,7]   Recovered via rewrite in-place
 664 43365:04  00000000614206fd  [1,18,7]   Recovered via rewrite in-place
 665 43365:43  000000015c16006f  [1,18,7]   Recovered via rewrite in-place
 666 43367:44  0000000033e22af2  [1,18,7]   Recovered via rewrite in-place
 667 43369:45  0000000004b21b43  [1,18,7]   Recovered via rewrite in-place
 668 43373:55  00000000007a2580  [1,18,7]   Recovered via rewrite in-place
 669 43374:15  000000006cb60696  [1,18,7]   Recovered via rewrite in-place
 670 43379:07  000000000502061c  [1,18,7]   Recovered via rewrite in-place
 671 43379:19  00000000027a1a27  [1,18,7]   Recovered via rewrite in-place
 >>>> log truncated, fetched 16124 of 49172 available bytes

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: power on
    reason: unknown
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005580c8f5
    attached SAS address = 0x5001438023692ebf
    attached phy identifier = 4
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: power on
    reason: power on
    negotiated logical link rate: phy enabled; 6 Gbps
    attached initiator port: ssp=0 stp=0 smp=1
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c5005580c8f6
    attached SAS address = 0x5001438023692ebd
    attached phy identifier = 4
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Unfortunately, I've got nothing solid. It does not appear the drive is failing; maybe the read/write errors are from the drive's read/write buffers, where the drive reads more data anticipating the next piece will be requested, but then it's not.

The good indicators I see are: everything was correctable, and there are no grown defect list entries. So I think the drive is doing well. Run a single pass of badblocks (all of its test patterns) and see what the results are. If there are no errors after that, I would chalk the errors up to the read/write buffers.
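That suggestion can be sketched as a tiny helper that assembles the badblocks command line; everything here is an example (the device name is hypothetical, and -w is destructive, so it must only be run on a drive whose contents you can afford to lose):

```shell
#!/bin/sh
# Sketch: assemble a destructive badblocks run for a suspect drive.
# /dev/da36 is an example device name, not a recommendation.
# -w writes and verifies badblocks' test patterns (DESTROYS all data on
# the device), -s shows progress, -v reports any bad blocks found.
DEVICE=/dev/da36
CMD="badblocks -wsv $DEVICE"

# Print the command instead of running it, as a safety measure:
echo "Would run: $CMD"
# To actually run it (wipes the drive!): eval "$CMD"
```
Printing rather than executing is deliberate here: on a pool member, the drive must first be offlined and removed from the pool before a write-mode badblocks pass.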
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Unfortunately, I've got nothing solid. It does not appear the drive is failing; maybe the read/write errors are from the drive's read/write buffers, where the drive reads more data anticipating the next piece will be requested, but then it's not.

The good indicators I see are: everything was correctable, and there are no grown defect list entries. So I think the drive is doing well. Run a single pass of badblocks (all of its test patterns) and see what the results are. If there are no errors after that, I would chalk the errors up to the read/write buffers.
I will run badblocks on the two suspect drives, as well as fill them from the NVMe disk with that dd file. If I see a single disk slow down, it could be due to a weak area on the disk.

I am trying to read this old blog post explaining ZFS end-to-end checksums: https://hypercritical.co/fatbits/2005/12/09/zfs-data-integrity-explained
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
I'm going to go out on a limb and say that you likely incorrectly edited the script

I think I messed it up by trying to paste the contents into a new file with vi. I have copied the file to the TrueNAS filer, ran -config, and am now running it on one of my filers.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Version 1.6f is out, and there have been a lot of changes and features added, mainly for those folks who like to customize every alarm for every drive. Version 2.0 will hopefully be out by 25 February 2023; I have started making updates to it already. It is going to be a cleaned-up version and will incorporate any bug fixes. I do not plan to add any new features, but it's early, and if I get a reasonable request, well, I could add something else.

The only change a person should be making to the script is the name it's called; that is the safe move. If you do not want to use an external configuration file, then 'vi' will edit it, but just save it under the same file name. The trick is to ensure you do not alter the wrong data; the formatting of the script is very important for it to work at all. A simple addition or removal of a quotation mark will cause a complete failure.
 

JoeAtWork

Contributor
Joined
Aug 20, 2018
Messages
165
Hi Joe,

I just ran 1.6f and did not see where the destination email address is set. I did run -config, and it looks like I need to hand-edit a .txt file to get the reports emailed to me.

Thanks,
Joe
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
If you ran -config, you should have selected the N)ew configuration file option and entered your email address. Then answer the remaining questions. I recommend doing the automatic compensation as well.
 