I think I have a dead hard drive, but can you help me verify?

Status
Not open for further replies.

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Hi there!

After many years of my system running great, I think I have a dead drive! I believe I set everything up properly initially so I should be able to replace it, but I want to make sure I'm properly diagnosing the issue and looking in the right places.

Screen Shot 2018-09-16 at 9.06.36 AM.png


I see this in the notifications section. That doesn't look good to me. I'm assuming ADA2 is busted. These are just in my notifications, though, so I wanted to dig into the pool a bit. When I go and click the pool, it shows a status of "healthy." See below.

Screen Shot 2018-09-16 at 9.11.43 AM.png

So now in this one, it shows every disk as online, and there are zero errors on the pool level. Is this all expected? Also, while it's possible there is something I can't see going on, performance is now awful. Like... I can't seem to get Plex movies to stream anymore. Here's my hunch... My hunch is that the drive is in bad enough shape that I'm seeing errors thrown in the notification section, but it's in okay enough shape that a quick status check is showing the drive is fine... Is that right?

I'm just looking for validation and/or correction on this. I'm about ready to order a new 4TB drive to replace ADA2, but I want to make sure I'm thinking about this properly. Thanks!!
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
After many years of my system running great, I think I have a dead drive! I believe I set everything up properly initially so I should be able to replace it, but I want to make sure I'm properly diagnosing the issue and looking in the right places.
The error is not that it is 'dead' but that it had bad sectors. Still, that means it is time to replace the drive because it is in the process of failing.
I see this in the notifications section. That doesn't look good to me. I'm assuming ADA2 is busted. These are just in my notifications, though, so I wanted to dig into the pool a bit. When I go and click the pool, it shows a status of "healthy." See below.
That is one thing that doesn't make sense about how hard drives report their status. They show as being healthy right up to the point where they don't work at all, even if they have massive numbers of bad sectors.
So now in this one, it shows every disk as online, and there are zero errors on the pool level.
You don't want to wait until there are pool errors, because that would mean your data has been damaged. Now is the time to replace the drive, before it gets bad enough that data is damaged.
Also, while it's possible there is something I can't see going on, performance is now awful. Like... I can't seem to get Plex movies to stream any more.
Depending on the kind of drive you are using, the drive could be repeatedly trying to read the damaged sectors, and that would drag down the system's overall performance.
I'm just looking for validation and/or correction on this. I'm about ready to order a new 4TB drive to replace ADA2, but I want to make sure I'm thinking about this properly. Thanks!!
I keep a couple of spares on hand so I am ready to replace a drive right away. If you need to order one, it is definitely time to pull the trigger. If all the drives you have are around the same age, you might want to get a couple, because some of the others could go at any time.
Here is some reading to help you:

Slideshow explaining VDev, zpool, ZIL and L2ARC
https://forums.freenas.org/index.ph...ning-vdev-zpool-zil-and-l2arc-for-noobs.7775/

Terminology and Abbreviations Primer
https://forums.freenas.org/index.php?threads/terminology-and-abbreviations-primer.28174/

Building, Burn-In, and Testing your FreeNAS system
https://forums.freenas.org/index.php?resources/building-burn-in-and-testing-your-freenas-system.38/

Github repository for FreeNAS scripts, including disk burnin
https://forums.freenas.org/index.ph...for-freenas-scripts-including-disk-burnin.28/

Useful Commands
https://forums.freenas.org/index.php?threads/useful-commands.30314/#post-195192

Hard Drive Troubleshooting Guide (All Versions of FreeNAS)
https://forums.freenas.org/index.ph...bleshooting-guide-all-versions-of-freenas.17/

Disk Price/Performance Analysis Buying Information
https://forums.freenas.org/index.ph...e-performance-analysis-buying-information.62/
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
After many years of my system running great,
How many years? Hard drives typically last around five years. If yours are older than that, it might be time to plan a staged replacement of all the drives with newer, possibly larger, drives.
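A rough way to answer the "how many years" question is the Power_On_Hours SMART attribute. The sketch below is only illustrative arithmetic: it uses the 37272 hours that the ada2 report later in this thread shows, and the helper name `power_on_years` is made up for the example. Power-on hours count running time, not calendar age, so for a NAS that is rarely switched off the two should be close.

```python
# Average year length in hours (365.25 days), used only for a rough estimate.
HOURS_PER_YEAR = 24 * 365.25


def power_on_years(power_on_hours):
    """Convert a Power_On_Hours raw value into years of running time."""
    return power_on_hours / HOURS_PER_YEAR


# 37272 is the Power_On_Hours raw value from the ada2 report in this thread.
years = power_on_years(37272)
print(f"{years:.2f} years powered on")
```

That works out to a little over four years of spinning, which is in the range where a staged replacement plan starts to make sense.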
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
You should post the output of smartctl -a /dev/ada2 and we can all take a look at the drive status.

Your error message looks very bad, over 40,000 sector errors! Are you running SMART Long tests at all? This should have warned you well in advance.
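If you would rather gather the reports for every drive in one go, here is a minimal Python sketch. It only wraps the `smartctl -a` command mentioned above and the `smartctl -t long` command that starts a long self-test. Every device name other than /dev/ada2 is an assumption (substitute your own), and `collect_reports` is a hypothetical helper, not something that ships with FreeNAS or smartmontools.

```python
import subprocess


def smartctl_report_command(device):
    """Command line that prints the full SMART report for one device."""
    return ["smartctl", "-a", device]


def long_test_command(device):
    """Command line that starts a SMART long (extended) self-test."""
    return ["smartctl", "-t", "long", device]


def collect_reports(devices, run=subprocess.run):
    """Run `smartctl -a` on each device and return {device: report text}.

    smartctl uses its exit status as a bit mask that can be non-zero when a
    drive reports problems, so check=True is deliberately not used here.
    The `run` parameter exists only so the function can be tried without
    real disks; by default it is subprocess.run.
    """
    reports = {}
    for device in devices:
        result = run(smartctl_report_command(device),
                     capture_output=True, text=True)
        reports[device] = result.stdout
    return reports
```

Run as root on the NAS, something like `collect_reports(["/dev/ada0", "/dev/ada1", "/dev/ada2", "/dev/ada3"])` would return one report per drive, ready to paste into a reply.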
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,079
It would be a good thing to see the SMART output of each of the drives. Are you able to SSH into your system?
The output should look something like this:
Code:
root@Emily-NAS:~ # smartctl -a /dev/da19
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital Gold
Device Model:	 WDC WD6002FRYZ-01WD5B0
Serial Number:	xxxxxxxxx
LU WWN Device Id: 5 000cca 255e90634
Firmware Version: 01.01M02
User Capacity:	6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	7200 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Sun Sep 16 15:03:43 2018 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(  113) seconds.
Offline data collection
capabilities:					(0x5b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										No Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 769) minutes.
SCT capabilities:			  (0x003d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000b   100   100   016	Pre-fail  Always	   -	   0
  2 Throughput_Performance  0x0005   138   138   054	Pre-fail  Offline	  -	   100
  3 Spin_Up_Time			0x0007   100   100   024	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0012   100   100   000	Old_age   Always	   -	   3
  5 Reallocated_Sector_Ct   0x0033   100   100   005	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000b   100   100   067	Pre-fail  Always	   -	   0
  8 Seek_Time_Performance   0x0005   140   140   020	Pre-fail  Offline	  -	   15
  9 Power_On_Hours		  0x0012   100   100   000	Old_age   Always	   -	   1399
 10 Spin_Retry_Count		0x0013   100   100   060	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   3
192 Power-Off_Retract_Count 0x0032   099   099   000	Old_age   Always	   -	   1747
193 Load_Cycle_Count		0x0012   099   099   000	Old_age   Always	   -	   1747
194 Temperature_Celsius	 0x0002   157   157   000	Old_age   Always	   -	   38 (Min/Max 25/50)
196 Reallocated_Event_Count 0x0032   100   100   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0022   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0008   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x000a   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	  1385		 -
# 2  Short offline	   Completed without error	   00%	  1361		 -
# 3  Extended offline	Completed without error	   00%	  1333		 -
# 4  Short offline	   Completed without error	   00%	  1313		 -
# 5  Extended offline	Completed without error	   00%	  1285		 -
# 6  Short offline	   Completed without error	   00%	  1265		 -
# 7  Short offline	   Completed without error	   00%	  1241		 -
# 8  Short offline	   Completed without error	   00%	  1217		 -
# 9  Short offline	   Completed without error	   00%	  1192		 -
#10  Extended offline	Completed without error	   00%	  1165		 -
#11  Short offline	   Completed without error	   00%	  1144		 -
#12  Extended offline	Completed without error	   00%	  1117		 -
#13  Short offline	   Completed without error	   00%	  1096		 -
#14  Short offline	   Completed without error	   00%	  1072		 -
#15  Short offline	   Completed without error	   00%	  1048		 -
#16  Short offline	   Completed without error	   00%	  1024		 -
#17  Extended offline	Completed without error	   00%	   997		 -
#18  Short offline	   Completed without error	   00%	   976		 -
#19  Extended offline	Completed without error	   00%	   949		 -
#20  Short offline	   Completed without error	   00%	   928		 -
#21  Short offline	   Completed without error	   00%	   904		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Emily-NAS:~ #

 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Thanks to everyone for the useful replies! I ordered a new 4TB drive, and I'll SSH in and post output tonight. I don't recall earlier warnings about the drive.
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Here is the output!
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.2-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate NAS HDD
Device Model:	 ST4000VN000-1H4168
Serial Number:	Z300ZDFR
LU WWN Device Id: 5 000c50 06516378f
Firmware Version: SC43
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:	Mon Sep 17 21:28:28 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  ( 121)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection:		 (  107) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   1) minutes.
Extended self-test routine
recommended polling time:	 ( 532) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:			(0x10bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   110   085   006	Pre-fail  Always	   -	   159262736
  3 Spin_Up_Time			0x0003   091   091   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   87
  5 Reallocated_Sector_Ct   0x0033   081   081   010	Pre-fail  Always	   -	   24240
  7 Seek_Error_Rate		 0x000f   086   060   030	Pre-fail  Always	   -	   425269966
  9 Power_On_Hours		  0x0032   058   058   000	Old_age   Always	   -	   37272
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   91
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   001   001   000	Old_age   Always	   -	   272
188 Command_Timeout		 0x0032   099   094   000	Old_age   Always	   -	   193276477485
189 High_Fly_Writes		 0x003a   001   001   000	Old_age   Always	   -	   496
190 Airflow_Temperature_Cel 0x0022   045   037   045	Old_age   Always   FAILING_NOW 55 (Min/Max 45/62 #8821)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   34
193 Load_Cycle_Count		0x0032   100   100   000	Old_age   Always	   -	   91
194 Temperature_Celsius	 0x0022   055   063   000	Old_age   Always	   -	   55 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   001   001   000	Old_age   Always	   -	   55792
198 Offline_Uncorrectable   0x0010   001   001   000	Old_age   Offline	  -	   55792
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0

SMART Error Log Version: 1
ATA Error Count: 272 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 272 occurred at disk power-on lifetime: 37070 hours (1544 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 60 ff ff ff 4f 00  27d+10:41:26.155  READ FPDMA QUEUED
  60 00 f8 ff ff ff 4f 00  27d+10:41:26.153  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:41:26.152  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:41:26.152  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:41:26.152  READ FPDMA QUEUED

Error 271 occurred at disk power-on lifetime: 37070 hours (1544 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  27d+10:32:41.002  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:32:41.002  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:32:41.002  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:32:41.002  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:32:41.002  READ FPDMA QUEUED

Error 270 occurred at disk power-on lifetime: 37070 hours (1544 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 58 ff ff ff 4f 00  27d+10:30:13.190  READ FPDMA QUEUED
  60 00 58 ff ff ff 4f 00  27d+10:30:13.180  READ FPDMA QUEUED
  60 00 58 ff ff ff 4f 00  27d+10:30:12.429  READ FPDMA QUEUED
  60 00 58 ff ff ff 4f 00  27d+10:30:12.362  READ FPDMA QUEUED
  60 00 58 ff ff ff 4f 00  27d+10:30:12.356  READ FPDMA QUEUED

Error 269 occurred at disk power-on lifetime: 37070 hours (1544 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00  27d+10:29:23.660  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:23.659  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:23.659  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:23.659  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:23.659  READ FPDMA QUEUED

Error 268 occurred at disk power-on lifetime: 37070 hours (1544 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 c0 ff ff ff 4f 00  27d+10:29:04.008  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:04.008  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:04.008  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:04.008  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  27d+10:29:04.008  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed: read failure	   90%	 37118		 -
# 2  Short offline	   Completed: read failure	   30%	 36998		 -
# 3  Extended offline	Interrupted (host reset)	  90%	 36920		 -
# 4  Short offline	   Completed: read failure	   50%	 36877		 -
# 5  Short offline	   Completed: read failure	   90%	 36733		 -
# 6  Short offline	   Completed: read failure	   80%	 36613		 -
# 7  Extended offline	Completed: read failure	   50%	 36539		 3989485600
# 8  Short offline	   Completed: read failure	   80%	 36493		 -
# 9  Short offline	   Completed: read failure	   80%	 36373		 -
#10  Short offline	   Completed: read failure	   40%	 36253		 -
#11  Extended offline	Completed: read failure	   50%	 36179		 3989485600
#12  Short offline	   Completed: read failure	   90%	 36133		 -
#13  Short offline	   Completed: read failure	   90%	 35989		 -
#14  Short offline	   Completed: read failure	   90%	 35869		 -
#15  Extended offline	Completed: read failure	   10%	 35800		 -
#16  Short offline	   Completed: read failure	   90%	 35749		 -
#17  Short offline	   Completed: read failure	   90%	 35629		 -
#18  Extended offline	Completed without error	   00%	 31459		 -
#19  Short offline	   Completed without error	   00%	 31408		 -
#20  Short offline	   Completed without error	   00%	 31288		 -
#21  Short offline	   Completed without error	   00%	 31168		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Other than the temperature, can someone help me interpret this a bit better? Thanks!
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Other than the temperature, can someone help me interpret this a bit better? Thanks!
Way too hot, tens of thousands of bad sectors, failing every SMART self-test in the last 5000 hours--stick a fork in it, it's done. Be sure to burn in the replacement before putting it in service.
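To see the self-test history at a glance, the log section of the report can be summarized with a few lines of Python. This is a sketch, not a smartmontools feature: the regular expression assumes the column layout shown in the report above, the sample rows are a subset copied from that report, and `summarize` is a name invented for the example.

```python
import re

# One self-test log row: "# N", test type, status, remaining %, lifetime hours, LBA.
ENTRY = re.compile(r"^#\s*(\d+)\s+(\w+ \w+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)$")

# A few rows copied from the ada2 report in this thread.
SAMPLE_LOG = """
# 1  Short offline       Completed: read failure       90%     37118         -
# 3  Extended offline    Interrupted (host reset)      90%     36920         -
# 7  Extended offline    Completed: read failure       50%     36539         3989485600
#17  Short offline       Completed: read failure       90%     35629         -
#18  Extended offline    Completed without error       00%     31459         -
"""


def parse_self_tests(log_text):
    """Return one dict per self-test row found in the text."""
    entries = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            num, kind, status, _remaining, hours, _lba = match.groups()
            entries.append({"num": int(num), "kind": kind,
                            "status": status, "hours": int(hours)})
    return entries


def summarize(log_text, power_on_hours):
    """Count read failures and measure how long ago a test last passed."""
    entries = parse_self_tests(log_text)
    failed = [e for e in entries if "read failure" in e["status"]]
    passed = [e for e in entries if e["status"] == "Completed without error"]
    last_pass = max((e["hours"] for e in passed), default=None)
    return {
        "failed": len(failed),
        "last_pass_hours": last_pass,
        "hours_since_pass": None if last_pass is None
        else power_on_hours - last_pass,
    }


# 37272 is the Power_On_Hours raw value from the same report.
print(summarize(SAMPLE_LOG, power_on_hours=37272))
```

With the full log pasted in, the same function shows that the last clean test was at 31459 hours, thousands of power-on hours before the report was taken.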
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
IDs 5, 190, 197, and 198 are all bad. Anything other than zero is not good, and above 10 it's very bad. Your SMART tests have been failing for a very long time; why haven't you paid attention to them? The reason for running these tests is to let you know when a drive is beginning to fail, but you have let this one go far too long. Do you have email notifications set up? I'd also recommend that you run a SMART long test on all your other drives and post the results. Let's make sure that you don't have other failures, and if you do, let's address those to protect your data. Sorry I sound a bit harsh, I just don't want you to be put into this predicament again.
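The rule of thumb above (zero is fine, anything else is not good, more than 10 is very bad) can be applied to the attribute table mechanically. The following Python sketch is one way to do it, under a few assumptions: the rule is applied to the RAW_VALUE column of IDs 5, 197, and 198 only, any other attribute (such as 190) is flagged through its WHEN_FAILED column instead because its raw value is a temperature, and the function names are invented for the example. The sample rows are copied from the ada2 report above.

```python
# Attributes whose raw value is a count of bad or suspect sectors.
SECTOR_IDS = {5, 197, 198}

SAMPLE_ATTRIBUTES = """
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   081   081   010    Pre-fail  Always       -       24240
190 Airflow_Temperature_Cel 0x0022   045   037   045    Old_age   Always   FAILING_NOW 55 (Min/Max 45/62 #8821)
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       55792
198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       55792
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Return {id: (name, when_failed, raw_value)} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)  # ten columns, raw value last
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header or unrelated line
        raw_first = fields[9].split()[0]  # drop trailing notes like "(Min/Max ...)"
        if raw_first.isdigit():
            attrs[int(fields[0])] = (fields[1], fields[8], int(raw_first))
    return attrs


def classify_sector_count(raw):
    """Apply the zero / not good / very bad rule of thumb to a raw count."""
    if raw == 0:
        return "ok"
    if raw <= 10:
        return "not good"
    return "very bad"


def assess(text):
    """Return {id: verdict} for every attribute that deserves attention."""
    findings = {}
    for attr_id, (_name, when_failed, raw) in parse_attributes(text).items():
        if attr_id in SECTOR_IDS:
            verdict = classify_sector_count(raw)
            if verdict != "ok":
                findings[attr_id] = verdict
        elif when_failed != "-":
            findings[attr_id] = when_failed
    return findings


print(assess(SAMPLE_ATTRIBUTES))
```

The raw values on rows 5, 197, and 198 are where the "tens of thousands" of bad sectors show up in the report.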
 

hungarianhc

Patron
Joined
Mar 11, 2014
Messages
234
Thanks to both of you for the feedback, and I don't mind the harsh tone.

Sorry I sound a bit harsh, I just don't want you to be put into this predicament again.

Dumb question - you mention my "predicament." I'm running a RAID-Z2. I have a drive that is almost failed, and I'll be replacing it tonight. Aren't I kinda doing things how they're supposed to be done? At this point, I can still lose another drive and not suffer data loss. Of course I'd hate to be running unprotected... Maybe I have a false sense of satisfaction, but where I sit, I'm thinking, "Well, you invested in a RAID-Z2 setup years ago, and now you have a failed drive, and you shall replace the drive, and everything in life is good!" So... what am I missing? It appears that I should be more nervous here.

Way too hot, tens of thousands of bad sectors, failing every SMART self-test in the last 5000 hours--stick a fork in it, it's done. Be sure to burn in the replacement before putting it in service.
I'd like to be able to read these SMART reports a bit better. Where are you seeing the tens of thousands of errors? Which line?

I see some of the following red flags:
Error 272 occurred at disk power-on lifetime: 37070 hours (1544 days + 14 hours)

I also see these read errors:
SMART Self-test log structure revision number 1
Num  Test_Description  Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed: read failure   90%        37118            -
# 2  Short offline     Completed: read failure   30%        36998            -
# 3  Extended offline  Interrupted (host reset)  90%        36920            -
# 4  Short offline     Completed: read failure   50%        36877            -
# 5  Short offline     Completed: read failure   90%        36733            -
# 6  Short offline     Completed: read failure   80%        36613            -
# 7  Extended offline  Completed: read failure   50%        36539            3989485600
# 8  Short offline     Completed: read failure   80%        36493            -
# 9  Short offline     Completed: read failure   80%        36373            -
#10  Short offline     Completed: read failure   40%        36253            -
#11  Extended offline  Completed: read failure   50%        36179            3989485600

In regard to the temperature, I live in a small environment with limited space, and there isn't much I can do about it. If that means my drives fail sooner, I can deal with the replacement costs. I also have two off-site backups in two different cities - so I feel fairly protected in terms of how safe my data is, but perhaps I'm missing some things.

THANK YOU SO MUCH FOR THE HELP!
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
you mention my "predicament."
Your predicament is that you let a failing hard drive run for a very long time (probably several months) without replacing it. Under RAIDZ2 you can lose two drives, and the loss of a third drive leads to the loss of all data. If you let this one drive get this bad for so long, what do the other drives look like?
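The redundancy arithmetic is simple enough to write down. This sketch assumes a pool made of a single RAIDZ vdev and treats a drive as either fully working or fully failed, which is optimistic for a drive with tens of thousands of bad sectors. The function names are invented for the example.

```python
# Number of drives each RAIDZ level can lose per vdev without losing data.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def pool_survives(level, failed_drives):
    """True while the number of failed drives does not exceed the parity."""
    return failed_drives <= PARITY[level]


def failures_still_tolerated(level, failed_drives):
    """How many more drives can fail before data is lost (never below 0)."""
    return max(PARITY[level] - failed_drives, 0)


# One bad drive in a RAIDZ2 vdev leaves room for exactly one more failure.
print(failures_still_tolerated("raidz2", 1))
print(pool_survives("raidz2", 3))
```

With one drive already failing, a RAIDZ2 vdev is effectively down to single-drive protection, which is why the state of the remaining drives matters so much.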

For a better understanding of hard drive failures, see the Hard Drive Troubleshooting Guide in my link. The fact that you are unaware of what is or isn't a failure is a concern, and no one wants to see someone lose data when it can be prevented. If you are not being notified of the failures, then you need to set up your email so you get these warnings.
 