WD80EFAX issues

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
I have a bunch of 8TB WD Reds (amongst other brands/types) in two pools, and half of them have run flawlessly for years now. The second pool is a recent addition, and during the burn-in phase I discovered that one drive would be thrown out after a day or a couple of days (I'm a bit hazy on the exact timing, as I was focusing on solving the problem first and then unfortunately got ill for a while). I suspected the drive, the SATA cable and the port on the controller, so I gradually changed all three. The replacement drive, which has been buzzing happily for a week or two now, is using a new cable on a new port.

What baffles me is that the rejected drive checks out both SMART-wise and in a surface scan. Its SMART values do not seem radically different from those of a drive that's functioning on the same controller right now; however, one error was recorded (see below). Because the SMART test reports 'passed', an RMA will probably also be impossible, so I'd like to know if there's something I can still do with this drive, apart from using it as an expensive door stop...

What could cause this?

Faulty drive:
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD80EFAX-68KNBN0
Firmware Version: 81.00A81
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Mar 30 15:03:24 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 87) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 933) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   127   127   054    Old_age   Offline      -       112
  3 Spin_Up_Time            0x0007   173   173   024    Pre-fail  Always       -       476 (Average 465)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1552
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       76
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       76
194 Temperature_Celsius     0x0002   224   224   000    Old_age   Always       -       29 (Min/Max 22/43)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 1518 hours (63 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 41 00 00 00 00 00 Error: UNC at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 f8 88 0f bd 40 08 12:05:27.235 READ FPDMA QUEUED
60 00 08 88 11 bd 40 08 12:04:57.214 READ FPDMA QUEUED
60 00 00 88 10 bd 40 08 12:04:57.213 READ FPDMA QUEUED
60 00 f0 88 0e bd 40 08 12:04:57.212 READ FPDMA QUEUED
60 00 e8 88 0d bd 40 08 12:04:57.212 READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.





Example of a healthy drive on the same controller:
Code:
smartctl 6.6 2017-11-05 r4594 [FreeBSD 11.1-STABLE amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD80EFAX-68KNBN0
Firmware Version: 81.00A81
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Mar 30 15:03:24 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 87) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 854) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   130   130   054    Old_age   Offline      -       100
  3 Spin_Up_Time            0x0007   161   161   024    Pre-fail  Always       -       524 (Average 485)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   128   128   020    Old_age   Offline      -       18
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1552
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       12
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       78
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       78
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       38 (Min/Max 19/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 0 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
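
For a side-by-side check of the two attribute tables above, here is a minimal Python sketch that parses the rows and lists the attributes whose raw values differ. It assumes the whitespace-separated row layout that smartctl prints; `parse_attributes` and `diff_attributes` are hypothetical helper names, not part of smartmontools, and the sample rows are a small excerpt copied from the two outputs above.

```python
import re

# Matches one row of the "Vendor Specific SMART Attributes" table, e.g.
#   9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 1552
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_attributes(text):
    """Return {attribute_name: (normalized_value, threshold, raw_value_string)}."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            _id, name, value, _worst, thresh, _type, _upd, _failed, raw = m.groups()
            attrs[name] = (int(value), int(thresh), raw.strip())
    return attrs


def diff_attributes(a, b):
    """List (name, raw_a, raw_b) for attributes whose raw value differs."""
    return sorted(
        (name, a[name][2], b[name][2])
        for name in a.keys() & b.keys()
        if a[name][2] != b[name][2]
    )


# Excerpts from the two outputs above (suspect drive first, healthy drive second).
suspect = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 224 224 000 Old_age Always - 29 (Min/Max 22/43)
"""
healthy = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 171 171 000 Old_age Always - 38 (Min/Max 19/46)
"""

for name, left, right in diff_attributes(parse_attributes(suspect), parse_attributes(healthy)):
    print(f"{name}: {left} vs {right}")
```

On this excerpt only the temperature row differs, which matches the impression that the attribute tables alone do not single out the rejected drive.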
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
You haven't run any SMART tests on your drives. Until you do, you can't make an informed decision on whether the drive is bad or not.
 

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
I ran said test on my desktop using WDC DLG, which reported it as 'passed'. I also ran a full read surface scan, which it also passed. Both tests were run twice, the second time after reinserting the drive and finding it rejected again.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

No, you have run zero tests, and both disks have run zero long tests.
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
In addition to "run a SMART test", also be aware that SMART tests are actually not that predictive of disk failure. So, what could cause this:

- You haven't run a long SMART test, do that first
- If SMART still doesn't show errors, chalk it up to "drive is failing, SMART doesn't know". Not that unusual.
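
Whether a long test has actually been logged on a drive can be read from the self-test log section of the smartctl output. A minimal Python sketch, assuming log entries look like the `# 1 Short offline ...` line in the healthy drive's output above; `logged_self_tests` and `has_extended_test` are hypothetical helper names, not part of smartmontools:

```python
import re

# One entry of the self-test log, e.g.
#   # 1 Short offline Completed without error 00% 0 -
ENTRY = re.compile(
    r"^#\s*\d+\s+(Short|Extended|Conveyance|Selective)\s+(offline|captive)\s+(.*)$"
)


def logged_self_tests(text):
    """Return a list of (test_type, rest_of_line) for each logged self-test."""
    tests = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            tests.append((m.group(1), m.group(3)))
    return tests


def has_extended_test(text):
    """True if at least one extended (long) self-test appears in the log."""
    return any(kind == "Extended" for kind, _ in logged_self_tests(text))


# Self-test log sections copied from the two outputs earlier in the thread.
suspect_log = """\
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
"""
healthy_log = """\
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 0 -
"""

print(logged_self_tests(suspect_log), has_extended_test(suspect_log))
print(logged_self_tests(healthy_log), has_extended_test(healthy_log))
```

Neither log contains an extended test, which is the point being made in the replies above.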
 

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
"No you have run zero tests and both disks have run zero long tests."

Please note that the output listed above was NOT from the drive tests; it shows the SMART values and other drive properties as read on the FreeNAS box, which I assumed might help shed some light on the problem.

And as stated before, I used WDC Data Lifeguard Diagnostic on my desktop (which has a GUI and unfortunately does not provide anything copy-pastable apart from a very brief result message), where said disk passed both a 'SMART drive quick self test' and an 'extended test', which is a full surface scan.

This is the part that baffles me: how can the drive pass these tests and still cause errors serious enough for FreeNAS to throw it out? I've never had this before.

What would cause this?
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
The first disk also has a logged UNC (uncorrectable read) error. It's your data, your drives and your choice, so you do what you want.
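
The error entry in the first smartctl output shows ER = 40 and ST = 41, and those two bytes can be decoded bit by bit. A minimal Python sketch, assuming the classic ATA Error/Status register bit names (only the bits I am sure of are named; the rest are printed as hex). `decode` is a hypothetical helper, not a smartmontools function:

```python
# Bit names assumed from the classic ATA register layout; bits that are
# not listed here are reported by their hex mask instead of being guessed.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # ID not found
    0x04: "ABRT",  # command aborted
}
STATUS_BITS = {
    0x80: "BSY",   # busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error; details are in the Error register
}


def decode(value, names):
    """Return the names of the set bits in value, highest bit first."""
    out = []
    for bit in (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01):
        if value & bit:
            out.append(names.get(bit, f"bit 0x{bit:02x}"))
    return out


# Register values from "Error 1" in the suspect drive's log.
print("ER:", decode(0x40, ERROR_BITS))
print("ST:", decode(0x41, STATUS_BITS))
```

Under those assumptions ER = 40 is the UNC bit, which is a different bit from ICRC, and UDMA_CRC_Error_Count reads 0 in both attribute tables.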
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
For starters, as Yorick advised, SMART tests don't always seem to report problems appropriately. I don't know anything about WD's surface scan.

Let's start from the beginning: you say that the drive was thrown out during the burn-in phase.

What "burn-in phase" and what, if any, were the messages or other indication that the drive was rejected?

Did you run smartctl or badblocks on this drive before installing it in your system?

While the answers would be interesting as a potential guide, if this drive were mine I would put it into my FreeNAS box (assuming that I had a spare SATA port, which it sounds like you do) and run a smartctl long test on it, then badblocks, then a smartctl long test again, and see what they told me. I'd post the outputs of these tests here for a second opinion on the results, too.

I'd make a Linux boot disk with badblocks on it and run the tests in another machine if I didn't have a spare SATA port in my FreeNAS box.
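
The sequence suggested above (long SMART test, badblocks, long SMART test again, then read the report) could be written down as a short plan. A minimal Python sketch that only prints the commands and does not run them; it assumes the SMART tests are started with smartctl, that `/dev/ada1` stands in for the real device, and that a 4096-byte badblocks block size suits this drive. `burn_in_commands` is a hypothetical helper name.

```python
import shlex


def burn_in_commands(device):
    """Return the command lines for the suggested test sequence, in order."""
    return [
        # Starts an extended self-test and returns at once; the test runs in
        # the background (the output above suggests polling after ~933 min),
        # so wait for it to finish before moving on to the next step.
        ["smartctl", "-t", "long", device],
        # WARNING: -w is a destructive write test that erases the whole disk.
        # -s shows progress; -b 4096 is an assumed block size.
        ["badblocks", "-b", "4096", "-ws", device],
        # Second extended self-test after the write pass.
        ["smartctl", "-t", "long", device],
        # Full report, including the self-test log and the error log.
        ["smartctl", "-a", device],
    ]


# /dev/ada1 is a placeholder; check the device name carefully before running
# anything destructive on it.
for argv in burn_in_commands("/dev/ada1"):
    print(shlex.join(argv))
```

Printing the plan instead of executing it is deliberate: each long test takes many hours and the badblocks pass wipes the disk, so running each step by hand and checking the result in between seems the safer choice.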
 

MiG

Dabbler
Joined
Jan 6, 2017
Messages
21
Perhaps I'm using the word incorrectly :) I meant copying test data to and from it for a while (usually a month or so) to see if errors pop up. I've built RAID arrays before ZFS and use tools like WDC DLG to run full tests on the drives before I connect them. As mentioned before, whenever I had issues with drives that weren't performance related (like those extremely slow Seagate 'Archive' drives), these tools would correctly identify them as having physical issues, so, until the very issue in this thread, I assumed they were testing the drives sufficiently.

I fortunately have plenty of spare ports on the server, so I can run smartctl and badblocks over the weekend and see if anything comes out.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Sounds good.

I'd enable SMART tests on all your drives, too - "years ago" I lifted what seems to be a sensible schedule from a suggestion of past admin Cyberjock, see below:

[Image attachment: 1586516459593.png]


You need to enable SMART for each drive, start the service, and set up the email for notifications, in addition to the scheduled tasks.
 