Failing hard drive / bad cables / bad motherboard?

Status
Not open for further replies.

walker

Cadet
Joined
May 15, 2014
Messages
4
After months of problem-free operation, I'm getting the following error messages on boot:

CAM status: ATA Status Error
ATA status: 51 (DRDY SERV ERR), error: 84 (ICRC ABRT )
RES: 51 84 a8 91 3f 00 00 00 00 00 00
Retrying command

(repeated over and over, except that the address in the RES line changes each time)
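For anyone reading the log above: the two hex bytes are bit fields, and the names in parentheses are just the set bits spelled out. Below is a minimal Python sketch of that decoding. The bit names are taken from what FreeBSD's CAM layer appears to print; only DRDY, SERV, ERR, ICRC and ABRT are confirmed by the log in this thread, so treat the others as assumptions. `decode_bits` and `decode_ata` are made-up helper names. ICRC stands for an interface CRC error, which is reported for the transfer between drive and controller rather than for the platters.

```python
# Bit names, most significant bit first. Only the names that appear in the
# log above are confirmed by this thread; the rest are assumptions.
ATA_STATUS_BITS = ["BSY", "DRDY", "DF", "SERV", "DRQ", "CORR", "IDX", "ERR"]
ATA_ERROR_BITS = ["ICRC", "UNC", "MC", "IDNF", "MCR", "ABRT", "NM", "ILI"]


def decode_bits(value, names):
    """Return the names of the bits set in an 8-bit register value."""
    return [name for i, name in enumerate(names) if value & (0x80 >> i)]


def decode_ata(status_hex, error_hex):
    """Decode the 'ATA status' and 'error' hex bytes from a CAM log line."""
    status = decode_bits(int(status_hex, 16), ATA_STATUS_BITS)
    error = decode_bits(int(error_hex, 16), ATA_ERROR_BITS)
    return status, error


if __name__ == "__main__":
    status, error = decode_ata("51", "84")
    print("status:", " ".join(status))
    print("error: ", " ".join(error))
```

Feeding it the values from the log (`51` and `84`) gives the same flag names the kernel printed, which is a quick sanity check that the table is lined up correctly.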

I occasionally had similar issues when this box ran Linux. The issues would come and go, and multiple drives were sometimes affected, although SMART was always clean (and still is now). After doing a bit of searching, I came across suggestions that it might not even be the hard drive(s) that were the problem, but rather faulty SATA cables, or "noise" (interference) from the power supply, or a bad motherboard, or...

Anyway, what can I do to track down the source of the error? I'm on FreeNAS-9.1.1-RELEASE-x64 (a752d35), in case that makes a difference. I'm not worried about data loss, because I've been exporting snapshots to an external drive regularly, and if this goes down for a week or two while I get parts, it won't be a big deal. But I would like to fix this.

zpool status -v outputs the following:
Code:
  pool: Primary
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 121M in 0h6m with 0 errors on Thu May 15 21:06:08 2014
config:
 
NAME                                            STATE     READ WRITE CKSUM
Primary                                         ONLINE       0     0     0
  raidz1-0                                      ONLINE       0     0     0
    gptid/125bca84-5550-11e3-b667-485b39a7a747  ONLINE       0     0     0
    gptid/12d05cf3-5550-11e3-b667-485b39a7a747  ONLINE       0     0     0
    gptid/133f9e29-5550-11e3-b667-485b39a7a747  ONLINE       0     0     0
    gptid/139fe25f-5550-11e3-b667-485b39a7a747  ONLINE       0     0     3
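
Since the counters in this table can be easy to miss, here is a rough Python sketch that picks out the rows with a non-zero READ, WRITE or CKSUM count from saved `zpool status -v` text. It assumes the column layout shown above and plain integer counters (abbreviated values such as `1.2K` are skipped); `devices_with_errors` is a hypothetical helper name, not part of any ZFS tooling.

```python
SAMPLE = """\
config:

NAME                                            STATE     READ WRITE CKSUM
Primary                                         ONLINE       0     0     0
  raidz1-0                                      ONLINE       0     0     0
    gptid/125bca84-5550-11e3-b667-485b39a7a747  ONLINE       0     0     0
    gptid/139fe25f-5550-11e3-b667-485b39a7a747  ONLINE       0     0     3
"""


def devices_with_errors(zpool_status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    in_config = False
    for line in zpool_status_text.splitlines():
        fields = line.split()
        # The counter table starts right after the NAME/STATE header row.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        name, counters = fields[0], fields[2:5]
        # Skip rows whose counters are not plain integers.
        if all(c.isdigit() for c in counters):
            read, write, cksum = (int(c) for c in counters)
            if read or write or cksum:
                flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for name, counts in devices_with_errors(SAMPLE).items():
        print(name, counts)
```

Running something like this on a schedule and keeping the output would also preserve the counts across reboots.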
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
We could use more info about your system. Without that, the best advice we can give is to replace the SATA cables if you suspect them, run S.M.A.R.T. tests ASAP and check for any loose connections.
 

walker

Cadet
Joined
May 15, 2014
Messages
4
Thanks! I guess I'll start with the cables...SMART tests come back clean, which is part of why I'm not so sure it's the drive. (Plus the intermittency.) I'm about to be traveling for a while, but I'll see about doing more troubleshooting when I get back.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Did you run a long test?
 

walker

Cadet
Joined
May 15, 2014
Messages
4
Yeah, I just ran a long self test and it came up clean; past runs when I was seeing issues also came up clean. (Also, when I rebooted my box before running the test, the errors I was seeing in `zpool status -v` went away too...the intermittency is frustrating.)
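
A clean self-test mostly says the drive can read its own media; it does not necessarily say anything about the link to the controller. Many drives report an interface CRC counter as SMART attribute 199 (often named `UDMA_CRC_Error_Count`, though naming varies by vendor), and a raw value that keeps climbing would fit the cable theory. The sketch below is Python, assumes the usual ten-column attribute table printed by `smartctl -A`, and uses made-up sample numbers; `crc_error_count` is a hypothetical helper name.

```python
# Hypothetical sample of `smartctl -A` output; the values are invented.
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       17
"""


def crc_error_count(smartctl_a_text):
    """Return the raw value of SMART attribute 199, or None if it is absent."""
    for line in smartctl_a_text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the first is the attribute ID.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


if __name__ == "__main__":
    print("CRC error count:", crc_error_count(SAMPLE_SMART))
```

Comparing the value before and after swapping a cable is more telling than any single reading, since the counter normally never resets.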
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
What hardware are you using, specifically?
 

Z300M

Guru
Joined
Sep 9, 2011
Messages
882
Check the power connections, not just the data cables.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,525
The error counts in zpool status clear when you reboot the box (or unmount the pool). So that's totally normal.
 

walker

Cadet
Joined
May 15, 2014
Messages
4
Ah, okay, that's good to know that the error counts clearing is normal. The error messages I was seeing earlier didn't show back up on screen, though. Good call on the power cables; I can check those too. (I've definitely unplugged/replugged them in the past, but perhaps they should be replaced or something, along with the SATA cables?)

As far as hardware, here's what I'm using:

Motherboard: ASUS M4A785TD-V EVO AM3 (AMD)
Power supply: OCZ Fatal1ty 550W
RAM: 12 GB ECC RAM
Hard drives:
- 3 are Hitachi HDS721050CLA362 (0F10381) 500GB 7200 RPM 16MB Cache SATA
- unfortunately, I don't remember what the fourth is, nor do I remember if the "failing" one is a Hitachi or if it's the fourth drive (and I just left for an internship, so I won't be able to check for a couple of months)
- on the other hand, that may not matter too much, since I've had multiple drives exhibit this behavior...I'm not convinced it's really the drives
SATA cables: appear to be shielded (silver braid is visible through the clear plastic exterior), don't remember the brand, probably pretty cheap
HD power cables: don't remember right now, unfortunately

Is there any other hardware that might affect things?
 