Weird Checksum error behavior

CaedenV · Jul 27, 2017

Ok, so I am seeing an odd bit of behavior on my box

TLDR version:
When doing a scrub, the same 2 drives always give checksum errors: sometimes 2, sometimes hundreds of thousands (currently near 800K). If there were a genuine drive issue I would expect it to be more consistent, but this wide range of checksum errors has me scratching my head. Also, there are no read or write errors being reported.
But the really odd thing is that the two drives showing errors are reporting the exact same number of errors as each other... which is really weird.

Long Version:
I recently replaced the motherboard and added an 8th drive to the system (previous mobo had 7 SATA ports... which is dumb). So I backed everything up, destroyed the array, and rebuilt it as a RAIDZ2 across 8 3TB HDDs. Prior to the upgrade I had 1 drive giving errors; I replaced it about a week before the mobo upgrade and had no major issues. But after the upgrade I am consistently getting errors on ADA0 and ADA2. Everything 'seems' fine. I am not seeing massive amounts of corruption (which I would expect with hundreds of thousands of errors), and there are no major slow-downs or hiccups when in use.

But because I am getting the same errors on 2 drives it makes me wonder if perhaps it is a motherboard level issue. I am just at a loss on how to troubleshoot this issue.
Tonight I plan on moving the plugs around to see if I continue to get the errors on the same discs or same ports (assuming ADA0-7 line up with SATA 1-8... which I understand can be folly depending on the underlying hardware).... which makes me ask... how on earth do you get the system to give you the serial number?!?! I can see it on the disk list... why not use it for everything?!?! It makes it needlessly complicated when trying to troubleshoot.

Anywho; any thoughts, ideas, advice, etc would be greatly appreciated.

System info:
CPU: AMD A10 5800K
RAM: DDR3 4x8GB
HDDs: 2 old Seagates, 3 two-year-old Seagates, and 3 refurbished HGST enterprise drives, all 3TB 7200rpm
-Note: one HGST and one old Seagate are having the issue
OS: v9.10

[root@Fayth ~]# zpool status
  pool: Spira
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Jul 27 19:05:11 2017
        1.35T scanned out of 9.53T at 319M/s, 7h27m to go
        64.6G repaired, 14.18% done
config:

        NAME                                            STATE     READ WRITE CKSUM
        Spira                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/23419a6d-71ad-11e7-bd59-7085c240155c  ONLINE       0     0     0
            gptid/2498fb5c-71ad-11e7-bd59-7085c240155c  ONLINE       0     0     0
            gptid/25d5f0a1-71ad-11e7-bd59-7085c240155c  ONLINE       0     0     0
            gptid/269c5571-71ad-11e7-bd59-7085c240155c  ONLINE       0     0     0
            gptid/277f1627-71ad-11e7-bd59-7085c240155c  ONLINE       0     0     0
            gptid/28e6c346-71ad-11e7-bd59-7085c240155c  ONLINE       0     0     0
            gptid/2a27458c-71ad-11e7-bd59-7085c240155c  ONLINE       0     0 1.47M  (repairing)
            gptid/2afef4a8-71ad-11e7-bd59-7085c240155c  ONLINE       0     0 1.47M  (repairing)

errors: No known data errors

^^ Apparently millions of errors this time. Yesterday's check netted 2 errors, and the day before was ~350K.
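
Since the counts jump around so much between scrubs, it may help to pull out just the members with a non-zero CKSUM column after each run and compare them. A minimal sketch, assuming the column layout shown in the output above; cksum_errors is a made-up helper name, not a ZFS command, and the command in the comment is meant to be run on the NAS itself:

```shell
#!/bin/sh
# cksum_errors (hypothetical helper): reads `zpool status` output on stdin and
# prints "<device> <count>" for every gptid member whose CKSUM column is not 0.
cksum_errors() {
    awk '$1 ~ /^gptid\// && $5 != "0" { print $1, $5 }'
}

# Intended use on the NAS (left commented out; pool name from the output above):
#   zpool status Spira | cksum_errors
```

Against the output above, this would list only the last two gptid entries with their counts.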

nojohnny101 · Jul 27, 2017

You are taking the right steps to eliminate variables. I think the next logical step is to see if these checksum errors follow the drives (which you said you are doing). Also, if you have known-good SATA cables, make sure to use those to eliminate that variable. Just to confirm, you don't have a RAID card or HBA, correct?

If the errors do follow the drives, then you can start putting them through the tests (SMART tests).
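
On the serial number question: one way to tie the gptid labels in zpool status back to adaX names and serial numbers is roughly the following. This is only a sketch, assuming a stock FreeNAS/FreeBSD shell with glabel and smartctl available; gptid_to_disk is a helper name invented for this post, and the commands in the comments need to be run as root on the NAS itself.

```shell
#!/bin/sh
# gptid_to_disk (hypothetical helper): reads `glabel status` output on stdin
# and prints the disk behind the given gptid label, e.g. ada6 for ada6p2.
gptid_to_disk() {
    awk -v label="$1" '$1 == label { sub(/p[0-9]+$/, "", $3); print $3 }'
}

# Intended use on the NAS (left commented out; needs root and real disks):
#   disk=$(glabel status | gptid_to_disk gptid/2a27458c-71ad-11e7-bd59-7085c240155c)
#   smartctl -i "/dev/$disk" | grep -i 'serial number'   # serial to match the drive label
#   smartctl -t long "/dev/$disk"                        # start a long SMART self-test
#   smartctl -a "/dev/$disk"                             # read the results once it finishes
```

With the serial in hand you can label the physical drives before moving cables, so it stays clear whether the errors follow the disk or the port.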

NZ_JJ · Jul 27, 2017

I'd also run memory & CPU tests; something may not be seated correctly after the motherboard swap.

Stux · Jul 27, 2017

You should wait for the two drives to repair first... if possible. If you move SATA cables around... you could end up with another pair of faulty disks... and you only have dual-drive redundancy.
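
A rough sketch of that sequence, assuming the pool name from the first post: wait for the running scrub to finish, note the counters, clear them, and scrub again before touching any cables. scrub_running is a made-up helper, and the zpool commands are left as comments since they only make sense on the NAS itself.

```shell
#!/bin/sh
# scrub_running (hypothetical helper): exits 0 if the `zpool status` text on
# stdin reports a scrub that is still in progress.
scrub_running() {
    grep -q 'scrub in progress'
}

# Intended use on the NAS (left commented out):
#   while zpool status Spira | scrub_running; do sleep 600; done
#   zpool status -v Spira   # note which devices show CKSUM errors
#   zpool clear Spira       # reset the error counters
#   zpool scrub Spira       # fresh scrub to see whether the errors come back
```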

CaedenV · Jul 28, 2017

Stux said:
You should wait for the two drives to repair first... if possible. If you move SATA cables around... you could end up with another pair of faulty disks... and you only have dual-drive redundancy.

Ya, I am waiting a bit on that.
So this morning it looks like it cleaned up nicely. I didn't have time to mess with the system before work, so I set it to do another scrub during the day and we will see what comes of it when I get home.
*fingers crossed* really hoping it comes back clean
Regardless, I am going to document the drives with the errors, shut the system down, move some stuff around, and see how things go.

rs225 · Jul 28, 2017

I would suspect the motherboard. I also suspect these errors are not actually on the disks; they are corruption happening in real-time during the read/scrub in the disk controller or motherboard.

CaedenV · Jul 28, 2017

nojohnny101 said:
You are taking the right steps...
If the errors do follow the drives, then you can start putting them through the tests (smart tests).

Correct, I am running in AHCI mode using the built-in mobo controller. Nothing fancy here :D

CaedenV · Jul 28, 2017

Ok, long day but finally got home.
Scrub completed after ~6 hours with 0 errors this time.

@rs225, I am with you. I am thinking more and more that the issue is somewhere other than the disks. To go from 0, to hundreds, to 0, to millions, to 0 again is just too inconsistent to be a drive-level problem. This was originally a 'spare parts learning project' that has gotten rather out of hand over the last 3-4 years, and as such many of the parts are... shall we say... "less than ideal". I should have some birthday money coming soon, so I think I am going to invest in a new PSU (currently half of my HDDs are on Molex to SATA adapters off a 12 year old PSU), and some new locking SATA cables (current cables range in quality and age).

In the meantime I am going to do some more extensive testing of the mobo and RAM and see what comes of it.

I know Memtest86+ works well as a RAM test... what about an overall mobo test?

rs225 · Jul 29, 2017

I don't think there is a good mobo test; it is trial by fire. Also see if your two problem drives are on the same power rail.

CaedenV · Jul 29, 2017

rs225 said:
I don't think there is a good mobo test; it is trial by fire. Also see if your two problem drives are on the same power rail.

Yep! I think you are on to something. I have 4 HDDs on that particular 'string' (not sure how many strings are on a rail... it is quite old), but the 2 drives having issues are on the same Molex to SATA adapter... so ya. I am thinking there is a new PSU in my future next week. No more adapters, just straight connections.
On the plus side, I have been writing to it like crazy last night and today to make sure things are working before it goes back into 'production', and so far things seem stable again. It is such a huge relief!

Thanks for the help!
