Array getting Errors again

dashtesla · Feb 16, 2023

So: new server, new and different hard drives (even a different brand), and I'm starting to see errors again:

Some images to show how everything is going so far. I don't want to just SSH in and manually clear the errors. I don't think two new hard drives would all of a sudden start failing and still pass SMART tests fine, so I'm not sure what's going on here. Since this is the second time I've seen these errors on TrueNAS SCALE, there's probably a reason for them.

Hardware:
Dell R510
2x Intel Xeon X5675
48GB DDR3 Registered ECC 1600 MHz
9x WD Red/White Label 12TB (pool-a-z2)
3x WD Red Pro 10TB (pool-b-z1)
2x HP Seagate 300GB SAS 10k (boot)
1x 500GB NVMe + 1x 250GB NVMe (both for ARC L2 cache)
LSI HBA/Dell H310 IT Mode

sretalla · Feb 17, 2023

I would suggest looking into the detail in both:

smartctl -a /dev/sdg

and

smartctl -a /dev/sdh

I don't think that

dashtesla said:
still pass SMART tests fine

means what you think it means.

dashtesla · Feb 17, 2023

sretalla said:
I would suggest looking into the detail in both:

smartctl -a /dev/sdg

and

smartctl -a /dev/sdh

I don't think that

means what you think it means.

root@sv7[~]# smartcttl -a /dev/sdg
zsh: command not found: smartcttl
root@sv7[~]# smartctl -a /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.79+truenas] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Ultrastar (He10/12)
Device Model: WDC WD120EDAZ-11F3RA0
Serial Number: 5PK5NW2B
LU WWN Device Id: 5 000cca 291ecdab9
Firmware Version: 81.00A81
User Capacity: 12,000,138,625,024 bytes [12.0 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Feb 17 18:17:41 2023 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 241) Self-test routine in progress...
10% of test remaining.
Total time to complete Offline
data collection: ( 87) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: (1233) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   128   128   054    Old_age   Offline      -       108
  3 Spin_Up_Time            0x0007   222   222   024    Pre-fail  Always       -       322 (Average 272)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       97
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   067    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   140   140   020    Old_age   Offline      -       15
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       13351
 10 Spin_Retry_Count        0x0012   100   100   060    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       97
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       646
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       646
194 Temperature_Celsius     0x0002   175   175   000    Old_age   Always       -       37 (Min/Max 22/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       292

SMART Error Log Version: 1
ATA Error Count: 292 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 292 occurred at disk power-on lifetime: 13277 hours (553 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 70 00 f0 2b c4 40 00 30d+02:48:05.226 READ FPDMA QUEUED
60 48 00 a8 2b c4 40 00 30d+02:48:05.226 READ FPDMA QUEUED
60 28 10 80 2b c4 40 00 30d+02:48:05.225 READ FPDMA QUEUED
60 20 08 60 2b c4 40 00 30d+02:48:05.225 READ FPDMA QUEUED
60 28 00 10 2b c4 40 00 30d+02:48:05.225 READ FPDMA QUEUED

Error 291 occurred at disk power-on lifetime: 13272 hours (553 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 20 08 a0 e7 a1 40 00 29d+21:49:16.434 READ FPDMA QUEUED
60 28 00 78 e7 a1 40 00 29d+21:49:16.434 READ FPDMA QUEUED
60 20 00 30 e7 a1 40 00 29d+21:49:16.434 READ FPDMA QUEUED
60 28 00 08 e7 a1 40 00 29d+21:49:16.433 READ FPDMA QUEUED
60 20 00 c0 e6 a1 40 00 29d+21:49:16.426 READ FPDMA QUEUED

Error 290 occurred at disk power-on lifetime: 6146 hours (256 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 10 80 03 01 40 00 00:05:40.782 READ FPDMA QUEUED
60 80 18 00 04 01 40 00 00:05:40.780 READ FPDMA QUEUED
60 80 08 00 03 01 40 00 00:05:40.780 READ FPDMA QUEUED
60 20 00 e0 02 01 40 00 00:05:40.780 READ FPDMA QUEUED
60 60 18 80 02 01 40 00 00:05:40.778 READ FPDMA QUEUED

Error 289 occurred at disk power-on lifetime: 6146 hours (256 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 00 80 fe 00 40 00 00:05:39.775 READ FPDMA QUEUED
60 80 08 00 ff 00 40 00 00:05:39.774 READ FPDMA QUEUED
60 80 00 00 fe 00 40 00 00:05:39.774 READ FPDMA QUEUED
60 80 00 80 fd 00 40 00 00:05:39.773 READ FPDMA QUEUED
60 80 00 00 fd 00 40 00 00:05:39.772 READ FPDMA QUEUED

Error 288 occurred at disk power-on lifetime: 6146 hours (256 days + 2 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 08 00 fa 00 40 00 00:05:38.757 READ FPDMA QUEUED
60 80 00 80 f9 00 40 00 00:05:38.756 READ FPDMA QUEUED
60 80 00 00 f9 00 40 00 00:05:38.756 READ FPDMA QUEUED
60 80 08 80 f8 00 40 00 00:05:38.754 READ FPDMA QUEUED
60 80 00 00 f8 00 40 00 00:05:38.754 READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 13330 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

dashtesla · Feb 17, 2023

sretalla said:
I would suggest looking into the detail in both:

smartctl -a /dev/sdg

and

smartctl -a /dev/sdh

I don't think that

means what you think it means.

Above is the one for /dev/sdg. I'm not sure what to take from it other than that it says PASSED. I see some errors, but they're just a bunch of hex letters and numbers that aren't really human-readable, and if SMART says the drive is fine then it should be fine. Having enough bad blocks would trigger a SMART warning anyway, and again, it's a new drive. I do have two spares, but the last time we did this in another thread it became a game of whack-a-mole: I replace a drive, then another gets errors, then another, then another, and before you know it the entire array is corrupted.

But since I have 2 spare drives here, I'm going to go ahead and replace the one with 291. Even if it was a connection problem with the drive in the past, plus 2 more recent ones, it should be fine with a new drive. If it comes back, I'm going to start looking into the backplane, especially with the drives being SATA rather than SAS/converted.
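One way to make the "in the past plus 2 more recent" observation concrete is to compare each logged error's power-on hour against the drive's current Power_On_Hours. The sketch below is only an illustration in Python, using the five error headers and the Power_On_Hours raw value copied from the /dev/sdg output above; the function name `error_ages` is made up for this example.

```python
import re

# Raw value of attribute 9 (Power_On_Hours) in the /dev/sdg output above.
POWER_ON_HOURS = 13351

# Error headers copied from the SMART error log above.
ERROR_LINES = """\
Error 292 occurred at disk power-on lifetime: 13277 hours (553 days + 5 hours)
Error 291 occurred at disk power-on lifetime: 13272 hours (553 days + 0 hours)
Error 290 occurred at disk power-on lifetime: 6146 hours (256 days + 2 hours)
Error 289 occurred at disk power-on lifetime: 6146 hours (256 days + 2 hours)
Error 288 occurred at disk power-on lifetime: 6146 hours (256 days + 2 hours)
"""

PATTERN = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")


def error_ages(text, power_on_hours):
    """Return (error number, powered-on hours since that error) pairs."""
    return [(int(num), power_on_hours - int(hours))
            for num, hours in PATTERN.findall(text)]


for number, age in error_ages(ERROR_LINES, POWER_ON_HOURS):
    print(f"Error {number}: {age} powered-on hours before this report")
```

For this drive, errors 291 and 292 fall 79 and 74 powered-on hours before the report, while 288 to 290 are 7205 hours older. The log keeps only the five most recent entries, so the other 287 counted errors cannot be dated this way.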

WI_Hedgehog · Feb 17, 2023

"PASSED" means "less than the number of faults the OEM determined to be a FAIL." If you look into it, it's basically a 0/1 flag indicating the drive is usable (pass) or should be replaced ASAP (fail).

However, understanding the different values, and whether and how often they change, can give valuable insight.
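As a rough illustration of that distinction, the sketch below (Python, not part of the original posts) parses a few attribute rows copied from the /dev/sdg output and separates two questions: has any normalized VALUE dropped to its THRESH, and are any raw counters non-zero? The watch list of IDs is an assumption chosen for this example, not an official list, and the threshold check only approximates how the drive arrives at its overall verdict.

```python
# Rows copied from the smartctl output posted earlier in this thread.
SAMPLE = """\
  1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 292
"""

# Assumption: attribute IDs picked for illustration only.
WATCH_IDS = (5, 197, 198, 199)


def parse_attributes(text):
    """Return {id: fields} for every SMART attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have 10+ columns, a numeric ID and a hex FLAG.
        if (len(fields) < 10 or not fields[0].isdigit()
                or not fields[2].startswith("0x")):
            continue
        raw = fields[9]
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "raw": int(raw) if raw.isdigit() else None,
        }
    return attrs


def crossed_threshold(attrs):
    """Names of attributes whose normalized value is at or below THRESH."""
    return [a["name"] for a in attrs.values()
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]


def nonzero_watched(attrs):
    """Raw counters from the watch list that are above zero."""
    return {attrs[i]["name"]: attrs[i]["raw"]
            for i in WATCH_IDS if i in attrs and attrs[i]["raw"]}


attrs = parse_attributes(SAMPLE)
print("At or below threshold:", crossed_threshold(attrs))
print("Non-zero raw counters:", nonzero_watched(attrs))
```

On these rows nothing has crossed a threshold, which is consistent with PASSED, yet UDMA_CRC_Error_Count still shows a raw count of 292.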

dashtesla · Feb 17, 2023

Here's what happened after I replaced the drive: every drive now has errors, and one has faulted, according to this image.

I'm not sure if this is a bug or something else has gone wrong with the server, so I'm hoping someone can shed some light, because it doesn't make any sense to me. It's exactly what happened before in another unrelated server with unrelated parts, and that much of a coincidence is just not likely. As for the data, it's perfectly safe: I have 2 copies of it on LTO, so I'm not concerned about data loss. I just want to fix it properly and figure out what's triggering it.

This is my main production server, but I'll leave it like this for today (a few hours) so people here can see the thread and give some suggestions, and I can test things, paste commands, etc. Otherwise, my only plan is to once again start over using some different parts. Since I'm out of different hard drives I can use, I'd just go for my other Dell R510 and a different SAS controller (my main suspect other than a SAS expander), plus there's the fact that the hard drives are SATA, not SAS, possibly leading to connection issues. If it's not hardware, then the only other possibility is software, which seems the most likely case since I had the same problem with another server, though it could be related to a poor connection or other issues. Who knows.

sretalla · Feb 18, 2023

dashtesla said:
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 292

For sure that is the disk agreeing with ZFS that you're not communicating properly between the disk and the controller in some way.

dashtesla said:
Error 292 occurred at disk power-on lifetime: 13277 hours (553 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 41 00 00 00 00 00 Error: ICRC, ABRT at LBA = 0x00000000 = 0

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 70 00 f0 2b c4 40 00 30d+02:48:05.226 READ FPDMA QUEUED
60 48 00 a8 2b c4 40 00 30d+02:48:05.226 READ FPDMA QUEUED
60 28 10 80 2b c4 40 00 30d+02:48:05.225 READ FPDMA QUEUED
60 20 08 60 2b c4 40 00 30d+02:48:05.225 READ FPDMA QUEUED
60 28 00 10 2b c4 40 00 30d+02:48:05.225 READ FPDMA QUEUED

The last error (same as the 291 before that, I would guess) is the request to read from the disk at several block addresses, but the command was aborted due to a CRC error.

If you don't resolve that issue (cabling/backplane/controller), you're eventually going to lose (at least access to) your pool with too many disks unable to provide data on request.
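For anyone who finds the register dump hard to read, here is a small Python sketch of how the ER and ST bytes in the quoted row break down into the flag names smartctl prints. The bit names listed are the standard ATA ones I'm confident of; any other bit is reported by its hex value rather than guessed at, and the function names are made up for this example.

```python
# ATA Error register bits (subset; unlisted bits are printed as hex).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
# ATA Status register bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def decode(value, names):
    """Name each set bit of an 8-bit register, highest bit first."""
    flags = []
    for bit in (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01):
        if value & bit:
            flags.append(names.get(bit, f"0x{bit:02x}"))
    return flags


def decode_register_row(row):
    """Decode the ER and ST columns of a row like '84 41 00 00 00 00 00'."""
    er, st = (int(field, 16) for field in row.split()[:2])
    return decode(er, ERROR_BITS), decode(st, STATUS_BITS)


error_flags, status_flags = decode_register_row("84 41 00 00 00 00 00")
print("ER:", error_flags)
print("ST:", status_flags)
```

ER 0x84 has bits 0x80 and 0x04 set, giving ICRC (interface CRC error) and ABRT (command aborted), which matches the "Error: ICRC, ABRT" text in the log. ST 0x41 is DRDY plus ERR.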

Jailer · Feb 18, 2023

That disk you listed in post #3 has had one short SMART test run on it in its lifetime. You need to run a long test and report back with the results. You should also have regular long and short tests scheduled to keep an eye on disk health.
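As a rough planning aid, the Python sketch below turns the "Extended self-test routine recommended polling time" from the /dev/sdg output into hours and minutes, and builds (without running) the `smartctl -t long` invocations for the two devices named earlier in the thread. The device list and helper names are only illustrative; adjust them to whatever your system reports.

```python
# "Extended self-test routine recommended polling time" from the output above.
EXTENDED_POLL_MINUTES = 1233


def hours_minutes(minutes):
    """Split a minute count into (hours, minutes)."""
    return divmod(minutes, 60)


def long_test_commands(devices):
    """Build, but do not execute, one smartctl long-test command per device."""
    return [["smartctl", "-t", "long", dev] for dev in devices]


hours, mins = hours_minutes(EXTENDED_POLL_MINUTES)
print(f"Expect roughly {hours} h {mins} min per drive")
for cmd in long_test_commands(["/dev/sdg", "/dev/sdh"]):
    print(" ".join(cmd))
```

Once a test has finished, `smartctl -l selftest /dev/sdg` shows the self-test log, the same section that currently lists only the single short test.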

dashtesla · Feb 19, 2023

sretalla said:
For sure that is the disk agreeing with ZFS that you're not communicating properly between the disk and the controller in some way.

The last error (same as the 291 before that, I would guess) is the request to read from the disk at several block addresses, but the command was aborted due to a CRC error.

If you don't resolve that issue (cabling/backplane/controller), you're eventually going to lose (at least access to) your pool with too many disks unable to provide data on request.

I have a second Dell server I can try, so I'm moving everything to it. The thing is, the first server I ever used is completely unrelated; again, it's a third server that had similar issues. Sure, it's possible I happened to have 2 failing/bad SAS HBAs, but how likely is that? And don't forget: I replace one drive and ALL of them instantly get errors they didn't have before. Now, again, that may be because resilvering puts a heavy load on the array to repair the missing drive, but I would like to at least have a more concrete answer as to what's causing this on completely unrelated servers. The only thing in common is that all of them used a Dell H310 in IT mode, but that's literally the most common HBA people use for TrueNAS, so support shouldn't be a problem, and it has worked well for me for years as well, though I used to be on TrueNAS CORE.

Since no one requested any further tests, I'm going to destroy the original array and load from tapes, as that's easier for me than trying to repair this array. I'll also load all the hard drives in CrystalDiskInfo on Windows, one by one, to double-check they're all good. I'm very disappointed with the servers I've built lately, especially with not being able to figure out what exactly the culprit is here. Replacing all of it is really not a good solution when trying to diagnose problems.

dashtesla · Feb 19, 2023

So I moved everything to another Dell R510 with a different controller (same model though), motherboard, backplane, and cables. I literally liquid-cleaned the entire server, not a speck of dust left: I removed the motherboard and waited overnight for the BGAs to dry after it air-dried, cleaned all the cables and heatsinks, and put new thermal paste on everything. I only kept the hard drives; I destroyed the old faulty pool and started over. Loading the data back from tapes will take a while, with a slow LTO-6 drive over the network, so hopefully it will be fine from now on. If not, I'll drop by again to post updates.

I had avoided this second server because I was having issues with the RAM not getting detected, but the liquid clean sorted everything out, so it was just corrosion on this second server. You never know; hopefully it was just corrosion of some contacts somewhere in the first server, or maybe a bad H310/cables/expander. We'll never know.

WI_Hedgehog · Feb 20, 2023

You bring up a great question, what is the Relative Humidity in the server environment?

Google (if I remember correctly) found high humidity can be more detrimental to HDD lifespan than high heat.

Also: Paper: Environmental Conditions and Disk Reliability in Free-Cooled Datacenters

dashtesla · Feb 20, 2023

WI_Hedgehog said:
You bring up a great question, what is the Relative Humidity in the server environment?

Google (if I remember correctly) found high humidity can be more detrimental to HDD lifespan than high heat.

Also: Paper: Environmental Conditions and Disk Reliability in Free-Cooled Datacenters

It's on the drier side, maybe 35-45%. I'm not 100% sure, but I have hygrometers not too far from them. It's also a little warm: I have at least 15 kW of power going through a room without adequate ventilation for it. Temps are around 25-30 °C; outside it's winter, so -10 °C to +10 °C (Netherlands).
