Hardware Error - RAM?

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Hello,

I noticed in shell that I'm plagued with the following:

root@truenas[~]# 2022 Dec 22 10:34:24 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:34:24 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 10:34:24 truenas [Hardware Error]: Error Addr: 0x00000000b0e2f700
2022 Dec 22 10:34:24 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 10:34:24 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:34:24 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:34:24 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:34:24 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
2022 Dec 22 10:34:25 truenas [Hardware Error]: Error Addr: 0x00000007df236500
2022 Dec 22 10:34:25 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400200
2022 Dec 22 10:34:25 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:34:25 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:39:36 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:39:36 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 10:39:36 truenas [Hardware Error]: Error Addr: 0x00000000762f0cc0
2022 Dec 22 10:39:36 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 10:39:36 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:39:36 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:39:36 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:39:36 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 10:39:36 truenas [Hardware Error]: Error Addr: 0x0000000092540b40
2022 Dec 22 10:39:36 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x0000c6040a400200
2022 Dec 22 10:39:36 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:39:36 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:44:47 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:44:47 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 10:44:47 truenas [Hardware Error]: Error Addr: 0x000000006f38a700
2022 Dec 22 10:44:47 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 10:44:47 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:44:47 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:44:47 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:44:47 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 10:44:47 truenas [Hardware Error]: Error Addr: 0x0000000064ab7a00
2022 Dec 22 10:44:47 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400201
2022 Dec 22 10:44:47 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:44:47 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:49:58 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:49:58 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 10:49:58 truenas [Hardware Error]: Error Addr: 0x00000001311b1bc0
2022 Dec 22 10:49:58 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 10:49:59 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:49:59 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:49:59 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:49:59 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
2022 Dec 22 10:49:59 truenas [Hardware Error]: Error Addr: 0x00000000a066ec40
2022 Dec 22 10:49:59 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400200
2022 Dec 22 10:49:59 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:50:00 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:55:09 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:55:09 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 10:55:09 truenas [Hardware Error]: Error Addr: 0x0000000116c2b540
2022 Dec 22 10:55:10 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 10:55:10 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:55:10 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 10:55:10 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 10:55:10 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
2022 Dec 22 10:55:10 truenas [Hardware Error]: Error Addr: 0x00000007df1787e0
2022 Dec 22 10:55:10 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400201
2022 Dec 22 10:55:10 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 10:55:10 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
root@truenas[~]# 2022 Dec 22 11:00:21 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:00:21 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:00:21 truenas [Hardware Error]: Error Addr: 0x0000000196adeb00
2022 Dec 22 11:00:21 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 11:00:21 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:00:21 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:05:32 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:05:32 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:05:32 truenas [Hardware Error]: Error Addr: 0x00000001198d8ac0
2022 Dec 22 11:05:32 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 11:05:32 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:05:32 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:05:32 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:05:32 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:05:32 truenas [Hardware Error]: Error Addr: 0x0000000092541400
2022 Dec 22 11:05:32 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400200
2022 Dec 22 11:05:32 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:05:32 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
root@truenas[~]# 2022 Dec 22 11:10:43 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:10:43 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:10:43 truenas [Hardware Error]: Error Addr: 0x00000000bc722f80
2022 Dec 22 11:10:43 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 11:10:43 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:10:44 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:10:44 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:10:44 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:10:44 truenas [Hardware Error]: Error Addr: 0x000000007f671840
2022 Dec 22 11:10:44 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400201
2022 Dec 22 11:10:44 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:10:44 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
root@truenas[~]# 2022 Dec 22 11:15:55 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:15:55 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:15:55 truenas [Hardware Error]: Error Addr: 0x0000000117cc1ac0
2022 Dec 22 11:15:55 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 11:15:55 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:15:55 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:15:55 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:15:55 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:15:55 truenas [Hardware Error]: Error Addr: 0x0000000692c77340
2022 Dec 22 11:15:55 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400201
2022 Dec 22 11:15:55 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:15:55 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:21:06 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:21:06 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:21:06 truenas [Hardware Error]: Error Addr: 0x0000000201ea0ac0
2022 Dec 22 11:21:06 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 11:21:06 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:21:06 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:21:06 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:21:06 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
2022 Dec 22 11:21:06 truenas [Hardware Error]: Error Addr: 0x00000002172f2680
2022 Dec 22 11:21:06 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400201
2022 Dec 22 11:21:06 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:21:06 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
root@truenas[~]# 2022 Dec 22 11:26:17 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:26:17 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:26:17 truenas [Hardware Error]: Error Addr: 0x00000007df218300
2022 Dec 22 11:26:17 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 11:26:17 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:26:17 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:26:18 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:26:18 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
2022 Dec 22 11:26:18 truenas [Hardware Error]: Error Addr: 0x00000000af3fae40
2022 Dec 22 11:26:18 truenas [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000089010a400201
2022 Dec 22 11:26:18 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:26:18 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
2022 Dec 22 11:31:29 truenas [Hardware Error]: Corrected error, no action required.
2022 Dec 22 11:31:29 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
2022 Dec 22 11:31:29 truenas [Hardware Error]: Error Addr: 0x00000007ff9d03c0
2022 Dec 22 11:31:29 truenas [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x00000a400a400101
2022 Dec 22 11:31:29 truenas [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
2022 Dec 22 11:31:29 truenas [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

What does this mean? Bad stick of RAM?

TIA,
Zain
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
Those are really consistent errors, roughly every 5 minutes 11 seconds (the gaps between bursts are all 5:11 or 5:12).
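
The spacing can be checked directly from the posted log. Below is a minimal sketch that takes the timestamp of the first line of each MC15 burst (copied by hand from the log above) and prints the gap between consecutive bursts. The function name `burst_intervals` is only illustrative, and `%b` assumes an English locale for the month name.

```python
from datetime import datetime

# First timestamp of each MC15 burst, copied from the log in the opening post.
BURST_STARTS = [
    "2022 Dec 22 10:34:24",
    "2022 Dec 22 10:39:36",
    "2022 Dec 22 10:44:47",
    "2022 Dec 22 10:49:58",
    "2022 Dec 22 10:55:09",
    "2022 Dec 22 11:00:21",
    "2022 Dec 22 11:05:32",
    "2022 Dec 22 11:10:43",
    "2022 Dec 22 11:15:55",
    "2022 Dec 22 11:21:06",
    "2022 Dec 22 11:26:17",
    "2022 Dec 22 11:31:29",
]


def burst_intervals(stamps):
    """Return the gap in whole seconds between consecutive timestamps."""
    times = [datetime.strptime(s, "%Y %b %d %H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in burst_intervals(BURST_STARTS):
        print(f"{gap // 60}m {gap % 60}s")
```

A cadence this regular may say more about how often the kernel polls for corrected errors than about how often the errors actually occur. That is an assumption on my part, not something the log proves.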
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
I just re-seated all of the RAM. We'll see if that clears things up.

Edit: nope, they're still present.

Can anyone confirm this to be a RAM issue? I'd like to RMA it, if so.
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
Zain said:
"I just re-seated all of the RAM. We'll see if that clears things up.

Edit: nope, they're still present.

Can anyone confirm this to be a RAM issue? I'd like to RMA it, if so."
No one here can say for certain since we have no more to go on than the partial log you posted.
I suppose you could run memtest from a USB stick and see what it says.

If you're running RAM that is overclocked (with preset XMP/DOCP or manually) it would be good to return to stock settings and see if the errors still happen.
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Thanks for the feedback.

I'm not overclocking anything on this system (prefer the reliability over the performance boost, plus the boost isn't required for my use).

Where might I find historical logs?

Thanks
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
I imagine the errors go into /var/log/messages. However, that would only tell you what you already know.
A second opinion from memtest86 would be more useful. Either memtest86 or memtest86+ would probably work fine.

It would be good if you can deduce which module is causing issues. Try it in different slots to make sure that it's really the RAM stick and not the slot itself erroring out.

Don't be surprised if you need to provide some proof when you RMA it.
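
If you do end up needing evidence for an RMA, the relevant lines can be pulled out of the log file. This is only a sketch: it assumes the errors really are written to /var/log/messages as guessed above, and the helper name `hardware_error_lines` is made up for illustration.

```python
from pathlib import Path


def hardware_error_lines(log_path):
    """Return every line of the given log file that contains '[Hardware Error]'."""
    text = Path(log_path).read_text(errors="replace")
    return [line for line in text.splitlines() if "[Hardware Error]" in line]


if __name__ == "__main__":
    # Assumed location; adjust if your system logs elsewhere.
    path = Path("/var/log/messages")
    if path.exists():
        for line in hardware_error_lines(path):
            print(line)
    else:
        print(f"{path} not found")
```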
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Yeah, I've played the swap game already. I'm not able to pinpoint a specific module. It happens with different modules in different slots.

Memtest86+ is running at the moment; I assume it'll take a while. Will Memtest be able to find an issue when the RAM corrects the errors anyway?
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
Zain said:
"Yeah, I've played the swap game already. I'm not able to pinpoint a specific module. It happens with different modules in different slots.

Memtest86+ is running at the moment; I assume it'll take a while. Will Memtest be able to find an issue when the RAM corrects the errors anyway?"
I am not familiar with the Memtest86+ variant myself, but I mentioned it anyway since it recently received a big update.
Memtest will take hours; how long depends (among other things) on the amount of RAM in your system.

Memtest will show you a message if it encounters ECC errors, even if corrected. Counterintuitively, the error counter might not increment, though. See attached image for what it might look like.
image_2828.jpg
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Memtest just displayed a big PASS in green text with a black background, but it appears to still be running. Should I let it continue running a second time?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Let it complete at least a few full cycles (make it run for a few days).
My humble opinion is that you are experiencing RAM errors that ECC is correcting. You could try reseating the CPU.
Please share your full hardware specs.
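
The "corrected by ECC" reading can be cross-checked against the raw status words in the log. Here is a rough sketch that decodes the flag bits of an MCx_STATUS value. The bit positions are my assumption of the AMD MCA_STATUS layout as decoded by the Linux kernel, so verify them against your kernel or CPU documentation; they do at least reproduce the flags printed in the posted log.

```python
# Flag name -> bit position in the 64-bit status word.
# Assumption: AMD MCA_STATUS layout as decoded by the Linux kernel.
STATUS_BITS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # overflow: at least one more error was not logged
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # address register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid
    "CECC": 46,   # corrected by ECC
    "UECC": 45,   # detected by ECC but not corrected
}


def decode_status(value):
    """Return the names of the flags set in a status word, highest bit first."""
    return [name for name, bit in STATUS_BITS.items() if (value >> bit) & 1]


if __name__ == "__main__":
    for status in (0xDC2040000000011B, 0x9C2040000000011B):
        print(f"{status:#018x}: {', '.join(decode_status(status))}")
```

In both values from the log, CECC is set while UC and UECC are clear, which fits the "Corrected error, no action required" lines. The only difference between the two values is the Over bit.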
 

neofusion

Contributor
Joined
Apr 2, 2022
Messages
159
Zain said:
"Memtest just displayed a big PASS in green text with a black background, but it appears to still be running. Should I let it continue running a second time?"
That was fairly quick.
I would let it run at least 1 more full pass. Personally I aim for 4 passes unless I have really suspect RAM that I want to test more thoroughly.

Having no errors in Memtest86+ but errors in TrueNAS results in a dilemma. It's possible that Memtest86+, while recently updated from its previous dilapidated state, does not actually deal with ECC RAM well, but that would be surprising. Perhaps try the other Memtest86 program for a third opinion? I'm not sure what else to suggest.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Also do note that, as far as I am aware, the free version of memtest does not test ECC.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Zain said:
"Memtest just displayed a big PASS in green text with a black background, but it appears to still be running. Should I let it continue running a second time?"

Typically we bench test systems for at least a month before releasing them to customers, requiring no errors in that time. If you are having problems, you should probably make sure that you have a clean bill of health for at LEAST that amount of time, if not more.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Davvo said:
"Also do note that, as far as I am aware, the free version of memtest does not test ECC."
But it will log ECC errors it stumbles upon.

It's abundantly clear that there are a lot of ECC errors going on, so you'll want to narrow it down. Test one DIMM at a time, try a different CPU, test the DIMMs and CPU in a different motherboard, etc.
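
As a starting point for narrowing it down, the log itself can be tallied by machine-check bank. This is a sketch over a few status lines copied from the opening post. The function name `tally_banks` is illustrative, and the regular expression only assumes the `MCnn_STATUS[...]` format seen in that log.

```python
import re
from collections import Counter

# Matches e.g. "MC15_STATUS[Over|" and captures the bank number and first flag.
STATUS_RE = re.compile(r"\bMC(\d+)_STATUS\[([^|\]]+)\|")

# A few status lines copied from the posted log.
SAMPLE = [
    "2022 Dec 22 10:34:24 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b",
    "2022 Dec 22 10:34:24 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b",
    "2022 Dec 22 10:39:36 truenas [Hardware Error]: CPU:0 (17:1:1) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b",
    "2022 Dec 22 10:39:36 truenas [Hardware Error]: CPU:0 (17:1:1) MC16_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b",
]


def tally_banks(lines):
    """Count logged errors, and how many carried the overflow flag, per bank."""
    banks = Counter()
    overflows = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if not match:
            continue
        bank = int(match.group(1))
        banks[bank] += 1
        if match.group(2) == "Over":
            overflows[bank] += 1
    return banks, overflows


if __name__ == "__main__":
    banks, overflows = tally_banks(SAMPLE)
    for bank in sorted(banks):
        print(f"MC{bank}: {banks[bank]} logged, {overflows[bank]} with overflow")
```

In the posted log only MC15 and MC16 appear, and they report different IPID values, which suggests two distinct memory controller instances are involved. The overflow flag means more errors occurred than were logged, so the per-bank counts are a lower bound.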
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
@Zain
Please post your full system specs. It looks like you are using a Threadripper CPU, 128GB RAM, etc. But no real details.

Post which version of Scale you are using.

Had the system been running fine for a few weeks before these errors started? If not, when did they start with respect to the latest configuration change?

It appears the ECC errors you are having are all over the place, in other words likely in multiple memory modules. So we need to know the motherboard, which slots the RAM modules are plugged into, and the capacity of each (make/model would be nice).

Take a look at this reference; it may help you identify the suspect module, assuming it's just one module and not, for example, a bad CPU, power, or motherboard. As I said, it looks like the errors are all over the place: most are in the 1.9GB to 8.6GB range but a few are up in the 31GB area, which is very odd. This is why it would be helpful to know what hardware you have. If you have 32GB RAM modules, it is possible one module could be your issue. Also, all the addresses are EVEN, not ODD.
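
To see where the reported addresses fall, they can be converted to GiB. The sketch below uses a handful of "Error Addr" values copied from the posted log; `to_gib` is just an illustrative helper. Note the assumption baked in: the reported address may be local to the memory controller rather than a system physical address, so it may not map directly onto a particular DIMM.

```python
# A selection of "Error Addr" values copied from the posted log.
ERROR_ADDRS = [
    0x0000000064AB7A00,
    0x00000000762F0CC0,
    0x00000000B0E2F700,
    0x00000001311B1BC0,
    0x0000000201EA0AC0,
    0x0000000692C77340,
    0x00000007DF236500,
    0x00000007FF9D03C0,
]


def to_gib(addr):
    """Convert a byte address to GiB (2**30 bytes)."""
    return addr / 2**30


if __name__ == "__main__":
    for addr in sorted(ERROR_ADDRS):
        parity = "even" if addr % 2 == 0 else "odd"
        print(f"{addr:#018x}  {to_gib(addr):6.2f} GiB  {parity}")
```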

How long did it take MemTest86 to run a complete pass on your RAM? Based on your posting times it seems like it might have been about 4 hours. That is awfully fast for a complete test of 128GB. Maybe you are looking at the wrong PASS information. With that much RAM and known RAM issues, you should be running this test for at least 48 hours, if not longer. If your system was up and running for 3 weeks and then the errors started, maybe test for four weeks.

So maybe you are on a beta of Scale; this could be having an effect. I'm trying to make it obvious that some of us try to consider every possibility, but without knowing any details about your system, it's difficult to provide great advice. Running MemTest86 is great advice based on the information provided and the best first step to isolating a RAM problem.

You should examine the dmidecode output to identify the memory banks, just in case it is a RAM module.
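
As a rough sketch of what to look for, the snippet below parses "Memory Device" blocks from `dmidecode --type memory` style output into dictionaries, so each slot's locator and size can be listed. The sample text is invented for illustration and is not from the poster's system; real output has more fields and the locator names vary by board.

```python
# Invented sample in the style of `dmidecode --type memory` output.
# On a real system, run that command as root and feed its text in instead.
SAMPLE_OUTPUT = """
Handle 0x0039, DMI type 17, 40 bytes
Memory Device
    Size: 16 GB
    Locator: DIMM 0
    Bank Locator: P0 CHANNEL A
    Type: DDR4

Handle 0x003B, DMI type 17, 40 bytes
Memory Device
    Size: No Module Installed
    Locator: DIMM 1
    Bank Locator: P0 CHANNEL A
    Type: Unknown
"""


def parse_memory_devices(text):
    """Return one dict of 'Key: value' fields per 'Memory Device' block."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            current = {}
            devices.append(current)
        elif not line:
            current = None  # a blank line ends the current block
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices


if __name__ == "__main__":
    for dev in parse_memory_devices(SAMPLE_OUTPUT):
        print(f"{dev.get('Bank Locator')} / {dev.get('Locator')}: {dev.get('Size')}")
```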

Here is a good link for a little light reading. It might help you or just put some things into perspective.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Thanks for the responses here. I'm not sure how long the errors have been happening. I ended up re-installing TNS yesterday so that I could move TN onto SSDs that I installed in the NVMe slots in order to free up some ports on the HBA. I'm not sure if the logs from the old install will still be there or not. Looking at the debug file I just downloaded, it appears that the logs only go as far back as the new install =( There's also a messages.1 file that shows some older information, though (no ECC errors that I saw). I'll attach both here.

Mobo: Asrock Fatal1ty X399 Professional Gaming
CPU: TR 1950x
RAM: QTY 8 of AA335286 16GB ECC UDIMM DDR4-2666 PC4-21300 (ebay)
PSU: Corsair HX1200
GPU: Quadro P4000
HBA: LSI 9201-16i
NIC: HP NC552SFP Dual Port 10GB SFP+ Adapter (Aggregate 20GB to Unifi 16-XG switch via fiber)

I noticed that if I only have half of the RAM installed (in the appropriate slots per the IOM), the errors don't appear to occur. As soon as I add a 5th or 6th stick, the errors appear almost immediately, and it doesn't seem to matter which other slots the sticks are installed in (D1/C1/A1/B1).
1671804938433.png


Side thought, are there RAM settings in the BIOS that should be altered in order to accommodate ECC ram?

Thanks
 

Attachments

  • messages.rar
    614.8 KB

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Looks like the issue is related to "channel" 1 then. Improper contact (too high or too low) between the CPU and the socket could be the issue: make sure to apply even force when closing the lid/tightening the screws.

In some motherboards you get the option to activate or deactivate ECC.

How old is your PSU?
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Davvo said:
"Looks like the issue is related to "channel" 1 then. Improper contact (too high or too low) between the CPU and the socket could be the issue: make sure to apply even force when closing the lid/tightening the screws.

In some motherboards you get the option to activate or deactivate ECC.

How old is your PSU?"
Yeah, my research is suggesting the same thing. Odd that it started occurring after installing TNS on new SSDs though. Coincidence?

PSU is just a couple of months old. Upgraded from a Corsair RMX800 because of the number of drives, and I wanted a little extra efficiency because this thing runs 24/7.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Zain said:
"Yeah, my research is suggesting the same thing. Odd that it started occurring after installing TNS on new SSDs though. Coincidence?"
Well, I would say yes but I have not studied the mobo manual.
You could try switching to another M.2 slot if you can and using another SSD. See if something changes.
Zain said:
"PSU is just a couple of months old. Upgraded from a Corsair RMX800 because of the number of drives, and I wanted a little extra efficiency because this thing runs 24/7."
Then it's probably unrelated to the issue.
 