RAM memory scrub option?

AlainD · Aug 16, 2014

Hi

One theme that's very regularly mentioned is the need for ECC-ram and even the need to follow that up (via ecc-ram logs).
Unfortunately it's very difficult for us mortals to check ECC-ram .

ZFS adds checksums to it's data written to storage and has facilities to check it.
Why not add this to the ram based cache?
The possibility to scrub the cache would take out some worries with ECC-ram users and probably give non-ECC ram users a faster warning, hopefully before disaster.

It doesn't has to scrub all the time or high speed, but from time to time (daily?) would be good enough for 99% of all uses and much better than now.

Ericloewe · Aug 16, 2014

This has been discussed to death. Several times.

ECC is not the boogeyman most people (myself included, until fairly recently) think it is. If you don't believe me, go check prices on Amazon.

The following thread contains the discussion on ECC:

http://forums.freenas.org/index.php?threads/ecc-vs-non-ecc-ram-and-zfs.15449/

AlainD · Aug 16, 2014

Hi

I'm well aware of what can happen if zfs is not using ecc-ram, although I find it an design fault (there's no ecc inside most CPU's for instance).
I'm not talking about a way to drop ECC-ram, I'm using it also (and considered moving to a xeon on my main workstation for it).
I'm talking for those that have ECC-ram, but are still not sure. It's very difficult to test ecc-ram and a regular test would give me less worries.

Ericloewe · Aug 16, 2014

You mean something like memtest86+?

The test is only useful if you can force a memory error - which you can't, not without endangering yourself and your hardware.

Intel CPUs have an obscure method for determining whether ECC is active, but AMD CPUs don't.

cyberjock · Aug 16, 2014

AlainD said:
Hi

I'm well aware of what can happen if zfs is not using ecc-ram, although I find it an design fault (there's no ecc inside most CPU's for instance).

That is accurate but also very inaccurate. Literally EVERY Intel CPU (including desktop CPUs AFAIK) that has L1/L2/L3 caches has parity for those caches in the last 5+ years (and probably more like 20+ years). In fact I believe we've had a handful of people that have had ECC cache errors reported due to CPU errata with virtualbox. If you read up on ECC RAM there are spacecraft that were launched in the mid 1990s with ECC RAM. This technology is not new by any stretch of the imagination. The CPU caches have parity because parity can identify corrupt bits but cannot correct for them. This isn't a big deal because the CPU cache should be nothing more than a copy of what is in RAM. In the event of a parity failure your OS should log the parity error and the CPU will simply retrieve a good copy of the data from RAM.

Now if AMD doesn't, that's totally their fail. I don't particularly like AMD and I'm not up to snuff to agree or disagree with what AMD's designs do. I'd consider this one of the stupidest things they could possibly do if their caches don't have some form of data protection. In fact, if they really are that stupid then they probably deserve to go out of business for that kind of blantant stupidity with regards to engineering principles.

AlainD said:
I'm not talking about a way to drop ECC-ram, I'm using it also (and considered moving to a xeon on my main workstation for it).
I'm talking for those that have ECC-ram, but are still not sure. It's very difficult to test ecc-ram and a regular test would give me less worries.

You aren't understanding the big picture. ECC RAM tests itself, automatically, when it is accessed. EVERY SINGLE bit of data that comes into the memory controller (which is the direct link between the RAM and everything else) does the ECC check if enabled and if supported with appropriate hardware. There is *nothing* that needs to be done on your part after that.

ECC scrubbing is for systems that are absurdly large and have huge sums of RAM that may not even be accessed every day. For those systems it's within the realm of statistical chance that you might have a single bit error that hasn't been found because you haven't read that RAM location in days, weeks, or maybe even years. *If* a second error were to occur in that block then you would be unable to correct it with ECC (remember, ECC can correct single-bit errors but only identify multi-bit errors). Scrubbing of the RAM has the system slowly go from the beginning to the end and literally read all the RAM. Why? Because if you have to read it then it must, therefore, be checked by the memory controller. The "hope" is that you correct single-bit errors (which you can correct) before they become multi-bit errors (which you can't correct and halts the system). ECC scrubbing is nothing more than a simple way to test RAM while the system is actually operating. There is a performance penalty, but for many situations it is more important to be reliable than to be fast.

If you are shooting for uptime measured in years then you absolutely want scrubbing. On the other hand if you are one of us mere mortals that has uptime measured in weeks or months because you *do* like to upgrade FreeNAS (and therefore a reboot will be necessary) then scrubbing is much less important. The likelihood of two bitflips in the same 64-bit row is extremely slim except if the RAM has suddenly failed outright. Of course, that kind of failure cannot be predicted and there is no amount of diagnostics that could ever identify or fix the problem anyway.

SirMaster · Aug 16, 2014

If you are really set on trying to use non-ECC RAM then there are some steps that you can take to help minimize problems caused by memory errors, like enabling the ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). What this flag does is checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.

I mean it's still not ideal, but it's better than using ZFS with the default checksum behavior if you really really insist on using non-ECC RAM for whatever reason.

You can see the core developer and main developer of ZFS mention it's use here:

http://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271

cyberjock · Aug 16, 2014

Hate to break it to you SirMaster, but it doesn't do what you think it does. ;)

SirMaster · Aug 16, 2014

It works like I think it does. I don't think you know what I think it does then. I have spoke with Matthew about it before in IRC. He wrote the subroutine that the flag invokes himself :)

It IS statistically *less* likely to cause data corruption on-disk from an error in RAM with it on than with it off. That's all I was trying to say.

cyberjock · Aug 16, 2014

Ok... keep telling yourself that. ;)

SirMaster · Aug 16, 2014

I'm not telling myself anything... I'm simply going off the source code and mathematical statistics which do not actually lie.

Unless Matthew was mistaken for some reason, but I would trust the guy who wrote ZFS over pretty much anyone else in the universe.

I've reviewed the subroutine myself in the past and I can clearly see what it changes.

Care to actually explain why it does not statistically reduce the likelihood of on-disk error given an equal probability of a memory error rather than just say "it doesn't do what you think it does". Saying that contributes absolutely nothing to the conversation...

I would be really interested to see proof as to why it wouldn't actually reduce the probability.

Knowltey · Aug 16, 2014

it doesn't do what you think it does

cyberjock · Aug 16, 2014

So let's say that you are 100% right. Let's say the odds improve by 1 in a billion(which is a major exaggeration, but go with it). So how much data is a billion bytes? "just" 1GB. So are you *really* going to argue that while adding "just" 9 decimals of protection that we'll assume came out of thin air by enabling an "unsupported" flag is a good idea? Really?

And you know who you *never* ask to solve a hardware problem? The software guy.

And do you know who you *never* ask to solve a software problem? A hardware guy.

You should realize something.. ECC is actually a major performance hit when you don't have dedicated silicone optimized to do that and nothing else. For it to do what you think it does the performance penalty would be enormous.

SirMaster · Aug 16, 2014

What does it do then? Enlighten me if I am making a mistake if you will. If you know that it does not do what I think it does then clearly you do know what it does do.

So please explain.

Knowltey · Aug 16, 2014

SirMaster said:
If you know that it does not do what I think it does then clearly you do know what it does do.

Logical fallacy. It is entirely possible to know that someone's car cannot do 400mph while not knowing that it's actual top speed as 134.6

You are saying this option goes 400mph. We're saying there is no way it is that effective, we just don't know exactly how effective, but definitely not as effective as you say it is.

SirMaster · Aug 16, 2014

If you have actually looked at the code (which would really be what you should be doing when it comes to using and especially understanding software) that the flag causes to run you will see how simplistic it is. There is basically nothing extra or complex in the alternate code path that would cause anything abnormal to go wrong just because it's "unsupported".

Knowltey · Aug 16, 2014

SirMaster said:
If you have actually looked at the code (which would really be what you should be doing when it comes to using and especially understanding software) that the flag causes to run you will see how simplistic it is. There is basically nothing extra or complex in the alternate code path that would cause anything abnormal to go wrong just because it's "unsupported".

If you have a memory error in your RAM with ZFS, the LAST thing you want to do is read it due to it possibly corrupting a corresponding disk spot as well by "correcting" the disk.

Your option is basically forcing it to be all read on a routine basis.

SirMaster · Aug 16, 2014

I never claimed how effective anything was.

All I said is that it reduces it, it may be only a very very small reduction in probability, but it's a reduction and that's all I ever though it did and all I ever said it did.

So for your statement that "it doesn't do what I though it does" to be correct, the flag would have to either reduce the probability of an on-disk error by exactly 0% or it would have to increase it. Otherwise it does actually do what I think it does, and that is reduce it, by some amount, any amount. So you have to know that it does not change it or that it increases it to call my original statement incorrect.

Knowltey · Aug 16, 2014

SirMaster said:
All I said is that it reduces it, it may be only a very very small reduction in probability, but it's a reduction and that's all I ever though it did and all I ever said it did.

Yeah, and we're saying it may not even do that and may even actually increase it.

cyberjock · Aug 16, 2014

Considering that if your non-ECC RAM goes bad it's not a single bit but a massive chain of them, unless you plan to have a solution that involved a couple dozen zeros of reduction in the probability, you haven't done anything worth of even examining. The outcome is still the same. A trashed unmountable pool and backups that are as good as trash. Considering this is the same result as not running the flag, what was gained? That's right.... not a damn thing.

All that crap I mentioned in the ECC vs non-ECC thread about how there is no substitute for ECC. Yep, I wrote that for a reason. You cannot solve with software a problem that is so interwoven with a hardware failure of this magnitude or randomness as to how it will strike and to what extent.

SirMaster · Aug 16, 2014

OK, that's fine if you want to make that claim, but you need to provide proof as to why it would not decrease it or increase it.

Because right now it's the source code itself plus also the main ZFS developer himself claiming that it should reduce it vs your word.

Can you explain why checksumming the data at rest and then verifying it a second time before committing it to disk is more likely or equally likely to cause an error than checksumming it as it passes through memory and not verifying it a second time?

Important Announcement for the TrueNAS Community.

RAM memory scrub option?

Contributor

Server Wrangler

Contributor

Server Wrangler

Inactive Account

Patron

Inactive Account

Patron

Inactive Account

Patron

Patron

Inactive Account

Patron

Patron

Patron

Patron

Patron

Patron

Inactive Account

Patron

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "RAM memory scrub option?"

Similar threads