Purposefully corrupt a disk to test self-healing

Stromkompressor

Dabbler
Joined
Mar 13, 2023
Messages
18
I have a pretty fresh TrueNAS SCALE installation and want to test it and put it under some load before trusting it with my data (ofc backups will also be done).

My configuration is RAIDZ2 with 4 x 6 TB HDDs. I have generated several 200-300 GB files with "cat /dev/random > file.out". Right now I am copying all these files again and again to fill up my ~10 TB pool to at least 50%.
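Side note for anyone copying this: "cat /dev/random" keeps writing until the dataset is full or you interrupt it. For files of a fixed size, something like this should also work (size and path are just examples):
# write one ~250 GB file of pseudo-random data (GNU head supports the G suffix)
head -c 250G /dev/urandom > /mnt/POOL/testfiles/file01.out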

Once I've got some data, I plan to calculate checksums for all files. I think I'll use cksum instead of md5sum because it might be faster, but I'll verify that before doing this. After all, the I/O might be the bottleneck here. Then I want to "trash" one disk. I don't know the consequences of simply running "dd if=/dev/random of=/dev/sda" while TrueNAS is running. That's why I think I should pull that drive out and do the dd on another computer (ofc not sda in that case).
Then I'll put the disk back into my NAS, see what TrueNAS thinks of it, and explore a bit what it looks like. I will try to make the disk "self heal", pay attention to how long it takes, and then verify that all files match the prerecorded checksums.
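For recording the checksums, I'm thinking of something along these lines (paths are just placeholders):
# record a checksum for every file before corrupting the disk
find /mnt/POOL/testfiles -type f -exec cksum {} \; | sort -k 3 > /root/checksums.before
# after the "self heal", re-run it and compare
find /mnt/POOL/testfiles -type f -exec cksum {} \; | sort -k 3 > /root/checksums.after
diff /root/checksums.before /root/checksums.after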

Is there anything "wrong" with this test? Once that test is successful, I want to repeat it with 2 (other) disks to see if RAIDZ2 lives up to its promises.

I trust ZFS and TrueNAS, but now I have the opportunity to do all this, and it puts some load on the disks, which will be great for "burning" them in.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Once I've got some data, I plan to calculate checksums for all files
Since ZFS calculates and stores checksums of every block of data and metadata, this seems entirely unnecessary, but I guess if you want to, go ahead.
 
Joined
Jun 15, 2022
Messages
674
I did a bunch of tricky testing stuff with a hot-swap SAS system and TrueNAS alerted right away with an accurate error (unlike Windows Server), kept running fine (RAID-Z3), and self-healed depending on what had been done to it. Good stuff.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
There are tricky things you can do for testing and recovery.

Yes, you can do the dd if=/dev/random bs=64k of=/dev/sda2 live. ZFS won't detect the bad data unless it is being read or a pool scrub is performed.

NOTE: TrueNAS uses partitioned drives, so you MUST use the pool data partition. Otherwise you wipe the partition table and then actually have to perform a disk replacement.
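If you are not sure which partition that is, something like this should show it (pool name and disk are placeholders):
# show the real device paths the pool members resolve to
zpool status -LP POOL
# list the partitions on the disk with their sizes
lsblk -o NAME,SIZE,TYPE /dev/sda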

There are 2 ways to "fix" this "bit rot":
- Perform pool scrub
- Read all the files, like below
find /mnt/POOL -type f -print -exec dd if={} bs=64k of=/dev/null \;

Both should cause the output of zpool status POOL to show the fixes. You can watch the pool counters go up using:
watch -n 10 zpool status POOL

One word of caution: if ZFS detects too many errors on a device, it may toss the disk out of the pool and declare it failed. I don't know the threshold, nor what triggers it uses.


Edit:
A quick test on a TrueNAS SCALE VM indicates that this test methodology works. The induced corruption shows as checksum errors, which are auto-corrected if the file is read.

But, to get rid of all the induced corruption, you have to run a pool scrub. This is due to metadata (like directory entries) having 2 copies. In my case of a mirrored pair, that ends up as 4 copies. Since a simple read only needs access to 1 copy, the other 3 need to be verified with a pool scrub.
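In other words, the full clean-up boils down to something like (pool name is a placeholder):
zpool scrub POOL        # re-checks and repairs every copy, including the extra metadata copies
zpool status -v POOL    # -v also lists any files with unrecoverable errors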
 
Last edited:

Stromkompressor

Dabbler
Joined
Mar 13, 2023
Messages
18
NOTE: TrueNAS uses partitioned drives, so you MUST use the pool data partition. Otherwise you wipe the partition table and then actually have to perform a disk replacement.
Ok, then I'll do that with the pool data partition first. But after that I'd also like to simulate a disk replacement. Basically, I want to try this guide: https://www.truenas.com/docs/core/coretutorials/storage/disks/diskreplace/ It is written for Core but I guess it should work on SCALE too.

So:
1. Wipe the partition table
2. TrueNAS recognizes that this disk is trash
3. Take it offline
4. Replace it with itself? Is that even possible?
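Roughly, I imagine the CLI side of this would be something like the following (device names and the guid are just placeholders; normally I'd click Replace in the web UI, which also re-partitions the disk):
wipefs -a /dev/sda                          # 1. wipe the partition table (destroys everything on the disk!)
zpool status POOL                           # 2. see what ZFS now thinks of that member
zpool offline POOL <old-member-guid>        # 3. take it offline
zpool replace POOL <old-member-guid> <new-data-partition>   # 4. resilver onto the re-partitioned disk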

One word of caution, if ZFS detects too many errors on a device, it may toss the disk out of the pool and declare it failed. I don't know the threshold, nor what triggers it may use.
Once it tosses out the disk, it should be the same disk-replacement procedure as above, right?
 

Stromkompressor

Dabbler
Joined
Mar 13, 2023
Messages
18
Documenting my progress:
  • The threshold before it reports "Too many errors" is around 10-20 errors.
  • At some point ZFS (or TrueNAS?) started an automatic scrub even though my next scheduled scrub would be in 6 days. Maybe this was triggered by me offlining and onlining disks.
  • The number of errors is enormous (currently 400,000 but growing quickly). Earlier, I had overwritten 2.4 TB of sda2 with garbage.
  • Currently the estimated time until the scrub is done is 19 hours, but this is decreasing quickly.
  • All alerts were instantly sent via email and were accurate.
Cool to see when it's doing some work :)
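Once the scrub is done and everything checks out, I assume I can reset the counters with something like:
zpool clear POOL    # clears the pool's error counters (pool name is a placeholder)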

Edit: Currently at 5,000,000 errors :)
 
Last edited:

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Yes, ZFS will re-sync / scrub the pool for a disk that is off-lined and then on-lined. In theory this is a quick re-sync, because ZFS knows when the disk was off-lined and what has changed since then. But that is too detailed for me to bother with. And your use case of corrupting the disk may have caused it to genuinely need more work, hence a full disk replacement or full pool scrub.
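For reference, the off-line / on-line cycle itself is just something like (pool and device are placeholders):
zpool offline POOL <device>    # pool keeps running, but degraded
zpool online POOL <device>     # brings it back and starts a (normally quick) resilver
zpool status POOL              # shows the resilver / scrub progress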
 