4tb drives x 8 or 10 with raidz2?

Status
Not open for further replies.

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Um...

There are no fifth-gens, but there are a handful of fourth-gens...

Okay, they do exist, but with the exception of the 4570TE those are all FCBGA embedded parts; he's not just going to drop one of those into a socket. The 4570TE is a mobile chip, probably won't be well supported on a desktop/server board, and will probably cost more than an equivalent Xeon or i3.

Moral of the story is "buy something with ECC support."
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Could have sworn I saw 4GB as the minimum back on the 9.1 branch, but looking at the old docs I see 6GB as far back as 8.0.1

I believe it was around 8.2. I had been watching the situation for a long while and came to the realization that we weren't seeing the problems with 8GB systems that appeared to be plaguing the 4GB and 6GB platforms. I talked it over with some of the others here on the forums and we agreed that, despite not really understanding the WHY of it, nobody here wanted to tell people that something we knew to be eating pools was recommended. So I also changed it in the FreeNAS Handbook at that time.
 

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17
There actually can be a waste of space if you use ashift=12 (which most people with modern consumer disks should be using).

See this with detail: https://web.archive.org/web/2014040...s.org/ritk/zfs-4k-aligned-space-overhead.html

For example, with 12x4TB RAIDZ2 there is a waste of 2.91TiB due to alignment padding and allocation overhead.


Though I should also note that there are ways to deal with this. For example, if you change the recordsize on your datasets (from the default 128 KiB) to 1 MiB and make sure compression is enabled, you can avoid pretty much all of this alignment padding and allocation overhead.

But if you stick with 128 KiB blocks and ashift=12 then you will always have some amount of wasted space unless you are using 6 or 18 disks in your RAIDZ2. With 6 and 18 disks the alignment padding and allocation overhead are zero, and the only overhead comes from the reserved space for metadata, which will always, always be there.

I should also note that adding another disk to your vdev will always increase your space. You never lose space by adding a disk, but at some vdev widths you don't gain as much from adding one more disk as you do at others.
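For concreteness, here's a minimal sketch of the settings being described; the pool and dataset names are hypothetical, and only data written after the change uses the new recordsize:

Code:
# Set a 1M recordsize and enable compression on a dataset (names are hypothetical)
zfs set recordsize=1M tank/media
zfs set compression=lz4 tank/media

# Verify the properties took effect
zfs get recordsize,compression tank/media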

Very interesting. So, if I change my vdev's record size to 1024K instead of Inherit (via Edit Options), there should be no lost space anymore? Less at least?
My pool consists of 8x2TB hard drives formatted with one vdev in RAIDZ2. Thanks.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Very interesting. So, if I change my vdev's record size to 1024K instead of Inherit (via Edit Options), there should be no lost space anymore? Less at least?
My pool consists of 8x2TB hard drives formatted with one vdev in RAIDZ2. Thanks.

Dataset, not vdev. Vdevs don't have settings or properties.

Here I set up a real example on my ZFS. Note that my zpool is a 12-disk RAIDZ2.

According to a spreadsheet I made, a 12-disk RAIDZ2 configuration has a 9.375% overhead with a 128K recordsize, and a 0.586% overhead with a 1M recordsize.

https://docs.google.com/spreadsheet...J-Dc4ZcwUdt6fkCjpnXxAEFlyA/edit#gid=804965548

Code:
root@nick-server:/# zfs get recordsize nickarrayold/test/128k
NAME					PROPERTY	VALUE	SOURCE
nickarrayold/test/128k  recordsize  128K	 local
root@nick-server:/# zfs get recordsize nickarrayold/test/1M
NAME				  PROPERTY	VALUE	SOURCE
nickarrayold/test/1M  recordsize  1M	   local
root@nick-server:/# cd /nickarrayold/test/128k/
root@nick-server:/nickarrayold/test/128k# ls -la
total 1024483
drwxr-xr-x 2 root root		  3 Mar 11 21:03 .
drwxr-xr-x 4 root root		  5 Mar 11 21:02 ..
-rw-r--r-- 1 root root 1048576000 Mar 11 21:03 1GB.bin
root@nick-server:/nickarrayold/test/128k# du
1024483 .
root@nick-server:/nickarrayold/test/128k# du --apparent-size
1024001 .
root@nick-server:/nickarrayold/test/128k# cd ..
root@nick-server:/nickarrayold/test# cd 1M/
root@nick-server:/nickarrayold/test/1M# ls -la
total 941578
drwxr-xr-x 2 root root		  3 Mar 11 21:03 .
drwxr-xr-x 4 root root		  5 Mar 11 21:02 ..
-rw-r--r-- 1 root root 1048576000 Mar 11 21:03 1GB.bin
root@nick-server:/nickarrayold/test/1M# du
941577  .
root@nick-server:/nickarrayold/test/1M# du --apparent-size
1024001 .
root@nick-server:/nickarrayold/test/1M# zfs list nickarrayold/test/128k
NAME					 USED  AVAIL  REFER  MOUNTPOINT
nickarrayold/test/128k  1001M   124G  1001M  /nickarrayold/test/128k
root@nick-server:/nickarrayold/test/1M# zfs list nickarrayold/test/1M
NAME				   USED  AVAIL  REFER  MOUNTPOINT
nickarrayold/test/1M   920M   124G   920M  /nickarrayold/test/1M


As you can see, I write a 1GB file to the 128K recordsize dataset and it takes up USED=1001M, but when I write it to the 1M recordsize dataset it only takes up USED=920M of space.

The file is still 1GB either way; du --apparent-size reports the same size for both datasets, while du shows the difference in allocated space.

Remember that ZFS assumes a 128K recordsize, so upon zpool creation it has already reduced the reported capacity of the pool on the assumption that you will be writing 128K records to it. This is why using the more efficient 1M recordsize makes files appear to take up less space than they really are. The end result here is that I can store over 8% more data on my pool when the files are stored with a 1M recordsize. If my pool is 32TB capacity, that's an extra 2.56TB of space to use.
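If you want to reproduce this kind of comparison on your own pool, here's a rough sketch assuming a pool named tank mounted under /mnt; the dataset names and the 1GB test file are made up to mirror the output above:

Code:
# Two test datasets that differ only in recordsize
zfs create -p -o recordsize=128K tank/test/128k
zfs create -o recordsize=1M tank/test/1M

# Write the same ~1GB of incompressible data into each
dd if=/dev/urandom of=/mnt/tank/test/128k/1GB.bin bs=1M count=1000
dd if=/dev/urandom of=/mnt/tank/test/1M/1GB.bin bs=1M count=1000

# Compare allocated space (USED) against the apparent file size
# (on FreeBSD the apparent-size flag for du is -A instead of --apparent-size)
zfs list tank/test/128k tank/test/1M
du --apparent-size /mnt/tank/test/128k/1GB.bin /mnt/tank/test/1M/1GB.bin

Note that a recordsize above 128K requires the large_blocks pool feature.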
 

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17

Thanks a lot for the comprehensive example, spreadsheet and remarks. It is helpful and I will definitely test writing some large files to check the actual space they take up in datasets set to a 1M recordsize.

Meanwhile, in the GUI, should the space availability number change after changing a dataset's recordsize? In my case it did not:

[Attached screenshot: Screen Shot 2018-03-18 at 16.25.52.png]


P.S. My confusion in terminology comes from what I see in the FreeNAS GUI as well. While referring to @cyberjock's FreeNAS Guide 9.10, I was boldly sure that by creating a pool/volume the GUI formats a vdev (in RAIDZ1/2/3 etc.) and that datasets go inside the vdev. Apparently not.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Meanwhile, in the GUI, should the space availability number change after changing a dataset's recordsize? In my case it did not:

[Attached screenshot: Screen Shot 2018-03-18 at 16.25.52.png]

No, it's not quite that simple. A dataset can have files of multiple recordsizes at once. When you change the recordsize of a dataset, it doesn't change the existing data on the dataset; it only changes how new incoming files will be written. And you could then change it again later. Either way, the ZFS developers decided to always do the space/free calculation assuming a 128K recordsize.
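A minimal sketch of what that means in practice (the dataset and file paths below are hypothetical): because only newly written blocks use the new recordsize, existing files have to be rewritten to pick it up, for example by copying them in place.

Code:
# Change the recordsize for future writes on this dataset
zfs set recordsize=1M tank/media

# An existing file keeps its old block size until it is rewritten
cp /mnt/tank/media/big.iso /mnt/tank/media/big.iso.tmp
mv /mnt/tank/media/big.iso.tmp /mnt/tank/media/big.iso

Keep in mind that any snapshots still reference the old blocks, so the space saving only shows up once those snapshots are destroyed.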

P.S. My confusion in terminology comes from what I see in the FreeNAS GUI as well. While referring to @cyberjock's FreeNAS Guide 9.10, I was boldly sure that by creating a pool/volume the GUI formats a vdev (in RAIDZ1/2/3 etc.) and that datasets go inside the vdev. Apparently not.


At the base are your disks. Disks are assembled into vdevs, and vdevs are merged together into a zpool. Datasets are virtual slices of the zpool.
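A minimal command-line sketch of that hierarchy; the pool, disk, and dataset names are hypothetical, and the FreeNAS GUI does the equivalent for you when you create a volume:

Code:
# Six disks assembled into one RAIDZ2 vdev, which makes up the pool "tank"
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# Datasets are carved out of the pool, not out of an individual vdev
zfs create tank/media
zfs create tank/backups

# The disk -> vdev -> pool layout is visible with:
zpool status tank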
 
Last edited:

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17
Thanks again.

But if you stick with 128 KiB blocks and ashift=12 then you will always have some amount of wasted space unless you are using 6 or 18 disks in your RAIDZ2. With 6 and 18 disks the alignment padding and allocation overhead are zero, and the only overhead comes from the reserved space for metadata, which will always, always be there.

What size of reserved space should be expected in a 6-disk RAIDZ2 configuration?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Can you define "reserved space", please? Because it can be a lot of things.
 

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17
That's why I was quoting @SirMaster's post. I can only speculate.
My main concern is that I see 7 TiB of available space in a 6x2TB disk RAIDZ2 configuration. That is 7.696 TB, which is more than 300 GB short of the expected 8 TB, given that the overhead is supposed to be zero in this particular situation. Is that the mentioned "reserved space", or does even the most optimal configuration have some wasted space, like they say in this post?
Btw, I see slightly different numbers in the shell (7.1 TiB).

Also, a freshly formatted pool w/o any shares is already filled with 701 MiB of some kind of data, and I saw it gradually moving up. Is that metadata?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Oh sorry, I didn't see that it wasn't my post in the quote.

So yeah, the reserved space for metadata is 1/64 of the total space, so about 1.6%. That's the case whatever your configuration is; as @SirMaster said, it'll always be there.
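As a rough worked example of that figure (the ~8 TB of data space for a 6x2TB RAIDZ2 is an assumption, before any other overhead):

Code:
echo "scale=6; 100/64" | bc   # => 1.562500  (% of the pool reserved)
echo "scale=3; 8/64" | bc     # => .125      (TB, i.e. roughly 125 GB on ~8 TB of data space)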

Also, a freshly formatted pool w/o any shares is already filled with 701 MiB of some kind of data, and I saw it gradually moving up. Is that metadata?

I guess it's the system dataset plus the snapshots you're seeing here, but it's only a guess; I don't have enough info to be more precise/sure.
 
Last edited:

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
While searching for the metadata/CoW overhead numbers to understand this 1/32 and 1/64 mess, I found this post: https://forums.freenas.org/index.ph...act-checksum-size-overhead.28187/#post-183802 and it's you who told me there's 1/64 of the space reserved for metadata, so now I'm lost.

Now I also wonder why you can't delete files if you fill your pool to 100%... because if there's reserved space for CoW, you should be able to do it, no?
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
While searching for the metadata/CoW overhead numbers to understand this 1/32 and 1/64 mess, I found this post: https://forums.freenas.org/index.ph...act-checksum-size-overhead.28187/#post-183802 and it's you who told me there's 1/64 of the space reserved for metadata, so now I'm lost.

Now I also wonder why you can't delete files if you fill your pool to 100%... because if there's reserved space for CoW, you should be able to do it, no?

Actually I meant to link this line:
https://github.com/freebsd/freebsd/...opensolaris/uts/common/fs/zfs/spa_misc.c#L397

It has a longer description of the setting.

It used to be 1/64, and was increased to 1/32.

You can see the commit here when it was changed:
https://github.com/freebsd/freebsd/...bfebce2#diff-87835b08b398201fc148599fb1cba189

I realize this shows it was changed before my post mentioning 1/64 that you are referencing. This just shows the change in the FreeBSD codebase though, not when it actually shipped in a stable FreeBSD release or when the code was pulled into FreeNAS and released.

Also, I know that ZFS on Linux, which I use myself, was still at 1/64 back in 2015 when I wrote that other post, but it got the code change upstream from Illumos to switch to 1/32 sometime later.

This change should have made deleting files from a 100% full pool much more likely to succeed. I ran into a 100% full pool recently and was able to delete some files just fine FWIW.

Also note that this spa_slop_shift value is user configurable. The default is currently 5, as in 1/2^5 (= 1/32), but you could change it to 6, so 1/2^6 (= 1/64), and this will change your pool's capacity accordingly.
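For reference, a sketch of how one can inspect and change it; the exact sysctl and module-parameter names below are my assumption for FreeBSD/FreeNAS and ZFS on Linux respectively, so check your platform first:

Code:
# FreeBSD / FreeNAS (assumed sysctl name)
sysctl vfs.zfs.spa_slop_shift        # show current value (default 5 => 1/32 reserved)
sysctl vfs.zfs.spa_slop_shift=6      # reserve 1/64 instead

# ZFS on Linux (assumed module parameter path)
cat /sys/module/zfs/parameters/spa_slop_shift
echo 6 > /sys/module/zfs/parameters/spa_slop_shift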
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok, now I understand, thanks ;)

I'll add changing the value and the name to the calculator's todo list.
 