4tb drives x 8 or 10 with raidz2?

Status
Not open for further replies.

diedrichg

Wizard
Joined
Dec 4, 2012
Messages
1,319

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Um...

There are no fifth-gens, but there are a handful of fourth-gens...

Okay, they do exist, but with the exception of the 4570TE those are all FCBGA embedded parts; he's not just going to drop one of those into a socket. The 4570TE is a mobile chip, probably won't be well supported on a desktop/server board, and will probably cost more than an equivalent Xeon or i3.

Moral of the story is "buy something with ECC support."
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Could have sworn I saw 4GB as the minimum back on the 9.1 branch, but looking at the old docs I see 6GB as far back as 8.0.1

I believe it was around 8.2. I had been watching the situation for a long while and came to the realization that we weren't seeing the problems with 8GB systems that appeared to be plaguing the 4GB and 6GB platforms. I talked it over with some of the others here on the forums and we agreed that, despite not really understanding the WHY of it, nobody here wanted to tell people that something we knew to be eating pools was recommended. So I also changed it in the FreeNAS Handbook at that time.
 

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17
There actually can be a waste of space if you use ashift=12 (which most people with modern consumer disks should be using).

See this with detail: https://web.archive.org/web/2014040...s.org/ritk/zfs-4k-aligned-space-overhead.html

For example, with 12x4TB RAIDZ2 there is a waste of 2.91TiB due to alignment padding and allocation overhead.


Though I should also note that there are ways to deal with this. For example, if you change the recordsize on your datasets (from the default 128 KiB) to 1 MiB and make sure compression is enabled, you can avoid pretty much all of this alignment padding and allocation overhead.

But if you stick with 128 KiB blocks and ashift=12 then you will always have some amount of wasted space unless you are using 6 or 18 disks in your RAIDZ2. With 6 and 18 disks the alignment padding and allocation overhead are zero, and the only overhead comes from the reserved space for metadata, which will always, always be there.

I should also note that adding another disk to your vdev will always increase your space. You never lose space by adding a disk, but at some vdev widths you don't gain as much from adding one more disk as you do at others.
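For concreteness, here's a minimal sketch of the settings being described; the pool and dataset names are hypothetical, and only data written after the change uses the new recordsize:

Code:
# Set a 1M recordsize and enable compression on a dataset (names are hypothetical)
zfs set recordsize=1M tank/media
zfs set compression=lz4 tank/media

# Verify the properties took effect
zfs get recordsize,compression tank/media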

Very interesting. So, if I change my vdev's record size to 1024K instead of Inherit (via Edit Options), there should be no lost space anymore? Less at least?
My pool consists of 8x2TB hard drives formatted with one vdev in RAIDZ2. Thanks.
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Very interesting. So, if I change my vdev's record size to 1024K instead of Inherit (via Edit Options), there should be no lost space anymore? Less at least?
My pool consists of 8x2TB hard drives formatted with one vdev in RAIDZ2. Thanks.

Dataset, not vdev. Vdevs don't have settings or properties.

Here I set up a real example on my ZFS. Note that my zpool is a 12-disk RAIDZ2.

According to a spreadsheet I made, a 12-disk RAIDZ2 configuration has a 9.375% overhead with a 128K recordsize, and a 0.586% overhead with a 1M recordsize.

https://docs.google.com/spreadsheet...J-Dc4ZcwUdt6fkCjpnXxAEFlyA/edit#gid=804965548

Code:
root@nick-server:/# zfs get recordsize nickarrayold/test/128k
NAME					PROPERTY	VALUE	SOURCE
nickarrayold/test/128k  recordsize  128K	 local
root@nick-server:/# zfs get recordsize nickarrayold/test/1M
NAME				  PROPERTY	VALUE	SOURCE
nickarrayold/test/1M  recordsize  1M	   local
root@nick-server:/# cd /nickarrayold/test/128k/
root@nick-server:/nickarrayold/test/128k# ls -la
total 1024483
drwxr-xr-x 2 root root		  3 Mar 11 21:03 .
drwxr-xr-x 4 root root		  5 Mar 11 21:02 ..
-rw-r--r-- 1 root root 1048576000 Mar 11 21:03 1GB.bin
root@nick-server:/nickarrayold/test/128k# du
1024483 .
root@nick-server:/nickarrayold/test/128k# du --apparent-size
1024001 .
root@nick-server:/nickarrayold/test/128k# cd ..
root@nick-server:/nickarrayold/test# cd 1M/
root@nick-server:/nickarrayold/test/1M# ls -la
total 941578
drwxr-xr-x 2 root root		  3 Mar 11 21:03 .
drwxr-xr-x 4 root root		  5 Mar 11 21:02 ..
-rw-r--r-- 1 root root 1048576000 Mar 11 21:03 1GB.bin
root@nick-server:/nickarrayold/test/1M# du
941577  .
root@nick-server:/nickarrayold/test/1M# du --apparent-size
1024001 .
root@nick-server:/nickarrayold/test/1M# zfs list nickarrayold/test/128k
NAME					 USED  AVAIL  REFER  MOUNTPOINT
nickarrayold/test/128k  1001M   124G  1001M  /nickarrayold/test/128k
root@nick-server:/nickarrayold/test/1M# zfs list nickarrayold/test/1M
NAME				   USED  AVAIL  REFER  MOUNTPOINT
nickarrayold/test/1M   920M   124G   920M  /nickarrayold/test/1M


As you can see, I write a 1GB file to the 128K recordsize dataset and it takes up USED=1001M, but when I write it to the 1M recordsize dataset it only takes up USED=920M of space.

The file is still 1GB either way; du --apparent-size reports the same size for both datasets, while du shows the difference in allocated space.

Remember that ZFS assumes a 128K recordsize, so upon zpool creation it has already reduced the reported capacity of the pool on the assumption that you will be writing 128K records to it. This is why using the more efficient 1M recordsize makes files appear to take up less space than they really are. The end result here is that I can store over 8% more data on my pool when the files are stored with a 1M recordsize. If my pool is 32TB capacity, that's an extra 2.56TB of space to use.
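If you want to reproduce this kind of comparison on your own pool, here's a rough sketch assuming a pool named tank mounted under /mnt; the dataset names and the 1GB test file are made up to mirror the output above:

Code:
# Two test datasets that differ only in recordsize
zfs create -p -o recordsize=128K tank/test/128k
zfs create -o recordsize=1M tank/test/1M

# Write the same ~1GB of incompressible data into each
dd if=/dev/urandom of=/mnt/tank/test/128k/1GB.bin bs=1M count=1000
dd if=/dev/urandom of=/mnt/tank/test/1M/1GB.bin bs=1M count=1000

# Compare allocated space (USED) against the apparent file size
# (on FreeBSD the apparent-size flag for du is -A instead of --apparent-size)
zfs list tank/test/128k tank/test/1M
du --apparent-size /mnt/tank/test/128k/1GB.bin /mnt/tank/test/1M/1GB.bin

Note that a recordsize above 128K requires the large_blocks pool feature.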
 

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17

Thanks a lot for the comprehensive example, spreadsheet and remarks. It is helpful and I will definitely test writing some large files to check the actual space they take up in datasets set to a 1M recordsize.

Meanwhile, in the GUI, should the space availability number change after changing a dataset's recordsize? In my case it did not:

[Attached screenshot: Screen Shot 2018-03-18 at 16.25.52.png]


P.S. My confusion in terminology comes from what I see in the FreeNAS GUI as well. While referring to @cyberjock's FreeNAS Guide 9.10, I was boldly sure that by creating a pool/volume the GUI formats a vdev (in RAIDZ1/2/3 etc.) and that datasets go inside the vdev. Apparently not.
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
Meanwhile, in the GUI, should the space availability number change after changing a dataset's recordsize? In my case it did not:

[Attached screenshot: Screen Shot 2018-03-18 at 16.25.52.png]

No, it's not quite that simple. A dataset can have files of multiple recordsizes at once. When you change the recordsize of a dataset, it doesn't change the existing data on the dataset; it only changes how new incoming files will be written. And you could then change it again later. Either way, the ZFS developers decided to always do the space/free calculation assuming a 128K recordsize.
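A minimal sketch of what that means in practice (the dataset and file paths below are hypothetical): because only newly written blocks use the new recordsize, existing files have to be rewritten to pick it up, for example by copying them in place.

Code:
# Change the recordsize for future writes on this dataset
zfs set recordsize=1M tank/media

# An existing file keeps its old block size until it is rewritten
cp /mnt/tank/media/big.iso /mnt/tank/media/big.iso.tmp
mv /mnt/tank/media/big.iso.tmp /mnt/tank/media/big.iso

Keep in mind that any snapshots still reference the old blocks, so the space saving only shows up once those snapshots are destroyed.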

P.S. My confusion in terminology comes from what I see in the FreeNAS GUI as well. While referring to @cyberjock's FreeNAS Guide 9.10, I was boldly sure that by creating a pool/volume the GUI formats a vdev (in RAIDZ1/2/3 etc.) and that datasets go inside the vdev. Apparently not.


At the base are your disks. Disks are assembled into vdevs, and vdevs are merged together into a zpool. Datasets are virtual slices of the zpool.
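A minimal command-line sketch of that hierarchy; the pool, disk, and dataset names are hypothetical, and the FreeNAS GUI does the equivalent for you when you create a volume:

Code:
# Six disks assembled into one RAIDZ2 vdev, which makes up the pool "tank"
zpool create tank raidz2 da0 da1 da2 da3 da4 da5

# Datasets are carved out of the pool, not out of an individual vdev
zfs create tank/media
zfs create tank/backups

# The disk -> vdev -> pool layout is visible with:
zpool status tank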
 
Last edited:

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17
Thanks again.

But if you stick with 128 KiB blocks and ashift=12 then you will always have some amount of wasted space unless you are using 6 or 18 disks in your RAIDZ2. With 6 and 18 disks the alignment padding and allocation overhead are zero, and the only overhead comes from the reserved space for metadata, which will always, always be there.

What size of reserved space should be expected in a 6-disk RAIDZ2 configuration?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Can you define "reserved space", please? Because it can be a lot of things.
 

bbox

Dabbler
Joined
Mar 29, 2016
Messages
17
That's why I was quoting @SirMaster's post. I can only speculate.
My main concern is that I see 7 TiB of available space in a 6x2TB disk RAIDZ2 configuration. That is 7.696 TB, which is more than 300 GB short of the expected 8 TB, given that the overhead is supposed to be zero in this particular situation. Is that the mentioned "reserved space", or does even the most optimal configuration have some wasted space, like they say in this post?
Btw, I see slightly different numbers in the shell (7.1 TiB).

Also, a freshly formatted pool w/o any shares is already filled with 701 MiB of some kind of data, and I saw it gradually moving up. Is that metadata?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Oh sorry, I didn't see that it wasn't my post in the quote.

So yeah, the reserved space for metadata is 1/64 of the total space, so about 1.6%. That's the case whatever your configuration is; as @SirMaster said, it'll always be there.
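As a rough worked example of that figure (the ~8 TB of data space for a 6x2TB RAIDZ2 is an assumption, before any other overhead):

Code:
echo "scale=6; 100/64" | bc   # => 1.562500  (% of the pool reserved)
echo "scale=3; 8/64" | bc     # => .125      (TB, i.e. roughly 125 GB on ~8 TB of data space)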

Also, a freshly formatted pool w/o any shares is already filled with 701 MiB of some kind of data, and I saw it gradually moving up. Is that metadata?

I guess it's the system dataset plus the snapshots you're seeing here, but it's only a guess; I don't have enough info to be more precise/sure.
 
Last edited:

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
While searching for the metadata/CoW overhead numbers to understand this 1/32 and 1/64 mess, I found this post: https://forums.freenas.org/index.ph...act-checksum-size-overhead.28187/#post-183802 and it's you who told me there's 1/64 of the space reserved for metadata, so now I'm lost.

Now I also wonder why you can't delete files if you fill your pool to 100%... because if there's reserved space for CoW, you should be able to do it, no?
 

SirMaster

Patron
Joined
Mar 19, 2014
Messages
241
While searching for the metadata/CoW overhead numbers to understand this 1/32 and 1/64 mess, I found this post: https://forums.freenas.org/index.ph...act-checksum-size-overhead.28187/#post-183802 and it's you who told me there's 1/64 of the space reserved for metadata, so now I'm lost.

Now I also wonder why you can't delete files if you fill your pool to 100%... because if there's reserved space for CoW, you should be able to do it, no?

Actually I meant to link this line:
https://github.com/freebsd/freebsd/...opensolaris/uts/common/fs/zfs/spa_misc.c#L397

It has a longer description of the setting.

It used to be 1/64, and was increased to 1/32.

You can see the commit here when it was changed:
https://github.com/freebsd/freebsd/...bfebce2#diff-87835b08b398201fc148599fb1cba189

I realize this shows it was changed before my post mentioning 1/64 that you are referencing. This just shows the change in the FreeBSD codebase though, not when it actually shipped in a stable FreeBSD release or when the code was pulled into FreeNAS and released.

Also, I know that ZFS on Linux, which I use myself, was still at 1/64 back in 2015 when I wrote that other post, but it got the code change upstream from Illumos to switch to 1/32 sometime later.

This change should have made deleting files from a 100% full pool much more likely to succeed. I ran into a 100% full pool recently and was able to delete some files just fine FWIW.

Also note that this spa_slop_shift value is user configurable. The default is currently 5, as in 1/2^5 (= 1/32), but you could change it to 6, so 1/2^6 (= 1/64), and this will change your pool's capacity accordingly.
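For reference, a sketch of how one can inspect and change it; the exact sysctl and module-parameter names below are my assumption for FreeBSD/FreeNAS and ZFS on Linux respectively, so check your platform first:

Code:
# FreeBSD / FreeNAS (assumed sysctl name)
sysctl vfs.zfs.spa_slop_shift        # show current value (default 5 => 1/32 reserved)
sysctl vfs.zfs.spa_slop_shift=6      # reserve 1/64 instead

# ZFS on Linux (assumed module parameter path)
cat /sys/module/zfs/parameters/spa_slop_shift
echo 6 > /sys/module/zfs/parameters/spa_slop_shift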
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
Ok, now I understand, thanks ;)

I'll add changing the value and the name to the calculator's todo list.
 