Largeblocks?

mattlach · Aug 26, 2015

Hey all,

Lately I have been hearing about an ZFS option that can be set on the dataset level that tells ZFS to use larger 1MB blocks.

I have been told it can improve performance significantly when dealing with large files (but worsen it when dealing with small files).

I have googled the crap out of this without finding any further information.

Is this feature implemented in the FreeBSD port of OpenZFS? If so, can we set it in FreeNAS? (I'd imagine we'd use the "zfs set" command from the CLI)

Are the performance improvements really worth writing home about?

Among other things, my media folder is stored on my FreeNAS server. It contains many multi GB video files, and I'm wondering if this will help improve the performance when playing them on a networked frontend via NFS.

Thanks,
Matt

mattlach · Aug 26, 2015

Well, ZFS set suggests it is absent :(

Code:

The following properties are supported:

        PROPERTY       EDIT  INHERIT   VALUES

        available        NO       NO   <size>
        clones           NO       NO   <dataset>[,...]
        compressratio    NO       NO   <1.00x or higher if compressed>
        creation         NO       NO   <date>
        defer_destroy    NO       NO   yes | no
        logicalreferenced  NO       NO   <size>
        logicalused      NO       NO   <size>
        mounted          NO       NO   yes | no
        origin           NO       NO   <snapshot>
        refcompressratio  NO       NO   <1.00x or higher if compressed>
        referenced       NO       NO   <size>
        type             NO       NO   filesystem | volume | snapshot | bookmark
        used             NO       NO   <size>
        usedbychildren   NO       NO   <size>
        usedbydataset    NO       NO   <size>
        usedbyrefreservation  NO       NO   <size>
        usedbysnapshots  NO       NO   <size>
        userrefs         NO       NO   <count>
        written          NO       NO   <size>
        aclinherit      YES      YES   discard | noallow | restricted | passthrough | passthrough-x
        aclmode         YES      YES   discard | groupmask | passthrough | restricted
        atime           YES      YES   on | off
        canmount        YES       NO   on | off | noauto
        casesensitivity  NO      YES   sensitive | insensitive | mixed
        checksum        YES      YES   on | off | fletcher2 | fletcher4 | sha256
        compression     YES      YES   on | off | lzjb | gzip | gzip-[1-9] | zle | lz4
        copies          YES      YES   1 | 2 | 3
        dedup           YES      YES   on | off | verify | sha256[,verify]
        devices         YES      YES   on | off
        exec            YES      YES   on | off
        filesystem_count YES       NO   <count>
        filesystem_limit YES       NO   <count> | none
        jailed          YES      YES   on | off
        logbias         YES      YES   latency | throughput
        mlslabel        YES      YES   <sensitivity label>
        mountpoint      YES      YES   <path> | legacy | none
        nbmand          YES      YES   on | off
        normalization    NO      YES   none | formC | formD | formKC | formKD
        primarycache    YES      YES   all | none | metadata
        quota           YES       NO   <size> | none
        readonly        YES      YES   on | off
        recordsize      YES      YES   512 to 1M, power of 2
        redundant_metadata YES      YES   all | most
        refquota        YES       NO   <size> | none
        refreservation  YES       NO   <size> | none
        reservation     YES       NO   <size> | none
        secondarycache  YES      YES   all | none | metadata
        setuid          YES      YES   on | off
        sharenfs        YES      YES   on | off | share(1M) options
        sharesmb        YES      YES   on | off | sharemgr(1M) options
        snapdir         YES      YES   hidden | visible
        snapshot_count  YES       NO   <count>
        snapshot_limit  YES       NO   <count> | none
        sync            YES      YES   standard | always | disabled
        utf8only         NO      YES   on | off
        version         YES       NO   1 | 2 | 3 | 4 | 5 | current
        volblocksize     NO      YES   512 to 128k, power of 2
        volmode         YES      YES   default | geom | dev | none
        volsize         YES       NO   <size>
        vscan           YES      YES   on | off
        xattr           YES      YES   on | off
        userused@...     NO       NO   <size>
        groupused@...    NO       NO   <size>
        userquota@...   YES       NO   <size> | none
        groupquota@...  YES       NO   <size> | none
        written@<snap>   NO       NO   <size>

dlavigne · Aug 26, 2015

This can be done at dataset creation time (http://doc.freenas.org/9.3/freenas_storage.html#create-dataset) and zvol creation time (http://doc.freenas.org/9.3/freenas_storage.html#create-zvol).

mattlach · Aug 26, 2015

dlavigne said:
This can be done at dataset creation time (http://doc.freenas.org/9.3/freenas_storage.html#create-dataset) and zvol creation time (http://doc.freenas.org/9.3/freenas_storage.html#create-zvol).

Ahh, thank you, is that the "recordsize" setting?

The manual speaks to matching it to databases. I will have to do some more research to find out if what has been suggested to me about using a larger block size will really be beneficial to me.

mav@ · Aug 28, 2015

Large blocks don't really affect very small files, since everything that is smaller then one block is stored as-is without any rounding. On the other side files above one block are rounded up to the full next block size. But if compression is enabled, it should eat the overhead.

Large blocks may improve performance of large accesses, reduce metadata overhead and improve compression. But large blocks may penalize small random access, especially random write, since block in ZFS is a minimal read/write element, and each small client I/O will require at least one full block to be red/written.

mattlach · Aug 28, 2015

mav@ said:
Large blocks don't really affect very small files, since everything that is smaller then one block is stored as-is without any rounding. On the other side files above one block are rounded up to the full next block size. But if compression is enabled, it should eat the overhead.

Large blocks may improve performance of large accesses, reduce metadata overhead and improve compression. But large blocks may penalize small random access, especially random write, since block in ZFS is a minimal read/write element, and each small client I/O will require at least one full block to be red/written.

thank you for that.

Yeah my FreeNAS hosts - among other things - my MythTV DVR files, so lots of many GB recorded mpeg2 streams, which are often loaded and played at the same time on different screens.

I read an article that suggested that one of the prime applications of increasing the block size to 1MB was for multiple simultaneously streamed video files, where pre-cache is ineffective.

It did warn about RAM overhead though. Disk overhead is mitigated by compression, but each individual file in RAM will eat up blocksize/2 on average of RAM.

My theory is that if I apply it only for the dataset that contains my DVR recordings it won't be too bad. Sure that dataset contains A LOT of recordings, but only a handful are likely to be in ram at any given time.

cyberjock · Sep 1, 2015

Came out with 9.3.0...

Code:

[root@mini] ~# zpool get all tank
NAME  PROPERTY  VALUE  SOURCE
tank  size  7.25T  -
tank  capacity  53%  -
tank  altroot  /mnt  local
tank  health  ONLINE  -
tank  guid  18052417683864361512  default
tank  version  -  default
tank  bootfs  -  default
tank  delegation  on  default
tank  autoreplace  off  default
tank  cachefile  /data/zfs/zpool.cache  local
tank  failmode  wait  default
tank  listsnapshots  off  default
tank  autoexpand  off  default
tank  dedupditto  0  default
tank  dedupratio  1.00x  -
tank  free  3.34T  -
tank  allocated  3.91T  -
tank  readonly  off  -
tank  comment  -  default
tank  expandsize  -  -
tank  freeing  0  default
tank  fragmentation  1%  -
tank  leaked  0  default
tank  feature@async_destroy  enabled  local
tank  feature@empty_bpobj  active  local
tank  feature@lz4_compress  active  local
tank  feature@multi_vdev_crash_dump  enabled  local
tank  feature@spacemap_histogram  active  local
tank  feature@enabled_txg  active  local
tank  feature@hole_birth  active  local
tank  feature@extensible_dataset  enabled  local
tank  feature@embedded_data  active  local
tank  feature@bookmarks  enabled  local
tank  feature@filesystem_limits  enabled  local
tank  feature@large_blocks  enabled  local <<<<<<<---------

Yep.. minimizes overhead for large files that can use 1MB blocks. Yay! You started using it the second your zpool was upgraded to use the new feature flags with 9.3.0. :)

RAIDTester · Feb 9, 2017

We're on 9.3 and I'm getting this error
How do I get it to create 1024K block sizes?

mav@ · Feb 9, 2017

RAIDTester said:
How do I get it to create 1024K block sizes?

Bigger blocks are available only for datasets. ZVOLs are still limited by 128KB, since bigger usually don't make sense there.

RAIDTester · Feb 9, 2017

mav@ said:
Bigger blocks are available only for datasets. ZVOLs are still limited by 128KB, since bigger usually don't make sense there.

Oh....
I got exponentially better performance on FreeNAS directly, by changing the recordsize on my pool to 1M and wanted to do the same here, since iSCSI to windows is only performing about 10-15% of what it's doing on FreeNAS directly. (This is a 72T test pool of mirrored vdevs)

mav@ · Feb 9, 2017

While large sizes may indeed improve performance for linear operations, for writes, especially random, they cause too big read-modify-write penalty for typically short accesses like iSCSI. Datasets have some kind of special optimizations to minimize that at least for linear access, while ZVOLs are not. While default for ZVOLs 8KB blocks are indeed pretty small and may hurt performance, FreeNAS defaults to 16-32KB, that should be better. If needed, increasing up to 128KB should be possible.

RAIDTester · Feb 10, 2017

mav@ said:
While large sizes may indeed improve performance for linear operations, for writes, especially random, they cause too big read-modify-write penalty for typically short accesses like iSCSI. Datasets have some kind of special optimizations to minimize that at least for linear access, while ZVOLs are not. While default for ZVOLs 8KB blocks are indeed pretty small and may hurt performance, FreeNAS defaults to 16-32KB, that should be better. If needed, increasing up to 128KB should be possible.

I wonder if the 128k max is killing performance. Increases iops by factor of 8 for the same data, thus explaining my performance dropping to 1/8 of on-system.

rs225 · Feb 10, 2017

RAIDTester said:
I wonder if the 128k max is killing performance. Increases iops by factor of 8 for the same data, thus explaining my performance dropping to 1/8 of on-system.

No, a recordsize of only 128k is not going to hurt performance.

In general, I would be very cautious about using the larger recordsizes. If you feel you must, try 256K. At a certain point, the recordsize is simply too big. Most CPUs have only a limited amount of L1 and L2 cache. When the recordsize becomes too large, the CPU is going to be a lot slower when it comes time to calculate checksums and parity (or compression/decompression).

Next is what this does to I/O. The largest I/O operation on SATA is 128K. Once a block goes over that, you lose the 1:1 match of disk I/O to ZFS I/O. With a 1MB block, ZFS has to issue 8 separate SATA I/Os to the disk to read it. At some point, this becomes foolish. (e.g., OpenZFS is working on 16MB block support)

Next, on raidz, you are diluting parity. A 128K block consists of 256 sectors. If 1 sector is bad on one drive, then you must hope that the other drive doesn't just so happen to have any other bad sector in that matching 128 (raidz1 with 3 drives) sector area on the other two drives. As you enlarge the blocksize to 1MB, you now need a perfect 1024 sectors each on the other two drives. At 16MB, you need a perfect 16384 sectors each on the other two drives. As I said, at some point this becomes foolish.

mav@ · Feb 10, 2017

In addition to above, bigger blocks create additional stress on OS memory subsystem. After system run for some time, it may be difficult to allocate 1MB of continuous memory due to high fragmentation.

RAIDTester · Feb 10, 2017

I'm confused now.

10GbE x 2
Separate subnets
Separate switches

Here's what I'm getting in iozone for simple read/write on FreeNAS terminal directly:

Code:

        Children see throughput for  1 initial writers  = 1232264.62 KB/sec
        Parent sees throughput for  1 initial writers   = 1232251.57 KB/sec
        Min throughput per process                      = 1232264.62 KB/sec
        Max throughput per process                      = 1232264.62 KB/sec
        Avg throughput per process                      = 1232264.62 KB/sec
        Min xfer                                        = 209715200.00 KB

        Children see throughput for  1 readers          = 2137824.25 KB/sec
        Parent sees throughput for  1 readers           = 2137751.54 KB/sec
        Min throughput per process                      = 2137824.25 KB/sec
        Max throughput per process                      = 2137824.25 KB/sec
        Avg throughput per process                      = 2137824.25 KB/sec
        Min xfer                                        = 209715200.00 KB

and over iSCSI with 128k blocks formatted to NTFS (64k)

Code:

        Children see throughput for  1 initial writers  =  802828.69 KB/sec
        Parent sees throughput for  1 initial writers   =  802719.39 KB/sec
        Min throughput per process                      =  802828.69 KB/sec
        Max throughput per process                      =  802828.69 KB/sec
        Avg throughput per process                      =  802828.69 KB/sec
        Min xfer                                        = 209715200.00 KB

        Children see throughput for  1 readers          =  314123.22 KB/sec
        Parent sees throughput for  1 readers           =  314102.20 KB/sec
        Min throughput per process                      =  314123.22 KB/sec
        Max throughput per process                      =  314123.22 KB/sec
        Avg throughput per process                      =  314123.22 KB/sec
        Min xfer                                        = 209715200.00 KB

Even if iSCSI causes losses, why is read speed slower than write speed? It should be at least as fast, no?
Am I missing something? Possibly misconfigured something?

mav@ · Feb 10, 2017

How do you have your "10GbE x 2" configured? If it is multipath, some configurations sending consequential requests via different links may confuse ZFS read prefetcher, hurting performance. Though I am not sure it is a case here.

rs225 · Feb 10, 2017

Have you tested iSCSI with 64k blocksize?

RAIDTester · Feb 10, 2017

mav@ said:
How do you have your "10GbE x 2" configured? If it is multipath, some configurations sending consequential requests via different links may confuse ZFS read prefetcher, hurting performance. Though I am not sure it is a case here.

You may be onto something. I'm getting a slight boost in read speed without MPIO.

Code:

     Children see throughput for  1 readers          =  544140.12 KB/sec
     Parent sees throughput for  1 readers           =  544047.72 KB/sec
     Min throughput per process                      =  544140.12 KB/sec
     Max throughput per process                      =  544140.12 KB/sec
     Avg throughput per process                      =  544140.12 KB/sec
     Min xfer                                        = 157286400.00 KB

What concerns me (same as it had before setting recordsize to 1M), is the disks are not busy! They aren't even trying.
Notice the % busy - barely even hitting 50%

Code:

#gstat
L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    3    166    166  21210    7.7      0      0    0.0   47.3| da31
    0    149    149  19038    5.5      0      0    0.0   33.2| da32
    0    156    156  19932    5.5      0      0    0.0   32.7| da33
    0    152    152  19421    5.2      0      0    0.0   30.2| da34
    0    159    159  20316    6.1      0      0    0.0   35.9| da35
    0    173    173  22105    4.9      0      0    0.0   31.1| da36
    0    139    139  17760    5.6      0      0    0.0   29.1| da37
    2    150    150  19166    6.2      0      0    0.0   35.0| da38
    1    152    152  19421    5.4      0      0    0.0   31.2| da39
    3    153    153  19549    5.2      0      0    0.0   32.4| da40

rs225 said:
Have you tested iSCSI with 64k blocksize?

Yes, tried 64 to "align" with NTFS - worse performance.

mav@ · Feb 10, 2017

To saturate all disks ZFS may need significant read-ahead. Take a look on vfs.zfs.zfetch.max_distance sysctl.

RAIDTester · Feb 10, 2017

Just left office. Will check tomorrow. What do you suggest this be set at?

Important Announcement for the TrueNAS Community.

Largeblocks?

Patron

Patron

dlavigne

Guest

Patron

iXsystems

Patron

Inactive Account

Dabbler

iXsystems

Dabbler

iXsystems

Dabbler

Guru

iXsystems

Dabbler

iXsystems

Guru

Dabbler

iXsystems

Dabbler

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Largeblocks?"

Similar threads