ZFS layout - best write performance?

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The Calomel write-up should be reflected in the forums here. It is much more useful than what I found here. Happy to send it to you.

We do not plagiarize documents from other sources here. You are of course welcome to compose your own. Also, since I'm aware of at least a dozen discussions of various sets of tunables for various situations and use cases here, I find it puzzling that you would not have stumbled across at least one of those.
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
We do not plagiarize documents from other sources here. You are of course welcome to compose your own. Also, since I'm aware of at least a dozen discussions of various sets of tunables for various situations and use cases here, I find it puzzling that you would not have stumbled across at least one of those.
You are of course welcome to link one to me... Could be useful on this thread...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You are of course welcome to link one to me... Could be useful on this thread...

Didn't seem too hard to find a bunch. I only gave up because, y'know, *tedious*.
  1. https://www.truenas.com/community/threads/amd-e-350-thread-now-in-new-forum.27/post-32074
  2. https://www.truenas.com/community/threads/browsing-directories-slow.5338/post-31750
  3. https://www.truenas.com/community/threads/zfs-network-tunables-hp-n40l.10610/
  4. https://www.truenas.com/community/threads/extremely-slow-ssh-scp-over-wan.21985/
  5. https://www.truenas.com/community/threads/lagg-with-lacp.19466/post-111121
  6. https://www.truenas.com/community/t...x-and-vfs-zfs-write_limit_shift-please.10902/
  7. https://www.truenas.com/community/t...ally-and-gaps-in-report-info.16782/post-94256
  8. https://www.truenas.com/community/threads/freenas-9-1-performance.13506/post-64981
  9. https://truenas.com/community/threads/10-gig-networking-primer.25749
  11. https://truenas.com/community/threads/10g-network-tcp-tuning.45768/
  12. https://truenas.com/community/threads/10gb-tunables-on-9-10.42548
  13. https://truenas.com/community/threads/10gbe-performance-iperf-good-data-copy-slow.72239
  14. https://truenas.com/community/threads/10gbe-performance-issue-in-freenas-11.56580
  15. https://truenas.com/community/threads/12-tcp-tuning-issues.88164/
  16. https://truenas.com/community/threads/8-0-4-optimizing-buffer-settings.6216/
  17. https://truenas.com/community/threa...freenas-8-2-the-system-started-to-freeze.9028
  18. https://truenas.com/community/threads/amd-e-350-thread-now-in-new-forum.27
  19. https://truenas.com/community/threads/browsing-directories-slow.5338
  20. https://truenas.com/community/threads/connectx-2-great-write-but-slow-read-speed.58384
  21. https://truenas.com/community/threads/extremely-slow-ssh-scp-over-wan.21985/
  22. https://truenas.com/community/threads/extremely-slow-ssh-scp-transfers.15988
  23. https://truenas.com/community/threads/file-transfers-to-freenas-drop-midway-10gbe.72170
  24. https://truenas.com/community/threads/forum-guidelines.45124
  25. https://truenas.com/community/threads/frage-zu-netzwerk-geschwindigkeit-lwl.57750
  26. https://truenas.com/community/threads/freenas-11-3-10gb-tunables.82435/
  27. https://truenas.com/community/threads/freenas-9-1-performance.13506
  28. https://truenas.com/community/threads/freenas-9-1-slow-read-performance-over-lan.16299/
  29. https://truenas.com/community/threads/freenas-panic-vm_page_free-freeing-busy-page.10126
  30. https://truenas.com/community/threads/freenas-stopped-responding-now-wont-boot.36516/
  31. https://truenas.com/community/threads/fresh-install-slow-transfer-speeds.58610
  32. https://truenas.com/community/threa...x-and-vfs-zfs-write_limit_shift-please.10902/
  33. https://truenas.com/community/threads/high-checksum-error-rate.42376
  34. https://truenas.com/community/threads/how-to-setup-vlans-within-freenas-11-3.81633
  35. https://truenas.com/community/threads/intel-x540t2-wont-run-past-300mbps.71108
  36. https://truenas.com/community/threa...if-share-is-slow-and-something-will-stop.9370
  37. https://truenas.com/community/threads/lagg-with-lacp.19466
  38. https://truenas.com/community/threads/loaders-in-webgui-not-working-in-8-0-3.5429
  39. https://truenas.com/community/threads/locking-up-periodically-and-gaps-in-report-info.16782
  40. https://truenas.com/community/threads/mediocre-10ge-network-performance.22324
  41. https://truenas.com/community/threads/nvme-zfs-pool-over-100gbe.91765
  42. https://truenas.com/community/threads/ottimizzare-intel-x520.100293
  43. https://truenas.com/community/threads/slow-write-speed-on-fast-hardware-smb-cifs-bottleneck.42641
  44. https://truenas.com/community/threa...am-utc-after-upgrade-to-9-3-release-p31.41689
  45. https://truenas.com/community/threads/ssd-cache-drive.8774
  46. https://truenas.com/community/threads/upgrade-recommendations.69981
  47. https://truenas.com/community/threa...freenas-but-fast-to-vm-in-freenas-bhyve.56390
  48. https://truenas.com/community/threads/very-slow-smb-read-speeds-after-truenas-12-upgrade.88312
  49. https://truenas.com/community/threads/yet-another-poor-cifs-performance-thread.12102
  50. https://truenas.com/community/threads/yet-another-zfs-tuning-thread.10140/
  51. https://truenas.com/community/threads/zfs-network-tunables-hp-n40l.10610/
  52. https://truenas.com/community/threads/zfs-speeds-not-consistent.11589/
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
Being more specific: 10G and 40G tuning to improve network performance, if there really is any gain. Truthfully, TrueNAS has done pretty well. The 40G cards from Chelsio need a bit of tuning, but beyond that, out-of-the-box performance showed so little gain from extensive tuning that it became futile.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
If there really is any gain.

There generally is, but network tuning typically favors slower systems. The larger buffer sizes allow more optimized send and receive operations, with fewer interrupts. Faster CPUs get around to servicing the interrupts more quickly.

There was a time in the late '90s and early '00s, when gigabit Ethernet was new, that at least moderate tuning effort had to be spent to get full gigE speeds.

Within just a few years, 10GbE came out, but the processing overhead for 10GbE was so much greater than gigE that CPU reduction strategies such as jumbo packets, TSO, LRO, interrupt coalescing, multiple queues, etc., became major chipset differentiators in the 10G world. Because CPU speeds did not increase dramatically over the next decade, these remained major factors, and companies like Intel and Chelsio, who were cranking out performance-optimized chipsets, made many breakthroughs in design. The second and third generations of 10GbE chipsets are much more sophisticated than even the best of the gigE chipsets, but this comes at a complexity cost. Some of them need tuning for queues and interrupts. Virtually all of them need to have their buffer sizes enlarged, because both FreeBSD and Linux ship defaults optimized for gigE-class networks.

How MUCH of a benefit you get from such tuning depends greatly on many factors, including which ethernet chipsets we're discussing, and how fast your CPU is. If you have a really fast, hot CPU and a system that is lightly loaded, then you are likely to see less of an improvement from tunables.
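To make the "enlarge the buffer sizes" advice concrete, here is a sketch of the usual FreeBSD knobs involved; the 16MB values are illustrative starting points for 10GbE, not a recommendation, and should be validated on your own hardware:

```shell
# Illustrative FreeBSD socket buffer tunables for 10GbE links.
# Values are starting points only; benchmark before and after.

# Raise the hard cap on socket buffer size (16MB shown here).
sysctl kern.ipc.maxsockbuf=16777216

# Allow TCP send/receive buffer autotuning to grow to match.
sysctl net.inet.tcp.sendbuf_max=16777216
sysctl net.inet.tcp.recvbuf_max=16777216
```

On TrueNAS these would normally be set as tunables in the GUI rather than run by hand, so they persist across reboots.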
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
I am all ears for direction on 10G/40G Chelsio tuning and 10G Intel. We use a lot of Chelsio; it seems to perform a bit better. NFSv3. We are not really seeing bottlenecks, and we never see the processor get too spicy. E5-2667 v3 (3.7GHz). Very specific direction would be most useful.

*haven't found much on the forums yet, but looking.
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Hello,

Thank you for the thorough responses.
Fact of the matter is, yes, for now, I am running on a 10GbE card, but the main card I used in the past, and will use again once I get the RMA on my 25GbE switch, is a Mellanox 100GbE card. And my PCs have 25GbE cards, Intel XX710, so still Intel.
But it should run better than 6.5Gb/s. Tunables for 25GbE or 100GbE are a bit hard to find :)

I will switch the Intel cards to ATTO N322 and see if, without any tunables, it runs better. Or is it an SMB/Windows client limitation?
I tried flow control, jumbo frames (on the switch as well), etc.; none seem to make any difference on the write speed.

Thank you!
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
Never saw a response to the SYNC question??? I just tested with all the same hardware you identified (Intel XXV710), and with sync off I can get 19.778Gb/s; with it enabled (standard) I get about 800MB/s to storage. Just an FYI.
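For anyone wanting to reproduce that comparison, sync is a per-dataset ZFS property; `sync=disabled` trades crash consistency for speed, so only use it where losing the last few seconds of writes is acceptable. The pool/dataset name below is a placeholder:

```shell
# Check the current sync setting on a dataset (tank/share is a placeholder).
zfs get sync tank/share

# Disable sync writes: fast, but in-flight data is lost on a crash/power cut.
zfs set sync=disabled tank/share

# Restore the default behaviour (honor the client's sync requests).
zfs set sync=standard tank/share
```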
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Oh my God. That seems to solve the write issue.

Seems that with sync off, write speed is 10gbs full on; it saturated the network card.

Read speed is still 300MB/s.

Would the sync feature somehow affect the transfer? We are behind two UPS units, so the power-off risk is manageable.

Can the read be improved? The pool is a 16-bay stripe.

I won't go into macOS Monterey performance; Windows is perfect :))

Thank you!!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Seems that with sync off, write speed is 10gbs full on; it saturated the network card.

Hopefully this means "10Gbit/s". This was explained back in post #7 in this thread: writes will generally go as fast as the write cache or writing to the pool allows.

Can the read be improved? The pool is a 16-bay stripe.

Read speeds can be improved by:

1) Adding RAM, which provides more ARC, which is your read cache

2) Selecting tunables that increase the prefetch, which really only works for sequential workloads

3) Increasing the parallelism of the clients, which is incredibly effective at extracting more read speed, but only in aggregate
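As a sketch of option 2, the OpenZFS prefetcher ("zfetch") can be told to read further ahead for sequential workloads. The exact knob names vary slightly by version, and the 64MB value is purely an illustration, not a recommendation:

```shell
# FreeBSD/TrueNAS Core: let sequential prefetch run further ahead.
# 64MB shown purely as an illustration; benchmark your own workload.
sysctl vfs.zfs.zfetch.max_distance=67108864

# TrueNAS SCALE (Linux) exposes the equivalent knob as a module parameter:
echo 67108864 > /sys/module/zfs/parameters/zfetch_max_distance
```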
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Hopefully this means "10Gbit/s". This was explained back in post #7 in this thread: writes will generally go as fast as the write cache or writing to the pool allows.



Read speeds can be improved by:

1) Adding RAM, which provides more ARC, which is your read cache

2) Selecting tunables that increase the prefetch, which really only works for sequential workloads

3) Increasing the parallelism of the clients, which is incredibly effective at extracting more read speed, but only in aggregate
Hello,

Thank you so much! Yes, its Gbit/s:)

For #1, I already have 256GB RAM installed. Does that need to be configured somehow? I also have a 2TB NVMe drive that is not currently used in the pool as anything.

2. I get that, but indeed, the reads will not always be sequential.

3. Here I do not get what you mean. How can there be parallelism if I only have one client at a time?

I am so sorry, it's a bit of a shock having such a difference between read and write speed. I know you explained it before, but coming from multiple RAID5 storage systems, directly attached via Thunderbolt, SAS, or internal PCI SAS 16-bay HDDs, the difference is enormous. A G-RAID Shuttle XL, 8 bays, Thunderbolt 2, gives 800MB/s read and write, for example. Another example is my Supermicro machine with an LSI 8885, 16-bay RAID5, where I get 850MB/s read and 900MB/s write.

Thank you again so much!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
For #1, if you have 256GB, you can definitely experiment with adding the flash as L2ARC. If you're on SCALE, with Linux's sucky memory management, you will need to tune the ARC size up; Core's FreeBSD base does that much better automatically. You will likely also need to boost both

vfs.zfs.l2arc.write_boost: 40000000
vfs.zfs.l2arc.write_max: 10000000

because ZFS sets these to really weedy values by default, I think 8MB or something like that, which might be appropriate for smallish 32GB RAM systems.
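Attaching the spare NVMe as L2ARC is a one-liner; the pool and device names below are placeholders for whatever your system actually calls them:

```shell
# Add the NVMe device as a cache (L2ARC) vdev; "tank" and "nvd0" are
# placeholders for your actual pool and device names.
zpool add tank cache nvd0

# Confirm it shows up under the "cache" section and watch it warm up.
zpool status tank
zpool iostat -v tank 5
```

Note that L2ARC is safe to experiment with: losing a cache device never loses pool data, and it can be removed again with `zpool remove`.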

For #2, ARC and L2ARC are the primary mitigating factors for nonsequential pool access. If it doesn't work, then too bad, so sad. Not to put too fine a point on it. Sorry. It is what it is.

For #3, the point is that ZFS is really designed for multiuser fileservers and isn't particularly optimized towards the sort of massive ingest operation you seem to have going. When you have a large number of individual vdevs (or just a stripe), the large I/O operations ZFS tends to favor, with record sizes up to 1MB, mean that you do NOT want the sort of interleaved smaller I/O commonly done by hardware RAID controllers, which causes what Avi refers to as "seek binding" of the actuators in multiple drive units; instead, ZFS favors each drive being able to fulfill independent I/O operations. However, while this is GREAT for parallelism in I/O systems, it is less awesome for large single-stream reads. It doesn't PREVENT the I/O system from doing the large amounts of readahead that would be necessary to read at really high speeds, but it probably will require some tuning to hint to ZFS that you want that sort of behaviour.

Another example is my Supermicro machine with an LSI 8885, 16-bay RAID5, where I get 850MB/s read and 900MB/s write.

Yes. The flip side of that coin, though, is that the stripe size you set up at RAID array creation, maybe 64KB or 256KB or 512KB, means that multiple drives have to get involved in order to read the same 1MB block that ZFS reads from a single drive. Your RAID array then sucks bigtime if multiple things generate I/O at the same time, because the hardware RAID array is mostly all about "awesome benchmarks" and not "maximal usefulness".

Unfortunately I do not have much for you in the way of tuning hints for your workload. There are a number of people who have built systems dedicated to video editing, a workload similar to your needs, and you may wish to refer to threads discussing that to see what sort of tuning options might apply.
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
Oh my God. That seems to solve the write issue.

Seems that with sync off, write speed is 10gbs full on; it saturated the network card.

Read speed is still 300MB/s.

Would the sync feature somehow affect the transfer? We are behind two UPS units, so the power-off risk is manageable.

Can the read be improved? The pool is a 16-bay stripe.

I won't go into macOS Monterey performance; Windows is perfect :))

Thank you!!
RAM is your friend there. Not just more, but the fastest speed of RAM your motherboard supports, and then "more" is better. What controller are you using specifically, and what firmware? LSI produces a solid line of controllers that TrueNAS/FreeBSD supports pretty well. As much as Intel does well, I find that Chelsio NICs perform a bit better overall. Almost all switches (Quanta, Cisco, Intel, you name it) appear to perform very close to one another, so the switch is likely not the bottleneck.
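One quick way to answer the controller/firmware question, assuming an LSI SAS2-generation HBA (use `sas3flash` for SAS3-generation cards):

```shell
# List LSI/Avago SAS2 controllers with firmware and BIOS versions
# (sas3flash is the equivalent for SAS3-generation cards).
sas2flash -list

# On TrueNAS Core the mps/mpr driver also logs firmware details at boot:
dmesg | grep -i mps
```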
 