ZFS layout - best write performance?

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Hello,
I have a TrueNAS system built with 256GB RAM and a 100GbE fiber card, with 16 bays of Toshiba 7200 RPM drives. I also have a 2TB NVMe drive that can be used as cache.

I need it to work as fast as technically possible, with write performance as high as possible; read performance can suffer a bit, but maybe we can find a compromise.
The plan is to ingest 6TB per day into the storage, as fast as possible, from SSD RAID storage on Thunderbolt 3 on a Windows client.

I have experimented, and I now have 8x mirror vdevs that reach 780MB/s write and read performance. Is there a way to make it run faster? My PC clients have 25GbE cards.

Thanks!
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I once wrote here about a severely imbalanced system like this and described it as trying to water a potted plant with a firehose.

That is still appropriate.

Write speeds will be limited by the size of the transaction groups, the speed of the disk I/O (number of vdevs), and the fragmentation on the pool, which in turn derives from how full the pool is. If you can keep your storage percentage relatively low, you will get better write speeds.

Read speeds are addressed by the number of vdevs, the amount of ARC/L2ARC, and the workload being small enough that it can be reasonably cached with a high hit rate.
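If you want to keep an eye on those factors, a couple of read-only commands from the shell will show them (the pool name "tank" is a placeholder):

# Pool capacity and fragmentation; writes slow down as these climb
zpool list -o name,size,allocated,free,fragmentation,capacity tank

# Per-vdev bandwidth and IOPS, sampled every 5 seconds, while a copy runs
zpool iostat -v tank 5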
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The plan is to ingest 6TB per day into the storage, as fast as possible, from SSD RAID storage on Thunderbolt 3 on a Windows client.

Can you describe your workload in a little more detail? I'll ask some questions to guide things along.

How large are the files being copied, and what protocol is being used? (I assume SMB for the latter, given the Windows clients.)

Will the writes happen separately from the reads, or are they happening at the same time?

Will each client PC be writing during its own window, or will they all be writing simultaneously?

Are you ingesting "6TB of new data per day" and never deleting the old data - or is it data that rotates in and out, such as nightly backups that are retained for a given period?

Depending on the read/write behaviour, changing the pool configuration to a multiple-vdev Z2 might help, but only if the workload trends more towards "a single device is writing at a time" and you're reading the data in a similarly "large, single-client, and sequential" manner.
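Purely as an illustration of that layout, two 8-wide RAIDZ2 vdevs built from the command line would look roughly like this (on TrueNAS you would normally build the pool through the web UI, and the daN device names are placeholders):

# Two 8-wide RAIDZ2 vdevs: two-disk redundancy per vdev,
# and large sequential writes stripe across both vdevs.
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
    raidz2 da8 da9 da10 da11 da12 da13 da14 da15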
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Hello, and I apologise for the late reply.

I will try to be as exact as possible.

We are dealing with a nightly ingest of large raw video files (from 25GB to 200GB) from Arri cameras, all in 4K resolution. We ingest from a Windows workstation over Thunderbolt 3, from a 4x M.2 RAID 0 array that manages about 2.5GB/s when copying.
My plan is to make the transfer from those drives to our TrueNAS system as fast as the hardware allows, via SMB or another Windows-compatible protocol.

Our TrueNAS system is a 16-bay Supermicro chassis with 16x 12TB Toshiba drives (MFG: 20A0A1WQFDUG), a 2TB M.2 SSD intended as a cache (not configured yet), 256GB RAM, and a 100GbE card. We go through a 25GbE switch with a 100GbE uplink to our workstations, which have ATTO 25GbE cards.

We will be copying only from that workstation during the night, while processing the footage from the M.2 SSD RAID as we copy. We have about 14 hours to copy the footage to the storage and process it, so ideally the copying should go as fast as possible.
The copy is just selecting the main daily folder (DAY01, etc.) and copying the whole folder, with subfolders and files, to the storage.

Of course, if possible, after the copy is done we have two steps:
1. First, we process (render) the video conversions from the storage to the internal SSD of the PC workstation, in order to release the external SSD RAID for reuse.
2. After the render is finished, we have a Mac that pulls about 600MB/s from that storage while writing to LTO.

So, theoretically, we do a sequential copy and a sequential read, as we do not jump from something that was written today to something that was written two days ago. We play and render out, from start to finish and in date-created order, all the files that are in a timeline.

The data remains on the storage for a few days while it is being written to LTO, and then we delete it.

Given all of the above, what would be a good pool configuration? Can I get close to 2GB/s with the 16 drives?

Thank you,
Sebastian
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
Start your copy, then on the TrueNAS go to the shell/command prompt, run gstat -f da, and monitor it for a bit. See how busy your 16 drives are. Check the NIC and CPU temperatures; speed will drop as it all warms up. A stripe would be fastest, but with no safety: a single drive dies and that is it. It could be worth the risk for a few days; increase the SMART test frequency to keep failure at bay. Next is a mirror, which gives redundancy at the cost of 50% of the capacity. The maximum performance to expect from 16 drives, assuming SAS2/SAS3 (6G/12G) and 100% ideal conditions over the wire, is likely 13-14Gb/s, and more realistically about 9-11Gb/s sustained, or about 1100-1300MB/s. Peaks are likely not above 19Gb/s. That assumes Chelsio/Intel NICs.
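A minimal way to watch the disks during a copy might look like this (the filter assumes the pool disks show up as daN devices; adjust the pattern to your system):

# Per-disk busy %, queue depth, and throughput, refreshed every second
gstat -f '^da[0-9]+$' -I 1s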

That is for tuned and ideal conditions, not accounting for heat, turn-up, and errors over the wire. NFS? iSCSI? You also likely want to run with sync disabled, but again, that is a safety issue.

If the storage is more of a staging area for the footage, a stripe with sync disabled is going to be fastest, but it is a huge risk.
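If you do go that route, sync is a per-dataset property; a minimal sketch, assuming a pool called tank with an ingest dataset:

# Check the current setting, then disable synchronous writes for the ingest dataset.
# Data still in flight can be lost on a crash or power failure, so only do this
# while the footage also still exists on the source SSDs.
zfs get sync tank/ingest
zfs set sync=disabled tank/ingest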

The LTO reads at 600MB/s should be pretty easy for the Mac.
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Thank you!
I am currently running a stripe with all disks. It seems to top out at 800MB/s writing, but reads are capped at 300MB/s. It should have been the other way around, at least, no? :)

Should SMB multichannel be off?
iSCSI sounds good, as it should be the fastest, but I won't be able to access it on the Mac. Or is there some way to share it?

NFS on Windows was capped at 200MB/s...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
It seems to top out at 800MB/s writing, but reads are capped at 300MB/s. It should have been the other way around, at least, no? :)

No. Writes are cached directly in main memory and then staged out to disk, so write speed will be the lower of the speed at which you can get data into main memory and the speed of your pool. Reads are necessarily limited because the NAS has to go out to disk to fetch your data, and that takes more time.
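Very roughly, incoming writes accumulate as dirty data in RAM and are flushed to the pool in transaction groups; on a FreeBSD-based system you can see the relevant limits with sysctl (exact names can vary a little between versions, so treat these as examples):

# Maximum amount of dirty (not-yet-written) data ZFS will hold in RAM
sysctl vfs.zfs.dirty_data_max

# Target interval, in seconds, between transaction group commits
sysctl vfs.zfs.txg.timeout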
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Thanks!

So that basically means that the system, in just one single stripe, has a maximum disk-only speed of around 200MB/s...
Isn't that kind of low for a 16-drive aggregate straight read from disk?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Isn't that kind of low for a 16-drive aggregate straight read from disk?

So think about this for a moment. You're reading a file. It could be a large sequential file, the best possible case. But let's explore other options.

If you are reading lots of medium-size (let's say 2MB) sequential files, the problem is that your file is stored in two 1MB ZFS blocks. ZFS can guess that there's a good chance that, once you've read the first 1MB block, you might read the second 1MB block, so it speculatively reads that second block and can then serve it from ARC when you try to read it. That's great. But eventually you hit the end of that second block. Now ZFS cannot predict what you're going to read next, so it has to wait for an open() syscall to be made, and then seek through the directory structure and the inode to the block list for that new file. That seeking causes a delay and a noticeable pause in your transfer throughput.

Now let's go back to your "large sequential" file. Let's say it's a petabyte in size. ZFS sees you sequentially reading through the file, so it prefetches blocks to optimize. But how many blocks should it prefetch? If you've got 1TB of ARC, does it prefetch until it runs out of RAM? Or should there be some cap on it so that it doesn't go wild and wipe out everything in ARC? What happens in the case where you really only wanted to read the first 1MB of that file? You end up with a lot of worthless prefetch I/O.

See, these are the two fundamental issues. ZFS can sometimes have an idea about what it would be handy to prefetch (the first case we discussed), but is generally bounded by the end of file and not knowing what the next thing the user will request is. In other cases, it could theoretically prefetch a huge amount, but this could impact performance negatively in several ways.

Basically, with writes, ZFS doesn't have to guess at what will happen next. It knows, because it just has to look at the write stream. With reads, however, it is ambiguous. How far do you push that prefetch, if you even know what to prefetch? If you don't know, then you have to wait for some clarity. That tends to kill single-client read performance.

Now I should point out that I carefully selected the words in that last sentence. When you have sixteen vdevs, you will find that you can be serving lots of clients concurrently. You might well get your 800MBytes/sec being read, or even more in aggregate. And of course, this being ZFS, there is some tuning possible to optimize towards one use case or another.
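For what it's worth, the prefetcher is visible and tunable via sysctl on the FreeBSD side; the exact names differ between versions, so grepping for them is the safe way to find what your install exposes:

# Whether file-level prefetch is enabled, plus related knobs
sysctl -a | grep -i 'zfs.*prefetch'

# How far ahead (in bytes) the prefetcher may read per stream
sysctl -a | grep -i 'zfetch'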
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
Did you run gstat while running your copy? I would try to narrow down the bottleneck. This seems strange; 800MB/s tells me that a 10G cap is sitting somewhere. Maybe the NIC? iperf? Maybe drive I/O? fio or winsat?
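For the network leg specifically, iperf3 takes the disks out of the picture; a quick sketch (the address and stream count are placeholders):

# On the TrueNAS box: start an iperf3 server
iperf3 -s

# On the client: run a 30-second test with 4 parallel streams
iperf3 -c 192.168.1.100 -P 4 -t 30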
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Hello,

Here is the gstat output from a write to the NAS. The network send was limited to 6.5Gb/s throughout the copy.

Thank you!
 

Attachments

  • Screenshot 2022-10-29 at 22.08.12.png

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Hello,

Here is the gstat output from a write to the NAS. The network send was limited to 6.5Gb/s throughout the copy.

Thank you!

That seems subpar. I would expect it to be able to do maybe 3x-4x that.

What have you done for network tunables?
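For example, the first things usually checked on the FreeBSD side are the socket buffer limits; reading them is harmless, and whether (and how far) to raise them is a per-system decision:

# Ceiling for any single socket buffer, in bytes
sysctl kern.ipc.maxsockbuf

# Auto-tuning limits for TCP send/receive buffers, in bytes
sysctl net.inet.tcp.sendbuf_max net.inet.tcp.recvbuf_max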
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Hello,

Nothing for tunables since I made a single stripe with all 16 disks. I had some set before, with SMB multichannel on, but it did not help much.

Is there a recommended set? My main 25GbE switch is currently out for RMA and will return soon. I am running on a 10GbE Cisco 4500-X for now, but either way, the speeds should be higher...

Thank you!
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
Is sync on the pool set to standard? I will assume so; disable it and rerun the test, please and thank you. While you are at it, I did not see what model of NIC and which firmware you are running. Also, what HBA are you running, and what firmware is on it?
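To pull those answers from the shell, something like this should work (the pool name is a placeholder):

# Current sync setting across the pool's datasets
zfs get -r sync tank

# HBA model and firmware from the LSI/Avago flash utility
sas3flash -list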
 

Tipul01

Dabbler
Joined
Oct 19, 2022
Messages
10
Hello,

sas3flash list says this:

Adapter Selected is a Avago SAS: SAS3008(C0)

Controller Number : 0
Controller : SAS3008(C0)
PCI Address : 00:01:00:00
SAS Address : 52cea7f-0-4502-6800
NVDATA Version (Default) : 0e.01.00.37
NVDATA Version (Persistent) : 0e.01.00.37
Firmware Product ID : 0x2221 (IT)
Firmware Version : 16.00.08.00
NVDATA Vendor : LSI
NVDATA Product ID : Dell 12Gbps HBA
BIOS Version : 08.37.00

NIC cards are:

67:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 09)
67:00.1 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 09)
b3:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

I do not know how to show the driver version and firmware on the TrueNAS; the ethtool and lshw commands are not installed. I am entering the commands in the shell from the web browser.
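Perhaps something along these lines would work on a FreeBSD-based CORE install (these are stock FreeBSD commands rather than anything TrueNAS-specific, and the grep pattern is only an example):

# List PCI devices with vendor and device names, including the NICs
pciconf -lv

# Boot messages usually include the NIC driver and firmware versions
dmesg | grep -i -E 'mlx|x722|firmware'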

Thank you!
 

Syptec

Dabbler
Joined
Aug 3, 2018
Messages
42
I assume there is no LAGG using LACP? If that is true, then 10G is the limit. The cards you list are not going to run faster than that as far as I am aware, regardless of what the spec sheet may say. I suggest testing with iperf3 to confirm the maximum capability of the NICs.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I assume there is no LAGG using LACP? If that is true, then 10G is the limit. The cards you list are not going to run faster than that as far as I am aware, regardless of what the spec sheet may say. I suggest testing with iperf3 to confirm the maximum capability of the NICs.

10G is the limit regardless. 802.3ad requires that Ethernet deliver packets in the same order they're handed off to the network; this is discussed in more detail elsewhere on these forums.

This effectively means that a single TCP connection can never go faster than an LACP component member. Even if you round-robin, you will generally end up with a bunch of CPU load trying to reorder the packets on the receiving end, and this generally kills performance.
 