Long-winded technical post coming up; apologies in advance.
From your summary, it appears that the majority of your ARC is "L2ARC eligible" - the only pieces that aren't are probably the prefetched records (l2arc_noprefetch=1) - and the likely reason it isn't landing on the cache vdev is the limited feed rate (l2arc_write_max=8388608), or 8MB per feed cycle.
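If you want to confirm what your own system is running with, the current values are easy to read back. A minimal sketch, assuming a Linux/OpenZFS box (on FreeBSD or TrueNAS CORE the same knobs are sysctls under the vfs.zfs tree):

# Linux/OpenZFS: module parameters are exposed under /sys/module/zfs/parameters
grep . /sys/module/zfs/parameters/l2arc_write_max \
       /sys/module/zfs/parameters/l2arc_headroom \
       /sys/module/zfs/parameters/l2arc_noprefetch

# FreeBSD / TrueNAS CORE equivalent (same tunables, different spelling):
# sysctl vfs.zfs.l2arc_write_max vfs.zfs.l2arc_headroom vfs.zfs.l2arc_noprefetch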
L2ARC is fed by scanning the tail end of "blocks that could potentially be evicted" and trying to pick the best candidates to put onto the cache vdev. During normal operation the depth of that "tail scan" is l2arc_write_max * l2arc_headroom, with the l2arc_headroom_boost percentage (200% by default) applied on top only when the buffers being written out to L2ARC are compressed.
Taking the default values, you're scanning 8MB * 2 = 16MB every second and selecting the most eligible 8MB to put on your SSDs. Not exactly a large amount. You can increase l2arc_write_max to a higher value, but this carries a few potential risks, so you'll want to adjust your tunables slowly (think "double at a time," not "10x at a time") and monitor the L2ARC hit rate and general read/write behavior of the array as you go - there's a sketch of how to do that just below.
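As an illustration only (16MiB is just "double the default," not a recommendation), bumping the feed rate on a Linux/OpenZFS system looks something like this; TrueNAS users would normally set the same value through the GUI's tunables screen instead:

# double l2arc_write_max from 8MiB to 16MiB at runtime (takes effect immediately)
echo 16777216 > /sys/module/zfs/parameters/l2arc_write_max

# make it survive a reboot via a module option
# (any .conf file under /etc/modprobe.d/ works; zfs.conf is just convention)
echo "options zfs l2arc_write_max=16777216" >> /etc/modprobe.d/zfs.conf

# watch the ARC/L2ARC hit rates and feed volume while you test
# (field names can vary slightly between arcstat versions)
arcstat -f time,read,hit%,l2read,l2hit%,l2size,l2bytes 10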
First, your L2ARC device (SSD) will be spending additional time and bandwidth doing writes, and the purpose of the L2ARC is to provide fast reads. Most SSDs show a "bathtub curve" of performance: they can deliver their marketing-rated numbers at 100% reads or 100% writes, but once you move away from those scenarios and throw a mixed workload at them - even a 90%/10% blend - traditional NAND loses a chunk of its peak performance.
Intel's Optane (3D XPoint, which isn't NAND) is the exception, but it carries a significantly higher price tag, so it's not usually spent on a read-cache role. If you're considering it, compare the cost of setting up a smaller all-SSD vdev instead, copying your "active data" there, and then migrating it to the "cold storage" tier manually afterwards.
Next, if the L2ARC feed thread is asked to scan through more of your ARC, it will consume additional CPU time. This might not be a huge impact if you've built a system with plenty of CPU, but it's a potential bottleneck - and you can't ask it to "feed every second" if a single "scan and write" pass takes more than a second to complete.
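For reference, the feed cadence itself is governed by a couple of related tunables. A read-only peek on a Linux/OpenZFS system (defaults are 1 second and 200ms respectively):

# how often the feed thread wakes up (seconds), and the minimum interval
# between feeds (milliseconds)
cat /sys/module/zfs/parameters/l2arc_feed_secs
cat /sys/module/zfs/parameters/l2arc_feed_min_ms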
Your workload doesn't have as much of a concern about burning out the disks, as by your estimation you're writing 400GB/week to the pool, but I'd like to add a note of caution for anyone else cruising by. If you manage to catch all 400GB every week and split it between your two cache devices, that's 200GB/week per device, or only 10.4TB/year. If you quadrupled your feed rate to 32MB/s, you'd still only be writing 41.6TB/year to those disks. That's manageable. Someone who's constantly writing to their array, though - such as someone hosting virtual machines - is in a different position: even at the default 8MB/s, an L2ARC device being fed every second of every day takes in about 675GB in a single day. Quadruple that to help your hit rates and suddenly you're feeding 2.7TB/day - enough to push you to almost 1PB in a year, beyond the endurance of most consumer SSDs (e.g. the 1TB WD Red SA500 is rated for 600TBW).
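If you want to sanity-check that endurance math against your own feed rate, the arithmetic is simple enough to script; the numbers below just reproduce the figures above:

# GiB written to the cache vdev per day at a given feed rate (MiB/s),
# assuming the feed thread is busy every second of the day
FEED_MIB_S=8
echo "$(( FEED_MIB_S * 86400 / 1024 )) GiB/day"    # 8 MiB/s  -> 675 GiB/day
FEED_MIB_S=32
echo "$(( FEED_MIB_S * 86400 / 1024 )) GiB/day"    # 32 MiB/s -> 2700 GiB/day (~2.7 TiB)
# 2.7 TiB/day * 365 days is roughly 985 TiB/year - closing in on 1PB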
Finally, if you're using a 10Gbps network card, bear in mind that it can shove data into your system at up to 1GB/s. You'll end up throttling back to the overall pool speed (a 7-drive Z2 isn't likely to sustain that), but even at a quarter of that rate, your SSD would need to keep up with those ingest-speed writes for the full size of your new data in order to "stay ahead of things." If the SSD can't hit those speeds, or hiccups because it's being asked to handle a read workload (which is L2ARC's job!), then ZFS will simply evict the data from ARC without copying it - it isn't going to hit the brakes on a pending ARC eviction in order to fill the cache vdev.
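One quick way to see whether the feed is keeping up is to watch the raw L2ARC counters. On Linux/OpenZFS they live in the arcstats kstat (TrueNAS CORE exposes the same counters under kstat.zfs.misc.arcstats via sysctl):

# cumulative L2ARC counters: hits, misses, bytes written, write errors, etc.
grep '^l2_' /proc/spl/kstat/zfs/arcstats

# sample l2_write_bytes twice, a minute apart, and diff the values to see
# the feed rate the device is actually sustaining
grep '^l2_write_bytes' /proc/spl/kstat/zfs/arcstats; sleep 60; \
grep '^l2_write_bytes' /proc/spl/kstat/zfs/arcstats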
TL;DR: try gradually increasing the value of l2arc_write_max, but monitor your array for signs of CPU contention and keep an eye on your SSDs' lifespan as "butterfly effects."