TrueNAS Scale ZFS Cache Maximization Tweaks?

ajgnet

Explorer
Joined
Jun 16, 2020
Messages
65
Hi guys - we have a large 500 TB pool with 4x 8 TB NVMe cache drives. From what I can see with zpool iostat -v, they are barely touched. What optimizations can I make to force more items into the read cache for faster repeat access? We are mostly storing large photos and videos. Thank you in advance
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
You need to actively read data to warm the L2ARC. If the typical working set of all your users fits into the in-memory ARC, the L2ARC won't be touched. Perhaps a redundant (!) metadata special vdev would be more of an improvement in your case?
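A quick way to check where you stand (a minimal sketch - "tank" is a placeholder for your pool name):

Code:
# ARC size, hit ratio, and L2ARC stats - if ARC hits are already ~100%, L2 has little to do
arc_summary

# Per-vdev I/O every 5 seconds - watch whether the cache devices see any reads at all
zpool iostat -v tank 5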
 

ajgnet

Explorer
Joined
Jun 16, 2020
Messages
65
Understood, but it seems like I have to read the same data 5-6x for the cache to warm, which doesn't happen often in my use case. Is there perhaps a way to lower that threshold, or something similar?
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
The L2ARC will only get used once the primary in-memory ARC is "full". As for thresholds - sorry, I don't know.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Call this a placeholder so I can effortpost later, but the math for L2ARC header consumption is:

(Size of L2ARC in KB / Average ZFS record size in KB) * 70 = RAM consumption in bytes.

So a single 8T device with a 128K average recordsize (if it's videos and large photos) costs you (64M records * 70B) = about 4.375G of RAM to index it.

If all 4x 8T devices are in a single system it's 17.5G - definitely heftier, but likely still valuable if you have a good amount of RAM to begin with.
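If you want to sanity-check that math yourself (same example numbers as above, not measured from a live system):

Code:
# 4x 8 TiB of L2ARC, 128 KiB average recordsize, ~70 bytes per L2ARC header
records=$(( 4 * 8 * 1024 * 1024 * 1024 / 128 ))   # L2ARC size in KiB / record size in KiB
echo "$(( records * 70 / 1024 / 1024 / 1024 )) GiB of RAM for headers"   # -> 17 (17.5 before integer truncation)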

But I would ask the question: if you can predict your workload that well, would running those large SSDs as a separate pool and manually shuffling data be better?
 

ajgnet

Explorer
Joined
Jun 16, 2020
Messages
65
How much RAM do you have?
512 GB RAM

But I would ask the question: if you can predict your workload that well, would running those large SSDs as a separate pool and manually shuffling data be better?
It's tough to predict the workload. We store hundreds of terabytes of photos and videos that are frequently accessed (edited) over the network. Ideally, on first access the media object would be brought into the cache (ARC or L2ARC) so that subsequent accesses don't touch the mechanical drives. But it seems that right now, with the default configuration, the drives are still being hit.
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
The cache works on a block level, not on a file level.
Unless your 512 GB are filled to the brim, your L2ARC won't be used, so it can take days or weeks to warm it.

Persistent L2ARC will help, but depending on the number of files, the number of client systems, and the protocol (SMB?), a metadata vdev might be the better choice.
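(Side note: persistent L2ARC is governed by the l2arc_rebuild_enabled module parameter, which defaults to on in OpenZFS 2.x - a quick check, assuming a SCALE/Linux system:)

Code:
# 1 = L2ARC contents are re-indexed from the device headers after a reboot
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled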
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
It's tough to predict the workload. We store hundreds of terabytes of photos and videos that are frequently accessed (edited) over the network. Ideally, on first access the media object would be brought into the cache (ARC or L2ARC) so that subsequent accesses don't touch the mechanical drives. But it seems that right now, with the default configuration, the drives are still being hit.

Data can never move from pool vdev to L2ARC directly; it can only move from the tail-end of ARC to L2ARC.

Is your ARC presently full (or near-full) and what is the hit-rate?
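A minimal way to check both on SCALE, reading the kstats directly (arc_summary reports the same numbers with more context):

Code:
# Current ARC size vs. target and ceiling - "full" means size is at/near c and c_max
awk '/^(size|c|c_max) / {printf "%-6s %6.1f GiB\n", $1, $3/2^30}' /proc/spl/kstat/zfs/arcstats

# Lifetime ARC hit rate
awk '/^(hits|misses) / {a[$1]=$3} END {printf "%.1f%% hits\n", a["hits"]*100/(a["hits"]+a["misses"])}' /proc/spl/kstat/zfs/arcstats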
 

ornias

Wizard
Joined
Mar 6, 2020
Messages
1,458
Data can never move from pool vdev to L2ARC directly; it can only move from the tail-end of ARC to L2ARC.

Is your ARC presently full (or near-full) and what is the hit-rate?
Actually, I talked about this with the persistent L2ARC dev back in the day, and it is technically possible to tweak parameters in such a way that ARC is written to L2ARC much sooner.

But indeed, it still has to hit ARC first.
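For anyone wanting to dig, these are presumably the knobs in question - the L2ARC feed parameters (shown read-only here; treat any changes as experimental):

Code:
cat /sys/module/zfs/parameters/l2arc_write_max    # bytes fed to L2ARC per interval (8388608 = 8 MiB default)
cat /sys/module/zfs/parameters/l2arc_headroom     # multiplier on the eviction-tail scan depth
cat /sys/module/zfs/parameters/l2arc_feed_min_ms  # minimum ms between feed cycles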
 

James Gardiner

Dabbler
Joined
Jul 14, 2017
Messages
19
Interesting thread, but no meat...

I have a test server to see how well this works: TrueNAS SCALE, 7x 3TB Z2, 2x 2TB SSD L2ARC, 96G RAM. I mainly write LARGE files, like 40 gig each and about 200 gig for a set of files. I then read them back about 4-5 times doing analysis and rendering passes. I wanted this all to come from the SSDs, but as mentioned, it's not working. When I read the large file sets, they all come from the disks. The L2ARC isn't doing anything.

There must be a way to tweak the L2ARC to cache large files, even on the first write. (Or improve it.)
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Interesting thread, but no meat...

I have a test server to see how well this works: TrueNAS SCALE, 7x 3TB Z2, 2x 2TB SSD L2ARC, 96G RAM. I mainly write LARGE files, like 40 gig each and about 200 gig for a set of files. I then read them back about 4-5 times doing analysis and rendering passes. I wanted this all to come from the SSDs, but as mentioned, it's not working. When I read the large file sets, they all come from the disks. The L2ARC isn't doing anything.

There must be a way to tweak the L2ARC to cache large files, even on the first write. (Or improve it.)
If you can try CORE, it would be useful to know the differences.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Interesting thread, but no meat...

I have a test server to see how well this works: TrueNAS SCALE, 7x 3TB Z2, 2x 2TB SSD L2ARC, 96G RAM. I mainly write LARGE files, like 40 gig each and about 200 gig for a set of files. I then read them back about 4-5 times doing analysis and rendering passes. I wanted this all to come from the SSDs, but as mentioned, it's not working. When I read the large file sets, they all come from the disks. The L2ARC isn't doing anything.

There must be a way to tweak the L2ARC to cache large files, even on the first write. (Or improve it.)
Can you dump an arc_summary.py result in [CODE][/CODE] tags? Interested to see if it's marking things as ineligible, or if it's just a matter of the default feed rate being too low.

Note that increasing the feed rate means more wear on your SSD. It's a balancing act.
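If it's easier than posting the whole thing, recent OpenZFS builds let you pull individual sections (assuming your arc_summary supports the -s flag; otherwise just post the full output):

Code:
arc_summary -s arc      # ARC sizes and state
arc_summary -s l2arc    # L2ARC size and hit/miss breakdown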
 

James Gardiner

Dabbler
Joined
Jul 14, 2017
Messages
19
Hi,
Did a little more testing this morning, and definitely, reading from a large 150 gig file - local or shared, it makes no difference - comes from the disks, never the L2. If I read a 50 gig file, which is small enough to fit into ARC, there is no throughput on the disks, as I imagine it's coming out of memory. But with the 150 gig file: always from disk.

Below is the output of arc_summary.
I am using SCALE as I want to see how well it works. I use Docker containers a lot, and hosting some of the more IO-intensive containers directly on the TrueNAS server sounded interesting; this is why I am doing these tests. But if I cannot get the IO speed, it's a little academic.

SSD wear-wise, the storage array only has about 400 gig coming in per week at the moment (i.e. of writes) and is mostly read from. Typically the older files get read little, and the newer ones are busy for a month or so, so I wanted an L2ARC that could hold all of the most active data. So I am not worried the SSD will wear out - well, not fast anyway. A server with 1-2TB of RAM would be best I suppose, but this is a side project and I don't want to spend a lot on it. (A server that can hold that much RAM, plus the RAM itself, would be prohibitively expensive.)

Code:
root@truenas:/# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Oct 20 10:16:41 2022
Linux 5.10.142+truenas                                           2.1.5-1
Machine: truenas (x86_64)                                        2.1.5-1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    99.9 %   62.9 GiB
        Target size (adaptive):                       100.0 %   62.9 GiB
        Min size (hard limit):                          6.2 %    3.9 GiB
        Max size (high water):                           16:1   62.9 GiB
        Most Frequently Used (MFU) cache size:         69.9 %   43.6 GiB
        Most Recently Used (MRU) cache size:           30.1 %   18.8 GiB
        Metadata cache size (hard limit):              75.0 %   47.2 GiB
        Metadata cache size (current):                  1.2 %  594.2 MiB
        Dnode cache size (hard limit):                 10.0 %    4.7 GiB
        Dnode cache size (current):                     0.6 %   30.3 MiB

ARC hash breakdown:
        Elements max:                                               1.4M
        Elements current:                              99.9 %       1.4M
        Collisions:                                               777.6k
        Chain max:                                                     4
        Chains:                                                    57.5k

ARC misc:
        Deleted:                                                    3.1M
        Mutex misses:                                               1.3k
        Eviction skips:                                             9.4k
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                   450.0 GiB
        L2 eligible evictions:                                  18.4 GiB
        L2 eligible MFU evictions:                     74.3 %   13.7 GiB
        L2 eligible MRU evictions:                     25.7 %    4.7 GiB
        L2 ineligible evictions:                               106.9 MiB

ARC total accesses (hits + misses):                                11.9G
        Cache hit ratio:                              100.0 %      11.8G
        Cache miss ratio:                             < 0.1 %       4.4M
        Actual hit ratio (MFU + MRU hits):            100.0 %      11.8G
        Data demand efficiency:                       100.0 %     693.9M
        Data prefetch efficiency:                       7.9 %       4.7M

Cache hits by cache type:
        Most frequently used (MFU):                    98.6 %      11.7G
        Most recently used (MRU):                       1.4 %     168.3M
        Most frequently used (MFU) ghost:             < 0.1 %       9.8k
        Most recently used (MRU) ghost:               < 0.1 %     219.8k
        Anonymously used:                             < 0.1 %     164.4k

Cache hits by data type:
        Demand data:                                    5.9 %     693.9M
        Demand prefetch data:                         < 0.1 %     371.3k
        Demand metadata:                               94.1 %      11.2G
        Demand prefetch metadata:                     < 0.1 %      23.2k

Cache misses by data type:
        Demand data:                                    0.4 %      17.4k
        Demand prefetch data:                          99.2 %       4.3M
        Demand metadata:                                0.3 %      13.5k
        Demand prefetch metadata:                       0.1 %       5.9k

DMU prefetch efficiency:                                           12.2M
        Hit ratio:                                     25.7 %       3.1M
        Miss ratio:                                    74.3 %       9.1M

L2ARC status:                                                    HEALTHY
        Low memory aborts:                                             0
        Free on write:                                                29
        R/W clashes:                                                   0
        Bad checksums:                                                 0
        I/O errors:                                                    0

L2ARC size (adaptive):                                         167.2 GiB
        Compressed:                                    97.0 %  162.1 GiB
        Header size:                                  < 0.1 %   37.6 MiB
        MFU allocated size:                            78.6 %  127.5 GiB
        MRU allocated size:                            21.4 %   34.7 GiB
        Prefetch allocated size:                      < 0.1 %    2.1 MiB
        Data (buffer content) allocated size:          99.9 %  162.0 GiB
        Metadata (buffer content) allocated size:       0.1 %  140.0 MiB

L2ARC breakdown:                                                    4.3M
        Hit ratio:                                      0.3 %      10.9k
        Miss ratio:                                    99.7 %       4.3M
        Feeds:                                                    156.8k

L2ARC writes:
        Writes sent:                                    100 %      47.4k

L2ARC evicts:
        Lock retries:                                                  0
        Upon reading:                                                  0

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       8388608
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 1
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4

Tunables:
        dbuf_cache_hiwater_pct                                        10
        dbuf_cache_lowater_pct                                        10
        dbuf_cache_max_bytes                        18446744073709551615
        dbuf_cache_shift                                               5
        dbuf_metadata_cache_max_bytes               18446744073709551615
        dbuf_metadata_cache_shift                                      6
        dmu_object_alloc_chunk_shift                                   7
        dmu_prefetch_max                                       134217728
        ignore_hole_birth                                              1
        l2arc_feed_again                                               1
        l2arc_feed_min_ms                                            200
        l2arc_feed_secs                                                1
        l2arc_headroom                                                 2
        l2arc_headroom_boost                                         200
        l2arc_meta_percent                                            33
        l2arc_mfuonly                                                  0
        l2arc_noprefetch                                               1
        l2arc_norw                                                     0
        l2arc_rebuild_blocks_min_l2size                       1073741824
        l2arc_rebuild_enabled                                          1
        l2arc_trim_ahead                                               0
        l2arc_write_boost                                        8388608
        l2arc_write_max                                          8388608
        metaslab_aliquot                                          524288
        metaslab_bias_enabled                                          1
        metaslab_debug_load                                            0
        metaslab_debug_unload                                          0
        metaslab_df_max_search                                  16777216
        metaslab_df_use_largest_segment                                0
        metaslab_force_ganging                                  16777217
        metaslab_fragmentation_factor_enabled                          1
        metaslab_lba_weighting_enabled                                 1
        metaslab_preload_enabled                                       1
        metaslab_unload_delay                                         32
        metaslab_unload_delay_ms                                  600000
        send_holes_without_birth_time                                  1
        spa_asize_inflation                                           24
        spa_config_path                             /etc/zfs/zpool.cache
        spa_load_print_vdev_tree                                       0
        spa_load_verify_data                                           1
        spa_load_verify_metadata                                       1
        spa_load_verify_shift                                          4
        spa_slop_shift                                                 5
        vdev_file_logical_ashift                                       9
        vdev_file_physical_ashift                                      9
        vdev_removal_max_span                                      32768
        vdev_validate_skip                                             0
        zap_iterate_prefetch                                           1
        zfetch_array_rd_sz                                       1048576
        zfetch_max_distance                                      8388608
        zfetch_max_idistance                                    67108864
        zfetch_max_streams                                             8
        zfetch_min_sec_reap                                            2
        zfs_abd_scatter_enabled                                        1
        zfs_abd_scatter_max_order                                     13
        zfs_abd_scatter_min_size                                    1536
        zfs_admin_snapshot                                             0
        zfs_allow_redacted_dataset_mount                               0
        zfs_arc_average_blocksize                                   8192
        zfs_arc_dnode_limit                                            0
        zfs_arc_dnode_limit_percent                                   10
        zfs_arc_dnode_reduce_percent                                  10
        zfs_arc_evict_batch_limit                                     10
        zfs_arc_eviction_pct                                         200
        zfs_arc_grow_retry                                             0
        zfs_arc_lotsfree_percent                                      10
        zfs_arc_max                                                    0
        zfs_arc_meta_adjust_restarts                                4096
        zfs_arc_meta_limit                                             0
        zfs_arc_meta_limit_percent                                    75
        zfs_arc_meta_min                                               0
        zfs_arc_meta_prune                                         10000
        zfs_arc_meta_strategy                                          1
        zfs_arc_min                                                    0
        zfs_arc_min_prefetch_ms                                        0
        zfs_arc_min_prescient_prefetch_ms                              0
        zfs_arc_p_dampener_disable                                     1
        zfs_arc_p_min_shift                                            0
        zfs_arc_pc_percent                                             0
        zfs_arc_prune_task_threads                                     1
        zfs_arc_shrink_shift                                           0
        zfs_arc_shrinker_limit                                     10000
        zfs_arc_sys_free                                               0
        zfs_async_block_max_blocks                  18446744073709551615
        zfs_autoimport_disable                                         1
        zfs_checksum_events_per_second                                20
        zfs_commit_timeout_pct                                         5
        zfs_compressed_arc_enabled                                     1
        zfs_condense_indirect_commit_entry_delay_ms                    0
        zfs_condense_indirect_obsolete_pct                            25
        zfs_condense_indirect_vdevs_enable                             1
        zfs_condense_max_obsolete_bytes                       1073741824
        zfs_condense_min_mapping_bytes                            131072
        zfs_dbgmsg_enable                                              1
        zfs_dbgmsg_maxsize                                       4194304
        zfs_dbuf_state_index                                           0
        zfs_ddt_data_is_special                                        1
        zfs_deadman_checktime_ms                                   60000
        zfs_deadman_enabled                                            1
        zfs_deadman_failmode                                        wait
        zfs_deadman_synctime_ms                                   600000
        zfs_deadman_ziotime_ms                                    300000
        zfs_dedup_prefetch                                             0
        zfs_delay_min_dirty_percent                                   60
        zfs_delay_scale                                           500000
        zfs_delete_blocks                                          20480
        zfs_dirty_data_max                                    4294967296
        zfs_dirty_data_max_max                                4294967296
        zfs_dirty_data_max_max_percent                                25
        zfs_dirty_data_max_percent                                    10
        zfs_dirty_data_sync_percent                                   20
        zfs_disable_ivset_guid_check                                   0
        zfs_dmu_offset_next_sync                                       1
        zfs_embedded_slog_min_ms                                      64
        zfs_expire_snapshot                                          300
        zfs_fallocate_reserve_percent                                110
        zfs_flags                                                      0
        zfs_free_bpobj_enabled                                         1
        zfs_free_leak_on_eio                                           0
        zfs_free_min_time_ms                                        1000
        zfs_history_output_max                                   1048576
        zfs_immediate_write_sz                                     32768
        zfs_initialize_chunk_size                                1048576
        zfs_initialize_value                        16045690984833335022
        zfs_keep_log_spacemaps_at_export                               0
        zfs_key_max_salt_uses                                  400000000
        zfs_livelist_condense_new_alloc                                0
        zfs_livelist_condense_sync_cancel                              0
        zfs_livelist_condense_sync_pause                               0
        zfs_livelist_condense_zthr_cancel                              0
        zfs_livelist_condense_zthr_pause                               0
        zfs_livelist_max_entries                                  500000
        zfs_livelist_min_percent_shared                               75
        zfs_lua_max_instrlimit                                 100000000
        zfs_lua_max_memlimit                                   104857600
        zfs_max_async_dedup_frees                                 100000
        zfs_max_log_walking                                            5
        zfs_max_logsm_summary_length                                  10
        zfs_max_missing_tvds                                           0
        zfs_max_nvlist_src_size                                        0
        zfs_max_recordsize                                       1048576
        zfs_metaslab_find_max_tries                                  100
        zfs_metaslab_fragmentation_threshold                          70
        zfs_metaslab_max_size_cache_sec                             3600
        zfs_metaslab_mem_limit                                        25
        zfs_metaslab_segment_weight_enabled                            1
        zfs_metaslab_switch_threshold                                  2
        zfs_metaslab_try_hard_before_gang                              0
        zfs_mg_fragmentation_threshold                                95
        zfs_mg_noalloc_threshold                                       0
        zfs_min_metaslabs_to_flush                                     1
        zfs_multihost_fail_intervals                                  10
        zfs_multihost_history                                          0
        zfs_multihost_import_intervals                                20
        zfs_multihost_interval                                      1000
        zfs_multilist_num_sublists                                     0
        zfs_no_scrub_io                                                0
        zfs_no_scrub_prefetch                                          0
        zfs_nocacheflush                                               0
        zfs_nopwrite_enabled                                           1
        zfs_object_mutex_size                                         64
        zfs_obsolete_min_time_ms                                     500
        zfs_override_estimate_recordsize                               0
        zfs_pd_bytes_max                                        52428800
        zfs_per_txg_dirty_frees_percent                                5
        zfs_prefetch_disable                                           0
        zfs_read_history                                               0
        zfs_read_history_hits                                          0
        zfs_rebuild_max_segment                                  1048576
        zfs_rebuild_scrub_enabled                                      1
        zfs_rebuild_vdev_limit                                  33554432
        zfs_reconstruct_indirect_combinations_max                   4096
        zfs_recover                                                    0
        zfs_recv_queue_ff                                             20
        zfs_recv_queue_length                                   16777216
        zfs_recv_write_batch_size                                1048576
        zfs_removal_ignore_errors                                      0
        zfs_removal_suspend_progress                                   0
        zfs_remove_max_segment                                  16777216
        zfs_resilver_disable_defer                                     0
        zfs_resilver_min_time_ms                                    3000
        zfs_scan_checkpoint_intval                                  7200
        zfs_scan_fill_weight                                           3
        zfs_scan_ignore_errors                                         0
        zfs_scan_issue_strategy                                        0
        zfs_scan_legacy                                                0
        zfs_scan_max_ext_gap                                     2097152
        zfs_scan_mem_lim_fact                                         20
        zfs_scan_mem_lim_soft_fact                                    20
        zfs_scan_strict_mem_lim                                        0
        zfs_scan_suspend_progress                                      0
        zfs_scan_vdev_limit                                      4194304
        zfs_scrub_min_time_ms                                       1000
        zfs_send_corrupt_data                                          0
        zfs_send_no_prefetch_queue_ff                                 20
        zfs_send_no_prefetch_queue_length                        1048576
        zfs_send_queue_ff                                             20
        zfs_send_queue_length                                   16777216
        zfs_send_unmodified_spill_blocks                               1
        zfs_slow_io_events_per_second                                 20
        zfs_spa_discard_memory_limit                            16777216
        zfs_special_class_metadata_reserve_pct                        25
        zfs_sync_pass_deferred_free                                    2
        zfs_sync_pass_dont_compress                                    8
        zfs_sync_pass_rewrite                                          2
        zfs_sync_taskq_batch_pct                                      75
        zfs_traverse_indirect_prefetch_limit                          32
        zfs_trim_extent_bytes_max                              134217728
        zfs_trim_extent_bytes_min                                  32768
        zfs_trim_metaslab_skip                                         0
        zfs_trim_queue_limit                                          10
        zfs_trim_txg_batch                                            32
        zfs_txg_history                                              100
        zfs_txg_timeout                                                5
        zfs_unflushed_log_block_max                               262144
        zfs_unflushed_log_block_min                                 1000
        zfs_unflushed_log_block_pct                                  400
        zfs_unflushed_log_txg_max                                   1000
        zfs_unflushed_max_mem_amt                             1073741824
        zfs_unflushed_max_mem_ppm                                   1000
        zfs_unlink_suspend_progress                                    0
        zfs_user_indirect_is_special                                   1
        zfs_vdev_aggregate_trim                                        0
        zfs_vdev_aggregation_limit                               1048576
        zfs_vdev_aggregation_limit_non_rotating                   131072
        zfs_vdev_async_read_max_active                                 3
        zfs_vdev_async_read_min_active                                 1
        zfs_vdev_async_write_active_max_dirty_percent                 60
        zfs_vdev_async_write_active_min_dirty_percent                 30
        zfs_vdev_async_write_max_active                               10
        zfs_vdev_async_write_min_active                                2
        zfs_vdev_cache_bshift                                         16
        zfs_vdev_cache_max                                         16384
        zfs_vdev_cache_size                                            0
        zfs_vdev_default_ms_count                                    200
        zfs_vdev_default_ms_shift                                     29
        zfs_vdev_initializing_max_active                               1
        zfs_vdev_initializing_min_active                               1
        zfs_vdev_max_active                                         1000
        zfs_vdev_max_auto_ashift                                      16
        zfs_vdev_min_auto_ashift                                       9
        zfs_vdev_min_ms_count                                         16
        zfs_vdev_mirror_non_rotating_inc                               0
        zfs_vdev_mirror_non_rotating_seek_inc                          1
        zfs_vdev_mirror_rotating_inc                                   0
        zfs_vdev_mirror_rotating_seek_inc                              5
        zfs_vdev_mirror_rotating_seek_offset                     1048576
        zfs_vdev_ms_count_limit                                   131072
        zfs_vdev_nia_credit                                            5
        zfs_vdev_nia_delay                                             5
        zfs_vdev_queue_depth_pct                                    1000
        zfs_vdev_raidz_impl   cycle [fastest] original scalar sse2 ssse3
        zfs_vdev_read_gap_limit                                    32768
        zfs_vdev_rebuild_max_active                                    3
        zfs_vdev_rebuild_min_active                                    1
        zfs_vdev_removal_max_active                                    2
        zfs_vdev_removal_min_active                                    1
        zfs_vdev_scheduler                                        unused
        zfs_vdev_scrub_max_active                                      3
        zfs_vdev_scrub_min_active                                      1
        zfs_vdev_sync_read_max_active                                 10
        zfs_vdev_sync_read_min_active                                 10
        zfs_vdev_sync_write_max_active                                10
        zfs_vdev_sync_write_min_active                                10
        zfs_vdev_trim_max_active                                       2
        zfs_vdev_trim_min_active                                       1
        zfs_vdev_write_gap_limit                                    4096
        zfs_vnops_read_chunk_size                                1048576
        zfs_xattr_compat                                               0
        zfs_zevent_len_max                                           512
        zfs_zevent_retain_expire_secs                                900
        zfs_zevent_retain_max                                       2000
        zfs_zil_clean_taskq_maxalloc                             1048576
        zfs_zil_clean_taskq_minalloc                                1024
        zfs_zil_clean_taskq_nthr_pct                                 100
        zil_maxblocksize                                          131072
        zil_nocacheflush                                               0
        zil_replay_disable                                             0
        zil_slog_bulk                                             786432
        zio_deadman_log_all                                            0
        zio_dva_throttle_enabled                                       1
        zio_requeue_io_start_cut_in_line                               1
        zio_slow_io_ms                                             30000
        zio_taskq_batch_pct                                           80
        zio_taskq_batch_tpq                                            0
        zvol_inhibit_dev                                               0
        zvol_major                                                   230
        zvol_max_discard_blocks                                    16384
        zvol_prefetch_bytes                                       131072
        zvol_request_sync                                              0
        zvol_threads                                                  32
        zvol_volmode                                                   2

VDEV cache disabled, skipping section

ZIL committed transactions:                                       386.9k
        Commit requests:                                            9.8k
        Flushes to stable storage:                                  9.8k
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:        1.1 GiB      13.1k

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Long-winded technical post coming up; apologies in advance.

From your summary, it appears that the majority of your ARC is "L2ARC eligible" - the only pieces that aren't are probably the prefetched records (l2arc_noprefetch=1) - and the likely reason it isn't landing on the cache vdev is the limited feed rate (l2arc_write_max=8388608), i.e. 8MB per cycle.

L2ARC is fed by scanning the tail end of "blocks that could potentially be evicted" and trying to pick the best candidates to put onto the cache vdev. The length of the "tail scan" is expressed as the product of "l2arc_write_max * l2arc_headroom * l2arc_headroom_boost%" during normal operation.

Taking the default values, you're scanning 8MB * 2 = 16MB every second, and selecting the most eligible 8MB to put on your SSDs. Not exactly a large amount. You can increase l2arc_write_max to a higher value but this has a few potential risks, so you'll want to adjust your tunables slowly (think "double at a time" not "10x at a time") and monitor the L2ARC hitrate and general read/write behavior of the array as you go.
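As a concrete sketch of "double at a time" (illustrative values; echoing into /sys doesn't survive a reboot, so on SCALE you'd persist it with a post-init script or your version's tunable mechanism):

Code:
# Double the feed from the 8 MiB default to 16 MiB per interval
echo 16777216 > /sys/module/zfs/parameters/l2arc_write_max
# Keep the pre-warm boost in step with it
echo 16777216 > /sys/module/zfs/parameters/l2arc_write_boost
# Optional, for large sequential media reads: let prefetched buffers into L2ARC
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch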

First, your L2ARC device (SSD) will be spending additional time and bandwidth doing writes, and the purpose of the L2ARC is to provide fast reads. Most SSDs experience a "bathtub curve" of performance: they can deliver their marketing-rated performance at 100% reads or 100% writes, but once you move away from those scenarios and throw a mixed workload at them - even a 90%/10% blend - traditional NAND loses a chunk of its peak performance. Click the graphs for full-size versions.

[Graphs: optane1.png / optane2.png - mixed read/write performance curves, NAND vs. Optane]

Intel's 3D XPoint (Optane) is the exception, but it carries a significantly higher price tag, so it's not usually used for a read-only workload. If you're considering this, compare the cost of setting up a smaller all-SSD pool instead, copying your "active data" there, and then migrating it to the "cold storage" tier manually afterwards.

Next, if the L2ARC feed thread is asked to scan through more of your ARC, it will consume additional CPU time. This might not be a huge impact if you've built a system with a lot of CPU, but it's a potential source of bottleneck - you also can't ask it to "feed every second" if a "scan and write" will take more than a second to complete.

Your workload doesn't have as much of a concern about burning out the disks, as by your estimation you're writing 400GB/week to the pool, but I'd like to make a note of caution for anyone else cruising by. If you manage to catch all 400GB every week and split it between your two cache devices, that's 200GB/week or only 10.4TB/year. If you quadrupled your feed rate to 32MB/s, you'd still only be writing 41.6TB/year to those disks. This is manageable. Someone who's constantly writing to their array, though - such as someone hosting virtual machines - even at the default 8MB/s, if their L2ARC device is being fed every second of every day, that's 675GB in a single day. Quadruple that to help your hitrates, and suddenly you're feeding 2.7T/day - enough to push you to almost 1PB in a year, beyond the endurance of most consumer SSDs (eg: the 1TB WD Red SA500 is rated for 600TBW).
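For anyone plugging in their own numbers, the worst-case wear math is just feed rate times time (this assumes the feed is saturated every second - the pessimistic bound, not typical behavior):

Code:
feed=8388608                                    # l2arc_write_max in bytes/s (8 MiB default)
per_day=$(( feed * 86400 ))
echo "$(( per_day / 1024 / 1024 / 1024 )) GiB/day"                            # -> 675
echo "$(( per_day * 365 / 1024 / 1024 / 1024 / 1024 )) TiB/year at default"   # -> 240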

Finally, if you're using a 10Gbps network card, bear in mind that it's going to heave data into your system at 1GB/s. You'll end up throttling back to the overall pool speed (as a 7-drive Z2 isn't likely to be able to sustain that) but even at a quarter of that speed, you'd need your SSD to be able to keep up with network-speed writes for the full size of your new data in order to "stay ahead of things." If the SSD can't hit those speeds, or hiccups because it's being asked to handle a read workload (as is L2ARC's job!) then ZFS will just evict the data from ARC and not copy it - it isn't going to hit the brakes on a pending ARC eviction in order to fill the cache vdev.

TL;DR try gradually increasing the value of l2arc_write_max but monitor your array for signs of CPU contention and your SSD's lifespan as "butterfly effects."
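If you want a quick number to watch while you tune, the lifetime L2ARC counters are in the kstats (re-run after a test read pass and compare; the arcstat tool can show the same thing live):

Code:
awk '/^l2_(hits|misses) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats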
 