TRIM takes ages to complete.

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
The iostat output is far better in terms of detail, but here is a screenshot for two of the disks anyway; it is the same on all the others:
[screenshot of disk stats attached]
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
Here is the output of iostat.
 

Attachments

  • iostat.txt
    51.2 KB · Views: 166

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996

IvoT said:
Here is the iostat output as well https://pastecode.io/s/wa6y4i6i (view in fullscreen for better readability)
Please do not use external links, many of our forum people will not click on the links. Please post directly in our forums. And this site has a lot of popups! How many ways can we infect a computer, let me count the ways. The attached text file is much better for me to view.

From the data provided, while it may not be a lot of data, it does appear that 'writing data' is virtually continuous. I have no idea whether that is contributing to your problem or not.

Another piece of information you have not provided is the version of SCALE you are running now, and what you were running when your TRIM times were good. I'm trying to take notice of your postings: you just joined the forum yesterday (thanks for joining), and you have been running TrueNAS for months, but on what version? Did you run CORE and upgrade to SCALE? Which version of SCALE? We need some more history. Did this problem start after an upgrade? Can you roll back to the previous version that worked, to verify that TRIM works there and that the current version of SCALE is causing your issue?
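
If you're not sure where to look, the version string is shown in the web UI, or (assuming you have a shell open on the box) you can read it directly:

Code:
cat /etc/version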

If you upgraded your pool's feature flags, you may not be able to roll back. This is why I never upgrade my pools: the new features are nothing I would use, and they prevent rolling back to a previous version. Just something worth noting.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Hey @IvoT

The ZFS defaults limit the queue to a maximum of 10 TRIM commands per leaf vdev - so with your 8x SSDs in RAIDZ2, you're averaging just over one TRIM command in queue per physical disk, and it also aims to aggregate your TRIMs across transaction groups. This is designed to limit the TRIM speed in order not to impact pool I/O.
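
If you want to check what your build is actually running with, the OpenZFS module parameters can be read back from sysfs - a quick look, with the caveat that parameter names can shift between OpenZFS releases:

Code:
# print the per-vdev TRIM queue limits currently in effect
grep . /sys/module/zfs/parameters/zfs_vdev_trim_*_active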

SAS drives do better with discards than SATA, so you may be able to use autotrim in your situation and simply let the VMFS layer pass the UNMAP commands down the chain (VMware can also rate-limit the discard speed).
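
If you want to try autotrim, it's a single pool property - a minimal sketch, substituting your own pool name for "tank":

Code:
# let ZFS issue discards continuously as blocks are freed
zpool set autotrim=on tank
# confirm the property took
zpool get autotrim tank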

I need to make a longer effortpost on TRIM in general. Someone nag me if I haven't done it in a reasonable amount of time. :wink:
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
joeschmuck said:
Please do not use external links... Another piece of information you have not provided is the version of SCALE you are running now, and what you were running when your TRIM times were good... Can you roll back to the previous version that worked, to verify that TRIM works there and that the current version of SCALE is causing your issue?

Note taken for the external links - I will attach logs directly from now on.
The 'writing data' is actually the trimming taking place. If you look at the iostat output you can clearly see that it comes from dMB/s (discarded MB/s).
I am running TrueNAS-23.10.0.1. I don't remember on which version TRIM was working fast, but this was not upgraded from CORE; it was a clean install.

HoneyBadger said:
The ZFS defaults limit the queue to a maximum of 10 TRIM commands per leaf vdev... This is designed to limit the TRIM speed in order not to impact pool I/O.

Is there a way to raise that limit? I don't have a lot of workload on the pool, so I would prefer faster TRIM times over pool I/O.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
@IvoT What does the output of zpool iostat -vq YourPoolName 5 show? The trimq_write columns should show how deep your current TRIM queues are.

Your iostat output also shows fairly small dareq-sz (discard average request size) values of only ~128K - by contrast, I manually TRIMmed a set of four SSDs here and got much larger chunks with higher throughput.

Code:
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0             0.00      0.00     0.00   0.00    0.08    21.75    0.00      0.00     0.00   0.00    0.00     2.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme0n1          0.20      0.00     0.00   0.00    0.23    21.23   30.60      0.10     0.00   0.00    0.04     3.45    0.00      0.00     0.00   0.00    0.00     0.00    0.42    1.45    0.00   0.36
sda              0.00      0.00     0.00   0.00    3.28    28.88    0.00      0.00     0.00   0.00    0.98    10.05    0.00      0.00     0.00   0.00    0.00     0.00    0.00    4.75    0.00   0.00
sdb              0.00      0.00     0.00   0.00    3.29    27.59    0.00      0.00     0.00   0.00    0.89    10.05    0.00      0.00     0.00   0.00    0.00     0.00    0.00    4.75    0.00   0.00
sdc              0.01      0.00     0.00   0.00    3.50    25.75    0.00      0.00     0.00   0.00    1.49    10.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00   12.42    0.00   0.00
sdd              0.01      0.00     0.00   0.00    3.84    26.30    0.00      0.00     0.00   0.00    1.51    10.63    0.00      0.00     0.00   0.00    0.00     0.00    0.00   12.58    0.00   0.00
sde              0.06      0.00     0.00   0.00    1.38    20.33   17.97      0.12     0.04   0.25    0.06     6.98    0.27     22.90     0.00   0.20    1.11 85860.53    0.54    0.06    0.00   0.53
sdf              0.06      0.00     0.00   0.00    1.87    19.59   17.91      0.12     0.05   0.27    0.06     6.99    0.27     22.90     0.00   0.19    1.11 86084.19    0.54    0.06    0.00   0.53
sdg              0.06      0.00     0.00   0.00    0.25    19.19   17.97      0.12     0.05   0.25    0.06     6.98    0.27     22.90     0.00   0.19    1.12 86080.18    0.54    0.06    0.00   0.53
sdh              0.06      0.00     0.00   0.02    0.25    19.47   17.92      0.12     0.05   0.25    0.06     6.99    0.27     22.90     0.00   0.18    1.11 85844.60    0.54    0.06    0.00   0.53
sdi              0.00      0.00     0.00   0.00    3.35    26.74    0.00      0.00     0.00   0.00    1.47    10.14    0.00      0.00     0.00   0.00    0.00     0.00    0.00   11.58    0.00   0.00
sdj              0.00      0.00     0.00   0.00    3.81    27.19    0.00      0.00     0.00   0.00    1.45     9.99    0.00      0.00     0.00   0.00    0.00     0.00    0.00   12.50    0.00   0.00
sdk              0.00      0.00     0.00   0.00    3.83    27.53    0.00      0.00     0.00   0.60    1.62    10.40    0.00      0.00     0.00   0.00    0.00     0.00    0.00    5.00    0.00   0.00
sdl              0.01      0.00     0.00   0.00    3.11    26.27    0.00      0.00     0.00   0.59    1.07    10.11    0.00      0.00     0.00   0.00    0.00     0.00    0.00    7.33    0.00   0.00
zd0              0.00      0.00     0.00   0.00    0.00    21.75    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00


A virtualization workload does produce a lot of small records. I skimmed through the posts, but didn't see an answer - has there been a significant amount of write I/O to the system since the last manual TRIM?
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
Sure there was; my homelab is running on it, and over the last few months it may have produced some 50-60TB of writes.
Here is the output of the command:
Code:
                                            capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write  rebuildq_write
pool                                      alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
SAS                                       12.7T  15.2T     42    699   249K  4.51M      0      0      0      0      0      0      0      0      0      0     63     16      0      0
  raidz2-0                                12.7T  15.2T     42    699   249K  4.51M      0      0      0      0      0      0      0      0      0      0     63     16      0      0
    547ba0f9-ba44-4f01-ad33-4bda61310bdd      -      -      5     81  30.4K   560K      0      0      0      0      0      0      0      0      0      0      8      2      0      0
    ae6f3e47-7d8e-45ff-b56c-0907c47b83bf      -      -      5     88  29.6K   577K      0      0      0      0      0      0      0      0      0      0      8      2      0      0
    555672f7-1f29-451d-8fad-820612d19d05      -      -      5     87  30.4K   573K      0      0      0      0      0      0      0      0      0      0      8      2      0      0
    fba71670-e645-4906-818b-20238c8f97fd      -      -      4     89  29.6K   584K      0      0      0      0      0      0      0      0      0      0      8      2      0      0
    5e41b921-b752-490d-bf6d-b79ecb9b1c22      -      -      5     90  34.4K   584K      0      0      0      0      0      0      0      0      0      0      8      2      0      0
    d1b7284c-ab83-498c-bad3-72ea74a7f6c8      -      -      5     89  31.2K   579K      0      0      0      0      0      0      0      0      0      0      8      2      0      0
    03efaaf3-eb40-492c-8f16-87c2b88f2aed      -      -      5     86  32.0K   586K      0      0      0      0      0      0      0      0      0      0      8      2      0      0
    b26c8881-71ed-40b7-95c4-e45df9b4d9e1      -      -      5     87  31.2K   577K      0      0      0      0      0      0      0      0      0      0      7      2      0      0
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Instructions below. If you're coming here from a search, future reader, bear in mind the impacts discussed as well rather than just applying a tunable blindly. :wink:

You can change the per-device TRIM limit with:

Code:
echo N > /sys/module/zfs/parameters/zfs_vdev_trim_max_active

The default value is 2, but you can increase N up to the value of zfs_vdev_max_active, with the understanding that increased TRIM/UNMAP activity will negatively impact pool I/O.

Because of your vdev drive count, you'll possibly also have to bump up the per-vdev limit:
Code:
echo N > /sys/module/zfs/parameters/zfs_trim_queue_limit

Again: more TRIM means less actual I/O. Increase gradually, and monitor iostat and zpool iostat as well as general application latencies.
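
Putting that together, a minimal way to test a new value and watch the effect - note these sysfs writes are runtime-only and do not persist across a reboot, and substitute your own pool name for "tank":

Code:
# bump the per-device TRIM limit (8 is just an example value)
echo 8 > /sys/module/zfs/parameters/zfs_vdev_trim_max_active
# then watch the trimq_write pend/activ columns, refreshed every 5 seconds
zpool iostat -vq tank 5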
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
I've tried the tunables, but nothing major happened, besides the trimq_write pending dropping to 0 and maybe 2-3 dMB/s more. I've tried a few values, and it seems zfs_trim_queue_limit does nothing in practice: as long as zfs_vdev_trim_max_active is at least as large as zfs_trim_queue_limit, trimq_write pending goes to 0 and that's it. In my case it seems something is not sending discards fast enough. Something is not letting the per-vdev active count go above 4 after pending drops to 0.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
You mentioned you have one of these SSDs in another Linux machine and it trims significantly faster (400-500MB/s) - presumably using a different filesystem. Is there any chance you can collect some iostat -mx dumps from that one as well, to compare the discard column metrics?

Also, could you show an hdparm -I /dev/sdX output, specifically with regard to the reported logical/physical sector size (since it's a Samsung, I expect both to report as 512b) and the number of TRIM blocks supported (the "Data Set Management TRIM supported (limit 8 blocks)" line)?
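
For the comparison box, a few samples like the following while a TRIM/discard is running would be enough - d/s, dMB/s and dareq-sz are the discard columns in sysstat's iostat:

Code:
# extended stats in megabytes, 5-second intervals, 12 samples
iostat -mx 5 12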
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
I will try with the other system at some point, because I don't have access to it right now. Here is the output of hdparm, but this is a SAS drive, so it may not be accurate:
Code:
/dev/sdb:

ATA device, with non-removable media
Standards:
    Likely used: 1
Configuration:
    Logical        max    current
    cylinders    0    0
    heads        0    0
    sectors/track    0    0
    --
    Logical/Physical Sector size:           512 bytes
    device size with M = 1024*1024:           0 MBytes
    device size with M = 1000*1000:           0 MBytes 
    cache/buffer size  = unknown
Capabilities:
    IORDY not likely
    Cannot perform double-word IO
    R/W multiple sector transfer: not supported
    DMA: not supported
    PIO: pio0 

Smartctl reports this:
Logical block size: 512 bytes
Physical block size: 4096 bytes

Where do I get the "Data Set Management TRIM supported (limit 8 blocks)" information?
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Right, SAS vs SATA.

Try sdparm /dev/sdb -p bl | grep unmap

Code:
  Maximum unmap LBA count: -1 [unbounded]
  Maximum unmap block descriptor count: -1 [unbounded]
  Optimal unmap granularity: 8 blocks


Edit: While we're here, how about sdparm /dev/sdb --get WCE to see if write caching is enabled?
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
Code:
  Maximum unmap LBA count: -1 [unbounded]
  Maximum unmap block descriptor count: -1 [unbounded]
  Optimal unmap granularity: 16 blocks

  WCE 1 [cha: y, def: 1, sav: 1]
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
This could be a result of a fragmented pool - if it has to unmap in tiny little blocks rather than at the optimal granularity (16 blocks on yours, double that of my drive in question), your drives may have to spend more time doing internal housekeeping to make sure they keep the valuable data while only zapping what is supposed to be blanked out.
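
If you want to put a number on it, the FRAG column in zpool list reports free-space fragmentation (the kind that matters here) - for example:

Code:
# FRAG is free-space fragmentation, not file fragmentation
zpool list -o name,size,alloc,frag,capacity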

How far has it progressed so far? And the real test will be: if you TRIM again after a day or two, will it then go much faster?
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
Currently it is at 71% trimmed. I will retry after a day or two, but I don't think it will be faster. To me it seems like, for some reason, TrueNAS is not sending enough UNMAP requests to the pool.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
IvoT said:
Currently it is at 71% trimmed. I will retry after a day or two, but I don't think it will be faster. To me it seems like, for some reason, TrueNAS is not sending enough UNMAP requests to the pool.
It may be something specific to your drives (PM1643), as my system doesn't seem to have an issue pushing UNMAPs to its SSDs - not at the 400-500MB/s you mentioned, but mine are in use and not quite the same class as your Samsungs. Might be worth submitting a bug/Jira ticket for this, including a debug, to document the slow TRIM performance even when manually launched.
 

IvoT

Dabbler
Joined
Dec 19, 2023
Messages
15
It finished. I started another one, and it's the same thing - it goes as slowly as before.
 


NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
HoneyBadger said:
I need to make a longer effortpost on TRIM in general. Someone nag me if I haven't done it in a reasonable amount of time. :wink:
NAG

:smile:
 