ZFS Deduplication
ZFS supports deduplication as a feature. Deduplication means that identical data is only stored once, which can significantly reduce storage size. However, deduplication is a trade-off among many factors, including cost, speed, and resource needs. Consider and understand the implications of using deduplication before adding it to a pool.
Deduplication is one technique ZFS can use to store files and other data in a pool. If several files contain the same pieces (blocks) of data, or any other pool data occurs more than once in the pool, ZFS stores just one copy of it. Instead of storing many copies of the same data, it stores one copy and an arbitrary number of pointers to that copy. Only when no file uses the data is the data deleted. ZFS keeps a reference table that links files and pool data to the actual storage blocks containing their data. This is the deduplication table (DDT).
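As a toy illustration of the underlying idea (not actual ZFS code), content checksums are how identical blocks can be detected so that only one physical copy needs storing; ZFS does this per block with a cryptographic checksum:

```shell
# Toy sketch of block-level dedup detection (not ZFS itself):
# two files with identical content produce identical checksums,
# so a deduplicating store keeps one physical copy plus references.
printf 'the same block of data' > copy1.bin
printf 'the same block of data' > copy2.bin
sha256sum copy1.bin copy2.bin                                      # identical hashes
sha256sum copy1.bin copy2.bin | awk '{print $1}' | sort -u | wc -l # -> 1 unique block
```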
The DDT is a fundamental ZFS structure and is part of the metadata of the pool. If a pool (or any dataset in the pool) has ever contained deduplicated data, the pool contains a DDT, and that DDT is as fundamental to the pool data as any of its other file system tables. Like any other metadata, DDT contents might be temporarily held in the ARC (RAM/memory cache) or L2ARC (disk cache) for speed and repeated use, but the DDT is not a disk cache. It is a fundamental part of the ZFS pool structure and how ZFS organizes pool data on its disks. Therefore, like any other pool data, if DDT data is lost, the pool is likely to become unreadable. The DDT is not needed for reads, but is required for any writes or deletions of deduplicated blocks, so it is important to store it on redundant devices.
A pool can contain any coexisting mix of deduplicated and non-deduplicated data. If deduplication is enabled at the time of writing, the DDT is used to write data. If deduplication is not enabled at the time of writing, ZFS writes the data non-deduplicated. The data then remains as it was written until it is deleted.
The only way to convert existing data to deduplicated or non-deduplicated form, or to change how it is deduplicated, is to create a new copy while the new settings are active.
Copy the data within a file system or to a different file system, or replicate it using the Web UI replication functions.
Data in snapshots is fixed and can only be changed by replicating the snapshot to a different pool with different settings (which preserves its snapshot status) or copying its contents.
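As a sketch, one way to rewrite a dataset so it is stored under new deduplication settings is to enable the setting and then replicate the data to a new dataset (the pool and dataset names here are hypothetical):

```shell
# Enable dedup for future writes to this dataset (hypothetical names).
zfs set dedup=on tank/projects

# Existing data is unchanged; rewrite it under the new setting by
# replicating to a new dataset, which also preserves snapshot contents.
zfs snapshot tank/projects@migrate
zfs send tank/projects@migrate | zfs recv tank/projects-dedup
```

After verifying the copy, the old dataset can be destroyed and the new one renamed into its place.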
Deduplication can be enabled for only specified datasets and volumes in a pool. The DDT encompasses the entire pool, but only data in these locations is deduplicated when written. Other data, where the data does not deduplicate well or deduplication is not appropriate, is not deduplicated when written, saving resources.
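Per-dataset control uses the `dedup` dataset property; a sketch with hypothetical dataset names:

```shell
# Deduplicate only selected datasets (hypothetical names).
zfs set dedup=on tank/vm-images   # highly duplicated data
zfs set dedup=off tank/media      # data that deduplicates poorly
zfs get -r dedup tank             # review the effective settings
```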
Fast deduplication is a feature included in OpenZFS 2.3.0. It makes backend changes to legacy deduplication in ZFS that improve performance and can reduce latency in some use cases. These improvements speed up I/O and look-ups, reclaim storage space, and, for pools handling reasonable workloads, improve latency over legacy dedup. Fast deduplication accomplishes this through four new functions: a DDT log, prefetch, pruning, and a quota.
Writing DDT entries in random order as they arrive causes excessive write inflation, since a single DDT record write might require rewriting a whole ZFS attribute processor (ZAP) leaf block. Instead, fast dedup temporarily writes entries into a log and flushes the log into the actual DDT ZAP only after sorting. The improved write locality allows aggregating multiple DDT entry writes into one ZAP leaf write.
Prefetch fills the ARC cache by loading deduplication tables into it. Loading the DDT into memory speeds up operations by reducing on-demand disk reads for every record the system processes. The prefetch is particularly important in systems with large deduplication tables (DDTs) where loading the table on demand can take days after an import/reboot. The prefetch might also be used to reload portions of a DDT evicted due to inactivity into the ARC.
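Where the fast-dedup CLI additions are available, the prefetch can be triggered manually after a pool import; a sketch assuming the `zpool prefetch` subcommand and a hypothetical pool name:

```shell
# Ask ZFS to load the dedup tables into the ARC after import,
# instead of faulting them in on demand during normal I/O.
zpool prefetch -t ddt tank
```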
Pruning cleans up old, non-duplicated (unique) records in the deduplication table (DDT) to reclaim storage and improve performance when the DDT becomes too large. The space reclaimed by pruning also helps keep the DDT within its quota.
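A sketch of manual pruning, assuming the `zpool ddtprune` subcommand from OpenZFS fast dedup is available (pool name hypothetical):

```shell
# Remove single-reference (unique) DDT entries to shrink the table.
zpool ddtprune -p 10 tank   # prune the oldest 10% of unique entries
zpool ddtprune -d 90 tank   # or prune unique entries older than 90 days
```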
The quota manages the deduplication table (DDT) by keeping it from the unbounded growth that can hurt RAM use and performance. Setting a quota for the on-disk DDT effectively disables new entries for blocks once the allotted space reaches the upper limit. It works for both legacy and fast dedup tables.
There are three quota options: Auto, Custom, and None. Auto is the default option: the system determines the quota, using the size of a dedicated dedup vdev as the limit. Custom allows administrators to set a quota manually. None leaves the DDT unrestricted and disables the quota.
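At the CLI, these options correspond to values of a pool property; a sketch assuming the `dedup_table_quota` pool property from OpenZFS fast dedup (pool name hypothetical):

```shell
# DDT quota settings (values: auto, none, or an explicit size).
zpool set dedup_table_quota=auto tank   # default: sized from the dedup vdev
zpool set dedup_table_quota=10G tank    # custom upper limit
zpool set dedup_table_quota=none tank   # unrestricted, quota disabled
```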
The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, a server storing thousands or even millions of copies of files with identical blocks consumes almost no extra disk space beyond one copy. When data is read or written, it is also possible that a large block read or write can be replaced by a smaller DDT read or write, reducing disk I/O size and quantity.
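To judge whether deduplication is paying off on an existing pool, the achieved ratio and DDT statistics can be inspected; a sketch with a hypothetical pool name:

```shell
zpool list -o name,size,alloc,dedupratio tank   # overall dedup ratio
zdb -DD tank                                    # DDT histogram and entry counts
```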
The deduplication process is very demanding! There are four main costs to using deduplication: large amounts of RAM, fast SSDs, CPU resources, and a general performance reduction. The trade-off with deduplication is reduced server RAM/CPU/SSD performance and loss of top-end I/O speeds in exchange for saving storage size and pool expenditures.
When data is not sufficiently duplicated, deduplication wastes resources, slows the server down, and has no benefit. Even when data is heavily duplicated, consider the costs, hardware demands, and performance impact before enabling deduplication on a ZFS pool.
High-quality mirrored SSDs configured as a special vdev for the DDT (and usually all metadata) are strongly recommended for deduplication unless the entire pool is built with high-quality SSDs. Expect potentially severe issues if these are not used as described below. NVMe SSDs are recommended whenever possible. SSDs must be large enough to store all metadata.
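Adding a mirrored special vdev is a single pool operation; a sketch with hypothetical device names:

```shell
# Add a mirrored SSD special vdev for metadata and the DDT
# (device names are hypothetical; use your actual disks).
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
```

Note that only metadata written after the special vdev is added lands on it; existing metadata stays on the original disks until the data is rewritten.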
The deduplication table (DDT) contains small entries about 300-900 bytes in size. It is primarily accessed using 4K reads. This places extreme demand on the disks containing the DDT.
When choosing SSDs, remember that a deduplication-enabled server can have considerable mixed I/O and very long sustained access. Try to find real-world performance data wherever possible. It is recommended to use SSDs that do not rely on a limited amount of fast cache to bolster weak continual bandwidth performance. Most SSDs' performance (latency) drops when the onboard cache fills and more writes occur. Always review the steady state performance for 4K random mixed read/write.
Special vdev SSDs receive continuous, heavy I/O. HDDs and many common SSDs are inadequate. As of 2021, some recommended SSDs for deduplicated ZFS include Intel Optane 900p, 905p, P48xx, and better devices. Lower-cost solutions are high-quality consumer SSDs such as the Samsung EVO and PRO models. PCIe NVMe SSDs (NVMe, M.2 “M” key, or U.2) are recommended over SATA SSDs (SATA or M.2 “B” key).
When special vdevs cannot contain all the pool metadata, metadata is silently stored on other disks in the pool. When special vdevs become too full (about 85%-90% usage), ZFS cannot run optimally and the disks operate slower. Try to keep special vdev usage under 65%-70% capacity whenever possible. Try to plan how much future data you want to add to the pool, as this increases the amount of metadata in the pool. More special vdevs can be added to a pool when more metadata storage is needed.
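The headroom target above can be turned into a quick planning calculation; a sketch using an example capacity figure:

```shell
# Back-of-envelope headroom check: keep special vdev usage under ~70%.
special_gib=800   # mirrored special vdev capacity (example figure)
target_pct=70
usable_gib=$(( special_gib * target_pct / 100 ))
echo "Plan for at most ${usable_gib} GiB of metadata"   # 560 GiB
```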
Deduplication is memory intensive. When the system does not contain sufficient RAM, it cannot keep the DDT cached in memory, and system performance can decrease.
The RAM requirement depends on the size of the DDT and the amount of data to be stored in the pool: the more duplicated the data, the fewer unique entries and the smaller the DDT. Pools suitable for deduplication, with deduplication ratios of 3x or more (data can be reduced to a third or less in size), might only need 1-3 GB of RAM per 1 TB of data. The actual DDT size can be estimated by deduplicating a limited amount of data in a temporary test pool.
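A rough back-of-envelope estimate can also be computed directly, assuming about 400 bytes per DDT entry (the article cites 300-900 bytes) and an average block size of 128 KiB; only unique blocks need entries, so divide by the expected dedup ratio:

```shell
# Illustrative DDT size estimate, not a guarantee:
# ~400 bytes per entry, 128 KiB average block size.
data_bytes=$(( 10 * 1024 * 1024 * 1024 * 1024 ))   # 10 TiB of unique data
entries=$(( data_bytes / (128 * 1024) ))
ddt_mib=$(( entries * 400 / 1024 / 1024 ))
echo "entries=${entries} ddt=${ddt_mib} MiB"       # entries=83886080 ddt=32000 MiB
```

At a 3x dedup ratio only a third of the blocks are unique, which brings this into the 1-3 GB per TB range cited above.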
Deduplication consumes extensive CPU resources, and it is recommended to use a high-end CPU with at least 4-6 cores.
If deduplication is used in an inadequately built system, these symptoms might be seen:
- Symptom: Disk I/O slowdown. Cause: The system must perform disk I/O to fetch DDT entries, but these are usually 4K operations that the underlying disk hardware is unable to service in a timely manner.
- Solutions: Add high-quality SSDs as a special vdev and either move the data or rebuild the pool to use the new storage.
- Symptom: Network and service timeouts. Cause: This is a byproduct of the disk I/O slowdown. Network buffers become congested with incomplete requests for file data while the entire ZFS I/O system is delayed by tens or hundreds of seconds as huge numbers of DDT entries are fetched. Timeouts occur when networking buffers can no longer handle the demand, and because all services on a network connection share the same buffers, all become blocked. This usually appears as file activity working for a while and then unexpectedly stalling, after which file and networked sessions fail as well. Services can become responsive again when the disk I/O backlog clears, but this can take several minutes. The problem is more likely with high-speed networking because the network buffers fill faster.
- Symptom: CPU overload. Cause: When ZFS has fast special vdev SSDs, sufficient RAM, and is not limited by disk I/O, the hash calculation becomes the next bottleneck. Most of the ZFS CPU consumption comes from trying to keep hashing up to date with disk I/O. When the CPU is overburdened, the console becomes unresponsive and the web UI fails to connect. Other tasks might not run properly because of timeouts. This is often encountered during pool scrubs, and it can be necessary to pause the scrub temporarily when other tasks are a priority.
- Diagnose: An easily seen symptom is that console logins or prompts take several seconds to display. Generally, multiple kernel threads named z_rd_int_[NUMBER] can be seen using the CPU capacity, and the CPU is heavily (98%+) used with almost no idle time.
- Solutions: Changing to a higher-performance CPU can help but might have limited benefits. 40-core CPUs have been observed to struggle as much as 4- or 8-core CPUs. A usual workaround is to temporarily pause scrubs and other background ZFS activities that generate large amounts of hashing. It can also be possible to limit I/O using tunables that control disk queues and disk I/O ceilings, but this can impact general performance and is not recommended.
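One way to check for these hashing worker threads is to sort processes by CPU usage; a Linux (procps) sketch, with the thread name taken from this article:

```shell
# Look for ZFS read-interrupt worker threads dominating the CPU.
# Prints up to five matches, or a note if none are visible.
ps -eo pcpu,comm --sort=-pcpu | grep -m5 'z_rd_int' \
  || echo "no z_rd_int threads found"
```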