
While experimenting with Proxmox VE, I ran into a strange performance problem:

VM disks can be stored (among other options) as individual raw ZFS zvols, or as qcow2 files on a single common dataset.
For some reason, sequential write performance to the zvols is massively worse than to the dataset, even though both reside on the same zpool.
This doesn't affect the VM's normal operation noticeably, but makes a massive difference when hibernating/RAM-snapshotting the VM (140 sec vs 44 sec for hibernating 32 GB RAM).

How can this occur when it's all the same data on the same zpool?


Here's what write performance looks like on (1) the dataset, (2) a zvol created by Proxmox, and (3) a manually created zvol with a larger volblocksize. Strangely, write throughput improves noticeably when (4) creating an ext4 filesystem on the exact same zvol and writing to that instead.
test.bin contains 16 GiB of urandom data to circumvent ZFS compression. I've run each test a few times and ended up roughly in the same ballpark each time, so caching doesn't seem to be much of a factor.
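For reference, the test file was generated roughly like this (assuming a tmpfs with enough room mounted at /mnt/ramdisk; sizes are illustrative):

> mount -t tmpfs -o size=20g tmpfs /mnt/ramdisk
> dd if=/dev/urandom of=/mnt/ramdisk/test.bin bs=1M count=16384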

# rpool/ROOT                  recordsize    128K      default
> dd if=/mnt/ramdisk/test.bin of=/var/lib/vz/images/test.bin bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 20.9524 s, 820 MB/s
# with conv=fdatasync, this drops to about 529 MB/s

# rpool/data/vm-112-disk-0    volblocksize  8K        default
> dd if=/mnt/ramdisk/test.bin of=/dev/zvol/rpool/data/vm-112-disk-0 bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 67.7121 s, 254 MB/s
# with conv=fdatasync, this drops to about 151 MB/s

# rpool/data/vm-112-disk-2    volblocksize  128K      -
> dd if=/mnt/ramdisk/test.bin of=/dev/zvol/rpool/data/vm-112-disk-2 bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 35.2894 s, 487 MB/s
# with conv=fdatasync, this drops to about 106 MB/s
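The 128K zvol used in (3) was created manually, along these lines (the volume size here is illustrative):

> zfs create -V 32G -o volblocksize=128K rpool/data/vm-112-disk-2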

> mkfs.ext4 /dev/zvol/rpool/data/vm-112-disk-2
> mount /dev/zvol/rpool/data/vm-112-disk-2 /mnt/tmpext4
> dd if=/mnt/ramdisk/test.bin of=/mnt/tmpext4/test.bin bs=1M status=progress
17179869184 bytes (17 GB, 16 GiB) copied, 23.7413 s, 724 MB/s
# with conv=fdatasync, this drops to about 301 MB/s
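dd is a fairly blunt benchmark; to cross-check the pattern, an fio sequential-write job along these lines should reproduce it (parameters are illustrative, and writing to the zvol device destroys its contents):

> fio --name=seqwrite --filename=/dev/zvol/rpool/data/vm-112-disk-2 \
      --rw=write --bs=1M --size=16G --ioengine=libaio --iodepth=4 \
      --direct=1 --group_reporting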

The system and zpool setup looks like this:

> uname -r
5.4.78-2-pve

> zfs version
zfs-0.8.5-pve1
zfs-kmod-0.8.5-pve1

> zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:10:19 with 0 errors on Sun Mar 14 00:34:20 2021
config:

    NAME                                                  STATE     READ WRITE CKSUM
    rpool                                                 ONLINE       0     0     0
      raidz2-0                                            ONLINE       0     0     0
        ata-ST4000NE001-xxxxxx_xxxxxxxV-part3             ONLINE       0     0     0
        ata-ST4000NE001-xxxxxx_xxxxxxx4-part3             ONLINE       0     0     0
        ata-ST4000NE001-xxxxxx_xxxxxxx3-part3             ONLINE       0     0     0
        ata-ST4000NE001-xxxxxx_xxxxxxx9-part3             ONLINE       0     0     0
        ata-ST4000NE001-xxxxxx_xxxxxxx6-part3             ONLINE       0     0     0
        ata-ST4000NE001-xxxxxx_xxxxxxxP-part3             ONLINE       0     0     0
    logs    
      mirror-1                                            ONLINE       0     0     0
        ata-INTEL_SSDSC2BB800G6_xxxxxxxxx5xxxxxxxx-part2  ONLINE       0     0     0
        ata-INTEL_SSDSC2BB800G6_xxxxxxxxx6xxxxxxxx-part2  ONLINE       0     0     0
    cache
      ata-INTEL_SSDSC2BB800G6_xxxxxxxxx5xxxxxxxx-part1    ONLINE       0     0     0
      ata-INTEL_SSDSC2BB800G6_xxxxxxxxx6xxxxxxxx-part1    ONLINE       0     0     0

errors: No known data errors
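For completeness, the recordsize/volblocksize and compression settings quoted in the test comments above can be confirmed directly:

> zfs get recordsize,compression rpool/ROOT
> zfs get volblocksize,compression rpool/data/vm-112-disk-0 rpool/data/vm-112-disk-2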

1 Answer

You were (and likely still are) experiencing OpenZFS issue #11407, "[Performance] Extreme performance penalty, holdups and write amplification when writing to ZVOLs".

Quoting sempervictus from that issue:

ZVOLs are pretty pathologically broken by design. A block device is a range of raw bits, full stop. A zvol is an abstraction presenting a range but mapping them across noncontiguous space with deterministic logic involved in linearizing it. So architecturally, it inherently will be slower to resolve a request. The fact that request paths elongate with data written prior to the zvol amplifies and exacerbates the poor design.

Atop that, zvols are almost never considered when changes are introduced, with performance regression after regression going into zfs for years and maintainers never having time or interest in the feature.

I've personally done extensive benchmarking of mechanical drives (spinning rust) for this exact issue, and kvm disk images (my preference today is raw) stored on datasets beat zvols in nearly all test cases, and exhibited "normal" system load, whereas zvol tests nearly always caused orders of magnitude more load. Some zvol test configs (fio with buffered=1) also reliably caused system instability, lock-ups and crashes.
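A practical takeaway: if you want to move an existing disk off a zvol and onto a raw image file in a directory-backed storage, a conversion along these lines should work (VM ID, paths and storage layout are illustrative for a default Proxmox setup; stop the VM first):

> qemu-img convert -p -f raw -O raw /dev/zvol/rpool/data/vm-112-disk-0 \
      /var/lib/vz/images/112/vm-112-disk-0.raw
# then point the VM's disk entry at the new file, e.g. in /etc/pve/qemu-server/112.conf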
