
I have a long-running ZFS pool under Ubuntu which has been through many upgrades. After upgrading from Ubuntu 20 to 22, the encrypted filesystems refuse to mount, though the rest seem OK. zpool status -v reports a permanent error (ZFS-8000-8A) at the root of the encrypted filesystem. I note that the zfsutils-linux package version jumps from 0.8.3 to 2.1.5 across that upgrade.
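For reference, this is the command that reports the error (the pool name tank is a placeholder for mine):

    # Hypothetical pool name; substitute your own.
    zpool status -v tank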

The pool can still be imported to an Ubuntu 20 system, and the encrypted filesystem can still be mounted there. It looks OK. What's more, I'm able to zfs send from Ubuntu 20 and zfs receive on Ubuntu 22, with apparent success.
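For concreteness, that cross-version test was just a plain send/receive pipe; a minimal sketch, where the names (tank/encfs on the old side, newtank on the new side) are placeholders for mine:

    # Snapshot the encrypted filesystem, then send it non-raw.
    zfs snapshot tank/encfs@migrate
    zfs send tank/encfs@migrate | zfs receive newtank/encfs-test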

If I zfs send the entire pool (or do it in parts) from Ubuntu 20 (ZFS 0.8.3) to Ubuntu 22 (ZFS 2.1.5) and the operation succeeds, will I have created a pool free of upgrade problems? That is, will the receive operation build a pool that is fully up-to-date with ZFS? Or could the compatibility problems come across the link?
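As I understand it, feature flags live on the pool rather than in the stream, so a received dataset lands in whatever pool receives it, and that pool keeps the features it was created with under 2.1.5. That much I can inspect myself; a sketch, with newtank a placeholder:

    # List the feature flags and their state on the new pool.
    zpool get all newtank | grep feature@
    # Show which features the installed ZFS version supports.
    zpool upgrade -v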

I do not know enough about the level at which zfs send/receive operates to be sure that I won't have further corruption under Ubuntu 22.

I'm happy to re-encrypt the encrypted filesystems if necessary, rather than sending them raw. Everything would be happening locally: in this case the Ubuntu 20 system is running in an LXD VM with the disk devices attached, and the send/receive pipe is not crossing any network.
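As I understand it, a non-raw stream of an encrypted dataset is sent decrypted, so receiving it under an encrypted parent re-encrypts it with that parent's key. A rough sketch of what I have in mind, all names hypothetical:

    # Create an encrypted parent on the destination pool.
    zfs create -o encryption=on -o keyformat=passphrase newtank/secure
    # Receive the plaintext stream beneath it; the new child inherits encryption.
    zfs send tank/encfs@migrate | zfs receive newtank/secure/encfs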

Please note that I am aware of the advice to scrap the entire pool and restore it from backup. I'm attempting to recover without having to resort to that, since I appear to be able to read the pool.

All disks pass long SMART tests, and scrubbing the pool under Ubuntu 20 shows no errors.
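For completeness, those checks were along these lines (device and pool names are placeholders):

    # Long SMART self-test on each member disk, results read afterwards.
    smartctl -t long /dev/sda
    smartctl -a /dev/sda
    # Full scrub under Ubuntu 20, then check for errors.
    zpool scrub tank
    zpool status -v tank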

I'd be very pleased to be told (with citations) that the error is confined to the encrypted filesystems and that I can replace them and proceed without rebuilding the whole pool, but I don't know enough about ZFS internals to be sure of that. I would be interested in how to find out.

2 Answers


First, while the upgrade from 0.8.3 to 2.1.5 is surely a big one, it should not end with a (partially) broken pool. So I suggest opening a GitHub issue (or writing to the zfs-discuss mailing list).

That said, sending the pool without the --raw or --compressed options is certainly a good strategy to get all the data back: a full zfs send is very similar to logically traversing the entire pool with, say, rsync (there are important differences in recordsize handling, but they are not significant with regard to encryption).
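A minimal sketch of such a full, non-raw send, assuming a source pool tank and a destination pool newtank:

    # -R replicates the whole dataset tree with its snapshots and properties.
    # Without -w/--raw, encrypted datasets are sent decrypted.
    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs receive -d newtank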

If that fails, you can fall back to a plain rsync between the two pools/datasets.
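For example (mountpoints are placeholders; -aHAX preserves hard links, ACLs, and extended attributes on top of archive mode):

    rsync -aHAX /tank/data/ /newtank/data/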


I have worked around the problem and posted details on the zfs-discuss mailing list. I'll post the info here in case it helps others. My solution did involve long server downtime. There are patches available in Ubuntu bug #1987190 that may work better if you can't tolerate that.

I found several ZFS issues on GitHub relating to my problems. The bug has to do with metadata in filesystems that were created with native encryption on older ZFS versions.

I repaired the pool like this:

  1. Installed Ubuntu 22 on an external drive with a fresh ZFS pool and booted the server from it.

  2. Ran Ubuntu 20 in a VM on that system, and attached the server's pool member disks to that VM.

  3. Imported the old pool into the VM.

  4. Mounted the "damaged" filesystems in the VM.

  5. Ran zfs send from the old pool in the VM to zfs receive on the new pool in the host system; this fixed the metadata. (See the sketch after this list.)

  6. Checked, rechecked, and verified that all the data made it. (I used rsync -n --checksum on critical snapshots in .zfs/snapshot; also sketched below.)

  7. Ran extended SMART tests and a full scrub on the old pool.

  8. Booted the server back up into its Ubuntu 22. No rollback was necessary.

  9. Imported the fresh ZFS pool from the external disk.

  10. Destroyed the damaged filesystems.

  11. Ran zfs send from the fresh pool back to the old pool, recreating the damaged filesystems.
I did it this way because I was suspicious of the integrity of the old pool and did not want to touch it more than necessary. In hindsight, I think I could have kept the server running on the old pool under Ubuntu 22 and started the Ubuntu 20 VM on the server itself, but of course that would have failed if the OS itself had been on a damaged filesystem.

I hope this helps anyone else with the problem.
