ZFS: dd'ing a 512-byte-sector hard drive to an ashift 12 (4K sector) ZFS volume

(edited to add: block-by-block, 1:1)

Well, looks like there's yet another thing I can't seem to understand or wrap my head around. I've been trying to do this for several days and just can't get it right. Many mistakes were made, but I just can't figure out the right solution.

What I'm trying to do sounds rather simple: I have a hard disk with 512-byte sectors. It's an old one, from before 4K sectors were even a consideration. Now I want to make a block-by-block, 1:1 copy of the entire disk to a ZFS volume on a pool that has a minimum ashift of 12, aka 4096. To me, this sounds like pretty much what ZFS volumes were intended for. I've been searching the internet, but either I'm looking at all the wrong info or I'm the first to try this rather trivial-sounding task.

So, I started with the following (modified for simplicity):
* zfs create -V 1T pool/volume: create the volume, leaving the blocksize untouched (resulting in the default 8K volblocksize, on top of a pool created with the ashift 12 sysctl in effect)
* dd if=/dev/harddisk of=/dev/zvol/pool/volume bs=256m: copy the data, blocksize set to speed up the process
* zpool import ...: try to import the zpool that was on the disk (now from the zvol), which shows me this: "UNAVAIL insufficient replicas"

At this point, I understood what had happened: the 512-byte blocks were copied onto 8K blocks, 16 at a time (or onto 4K blocks, 8 at a time), leaving the filesystem without the correct number of blocks. Logically, the next step was to reduce the block size of the newly created volume to 512. From that point on, things just didn't work out as planned.

So I did the exact same steps, but with block size 512. This time I used dd with a smaller blocksize because of something in the manpage about the "bs" operand: "...then each input block is copied to the output as a single block without any aggregation of short blocks." Great, sounds like what I need, right?
* zfs create -b 512 -V 1g pool/volume: create the volume, this time with an explicit 512-byte volblocksize
* dd if=/dev/harddisk of=/dev/zvol/pool/volume bs=512: copy the data, with the block size set to 512 to match
I didn't get to the point where I could import the old pool, because my volume's pool ran out of space. I checked, and apparently the new volume used 8 times (4096/512, not 8192/512) more disk space than it should. I'm fairly certain that if I had had the disk space it would have worked, because at least the number of blocks would be correct. Still, that's a massive overhead.
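For anyone wanting to reproduce that check, something like this should show the blow-up (same pool/volume names as in the commands above); comparing referenced against logicalreferenced is what I'd look at:
Code:
zfs get volsize,volblocksize,referenced,logicalreferenced pool/volume
zfs list -o space pool/volume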

So I thought that maybe I should somehow use 4K or 8K blocks together with ZLE (run-of-zeroes) compression and simply pad every 512-byte block with zeroes. I scoured the dd(1) page and it looked like I had found options that would do exactly that: ibs/obs/conv=sync. Especially after I read this on an IBM manpage (I thought IBM were the original creators of dd; apparently it's AT&T): "The conv=sync is required when reading from disk and when the file size is not a multiple of the diskette block size. Do not try this if the input to the dd command is a pipe instead of a file, it will pad most of the input with nulls instead of just the last block."
I started experimenting and have probably (hopefully) tried every single combination of these three options many times over (and others, like fsync etc.), hoping to get the right output. I tried piping dd from a reading process to a writing process, dd if=... | dd of=..., testing out all of these options on both sides (again, I hope). But nothing, absolutely nothing. Whatever I tried, I couldn't get the output of dd to pad blocks (I checked this thoroughly using test files and hexdump).
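For what it's worth, this is the kind of test-file check I mean (paths made up); as far as I can tell, conv=sync only pads a short read up to ibs and never pads full 512-byte blocks up to obs:
Code:
printf 'AAAA' > /tmp/short.bin                    # a single 4-byte (short) input block
dd if=/tmp/short.bin of=/tmp/padded.bin ibs=512 obs=4096 conv=sync status=none
stat -f %z /tmp/padded.bin                        # 512: padded to ibs, not to obs
hexdump -C /tmp/padded.bin                        # 4 bytes of data followed by zeroes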

Disappointed, I looked back at the metaphorical other side of the equation: the ZFS volume itself. In the OpenZFS documentation I read the following: "It’s meaningless to set volblocksize less than guest FS’s block size or ashift". OK, fine, maybe that's the reason. Next experimentation: sysctl vfs.zfs.min_auto_ashift=9. That should solve things, right? ... Spoiler alert: I would be writing a tutorial instead of a question if it did.

And that's pretty much where I hit two dead ends. Somehow I need to solve this, but I'm all out of ideas. If anyone has a suggestion, or can help me get my ZFS or dd command line right, I'm all for it.
For the record: there is a simple solution, and that is to forgo ZFS volumes and use a plain dd image on a filesystem/ZFS dataset. That's perfectly fine to work with, but I have the feeling that this is exactly what ZFS volumes are intended for and that it should be possible. Folks trying to virtualize physical machines especially should be hitting this over and over again, right? I can't understand why the internet isn't filled with this question, so either it must be extremely trivial and I'm just dense, or people resorted to workarounds like I'm probably going to.

If it helps to try out some things, here's the last script I used to try out the ashift idea:
Code:
#!/bin/sh

# provide "off" as parameter to simply turn compression off; doesn't make a difference as far as I could see
COMPRESS=${1:-zle}

# restore ashift just in case I ctrl+c'ed in the middle of the script or hit an error
sysctl vfs.zfs.min_auto_ashift=12

# source: a 32 MB memory disk with 512-byte sectors, filled with random data
SRC=$(mdconfig -s 32m -S 512)
dd if=/dev/random bs=512 of=/dev/${SRC} status=progress

# destination pool vdev: a 1 GB memory disk with 4096-byte sectors
POOL=$(mdconfig -s 1g -S 4096)
#POOL=$(mdconfig -s 128m -S 4096)
zpool create mem ${POOL}
# the idea under test: lower the minimum ashift before creating the 512-byte-block volume
sysctl vfs.zfs.min_auto_ashift=9
zfs create -b 512 -o compression=${COMPRESS} -V 32m mem/volume
#zfs create -b 4096 -o compression=${COMPRESS} -V 32m mem/volume
#zfs create -b 8192 -o compression=${COMPRESS} -V 32m mem/volume

# zero-fill the volume first (dd stops with a "no space" error once the 32 MB zvol is full)
dd if=/dev/zero of=/dev/zvol/mem/volume ibs=512 obs=4096 conv=sync

# scratch UFS filesystem to hold the hexdump output for comparison
#HEX=$(mdconfig -s 250m)
HEX=$(mdconfig -s 2g)
newfs /dev/${HEX}
mount /dev/${HEX} /mnt
hexdump -v /dev/${SRC} > /mnt/orig.hex
dd if=/dev/${SRC} bs=512 of=/dev/zvol/mem/volume status=progress
#dd if=/dev/${SRC} of=/dev/zvol/mem/volume status=progress ibs=512 obs=4096 conv=sync
hexdump -v /dev/zvol/mem/volume > /mnt/volume.hex
echo "=========== COMPARISON ==========="
diff -sq /mnt/orig.hex /mnt/volume.hex
echo "=========== /COMPARISON =========="
sysctl vfs.zfs.min_auto_ashift=12

zfs get all mem/volume

zpool destroy mem
umount /mnt
mdconfig -d -u $(echo ${HEX} | sed 's/md//')
mdconfig -d -u $(echo ${POOL} | sed 's/md//')
mdconfig -d -u $(echo ${SRC} | sed 's/md//')
 
The source sounds like it's not a ZFS pool/dataset?
Your zpool import: I'm not sure what pool you are trying to import. zvols are a type of dataset on an existing zpool, so there should be no need to "import" the zvol.
Example:
you have a zpool; you create a zvol dataset on that zpool
you can then create a new UFS filesystem on that zvol and mount that UFS filesystem
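Something like this, with made-up pool and dataset names:
Code:
zfs create -V 10g pool/ufsvol
newfs /dev/zvol/pool/ufsvol
mount /dev/zvol/pool/ufsvol /mnt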

To me the key is the source: is this something that had ZFS on it before, or another type of filesystem? And when you say zpool import, what are you trying to import? The zvol? In theory, "create a zpool, then a zvol, then create a zpool on that zvol and import that" sounds like it should be possible, but I just don't know.
 
  1. Using a 512-byte blocksize zvol on a pool with ashift=12=4k means you will use 8x the space.
  2. I don't think importing a pool from a zvol (layering ZFS on ZFS) is recommended or potentially even supported anymore.
  3. If the old disk you are migrating from is a zpool, why not use a snapshot and send the old pool contents over to a new location on your new pool (rather than imaging the physical disk to a zvol and trying to import it as a second pool)?
 
The source sounds like it's not a ZFS pool/dataset?
Your zpool import: I'm not sure what pool you are trying to import. zvols are a type of dataset on an existing zpool, so there should be no need to "import" the zvol.
Example:
you have a zpool; you create a zvol dataset on that zpool
you can then create a new UFS filesystem on that zvol and mount that UFS filesystem

To me the key is the source: is this something that had ZFS on it before, or another type of filesystem? And when you say zpool import, what are you trying to import? The zvol? In theory, "create a zpool, then a zvol, then create a zpool on that zvol and import that" sounds like it should be possible, but I just don't know.
I don't think importing a pool from a zvol (layering ZFS on ZFS) is recommended or potentially even supported anymore.
Of course it's supported to import a ZFS pool from a zvol. If it weren't, there would be no way to run bhyve virtual machines (and possibly even jails) that use a ZFS root on top of a ZFS pool. And it's perfectly possible for these to run their own jails as well. A ZFS volume is (depending on volmode) a perfectly viable source for a ZFS pool; it registers just like any disk. Of course there will be peculiarities, but that doesn't matter right here and now.
If the old disk you are migrating from is a zpool, why not use a snapshot and send the old pool contents over to a new location on your new pool (rather than imaging the physical disk to a zvol and trying to import it as a second pool)?
No, I want a block-by-block copy, not just a ZFS send of the data. It doesn't matter that there used to be a zpool on one partition; I want the exact structure and exact contents of this disk mirrored onto a ZFS volume. (edit: corrected incorrect presumption)
Using a 512-byte blocksize zvol on a pool with ashift=12=4k means you will use 8x the space.
Correct, and that's my problem. I'm looking for a solution that prevents this much space from being used. ZFS supports compression and sparse data, does it not?
Use "clone" to copy files instead of raw dd copy ?
Locally, man clone doesn't give me anything. Do you mean sysutils/clone? Because as far as I can see, that simply does a file copy. That's not what I'm looking for; I'm looking for a solution that clones an entire disk block-by-block to a ZFS volume.
 
You'd better use a zpool on a file instead of a zvol.
Why would that be? That's what I do now out of necessity: a disk image on top of a ZFS dataset. I don't see why that would be preferred. If I wanted to migrate that disk into a VM, a ZFS volume would be a much more logical (and more performant) choice.
 
Of course it's supported to import a ZFS pool from a zvol. If it weren't, there would be no way to run bhyve virtual machines (and possibly even jails) that use a ZFS root on top of a ZFS pool. And it's perfectly possible for these to run their own jails as well. A ZFS volume is (depending on volmode) a perfectly viable source for a ZFS pool; it registers just like any disk. Of course there will be peculiarities, but that doesn't matter right here and now.
That's a very different use case. In that situation, an entirely separate ZFS stack is running in the kernel inside the bhyve VM, and that stack imports the zpool, which happens to be served from a vdev; the VM isn't aware of that and doesn't need to track it. This is not the same as asking ZFS to import a pool running off a vdev it is itself managing, on top of a pool it is already supporting.

It may work, but I would certainly avoid it for most use cases. It looks like there is a switch to explicitly let you shoot your own foot if you like, however:

Code:
$ sysctl -d vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: Allow zpools to use zvols as vdevs (DANGEROUS)


Try it if you're feeling adventurous with your data, but don't expect a lot of support if it doesn't work.
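If you do want to try it, flipping the switch should just be a matter of (assuming the sysctl is writable at runtime):
Code:
sysctl vfs.zfs.vol.recursive=1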

Correct, and that's my problem. I'm looking for a solution that prevents this much space from being used. ZFS supports compression and sparse data, does it not?
It does, but the ashift essentially sets the smallest record size it will ever store data on the pool with (exceptions for all-zero and extremely small/compressible blocks; see embedded_data in zpool-features(7)). (With raidz and mirrors, that smallest size grows further for redundancy and alignment constraints.) On an ashift=12 (4k) zpool, each record has to be at least 4k (exceptions above), and with volblocksize=512 you've asked it to make all blocks (the smallest atomic I/O size accessible via the zvol) 512 bytes. To meet both requirements you've set, it sacrifices 7/8 of the storage space.

If you want to keep it as a separate pool and you want to import that pool concurrently (same kernel image/ZFS stack), I would go with a plain file for now; just give dd the desired path for of=/path/to/image/file. Unless you're doing database transactions on it, you're unlikely to notice a large performance benefit with a zvol vs. a file-backed image. You could try putting that file on a filesystem with recordsize=32k, for example, to perhaps get a little better random I/O performance, and still have the possibility of compression.
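A rough sketch of that route, with made-up dataset, file, and pool names:
Code:
zfs create -o recordsize=32k pool/images
dd if=/dev/harddisk of=/pool/images/olddisk.img bs=1m status=progress
# attach the image as a memory disk and look for the old pool on it
mdconfig -a -t vnode -f /pool/images/olddisk.img
zpool import                                  # should list the old pool on the new mdN device
zpool import -o readonly=on -R /mnt/old oldpool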
 
That's a very different use case. In that situation, an entirely separate ZFS stack is running in the kernel inside the bhyve VM, and that stack imports the zpool, which happens to be served from a vdev; the VM isn't aware of that and doesn't need to track it. This is not the same as asking ZFS to import a pool running off a vdev it is itself managing, on top of a pool it is already supporting.

It may work, but I would certainly avoid it for most use cases. It looks like there is a switch to explicitly let you shoot your own foot if you like, however:

Code:
 $ sysctl -d vfs.zfs.vol.recursive
vfs.zfs.vol.recursive: Allow zpools to use zvols as vdevs (DANGEROUS)

Try it if you're feeling adventurous with your data, but don't expect a lot of support if it doesn't work.
You are right that it's a different use case, and I thank you for providing me with this flag. Apparently that was the reason why it wouldn't import my copied pool. It's a bit odd (and I would say there's a real possibility of shooting myself in the foot) that it works at all, considering the data was dd'ed to a ZFS volume with an 8K blocksize, but it does appear to work. I presume that's because the original partition was already aligned on 4k boundaries (as it was) and somehow ZFS managed to work everything out.
Still wondering if I'm right though... I might have to look further into this before I declare some sort of - accidental - victory.
I'm not sure that sysctl still serves a function anymore, though. As far as I can see it's something that predates OpenZFS; the only reference to it is in an UPDATING entry from 2016. On Linux such a check apparently doesn't exist in the OpenZFS code, although I have no idea how much the internals on Linux and FreeBSD differ such that FreeBSD would still require it. That all said, the code doesn't seem to imply in any way that it's a dangerous flag, but the sysctl description does. So indeed, use with care unless a developer chips in.

That said, it's still a nice experiment. I'm currently trying out sysutils/ddrescue to see if I can do something with it; still reading the docs, though.

Added as an edit: that zvol might later be shifted to a VM or even back to a physical machine as some sort of failover, so I still consider it a pretty decent use case.
 
Also, this is what I'm banking on: man zfsprops(7):
Code:
Any block being compressed must be no larger than 7/8 of its original
size after compression, otherwise the compression will not be
considered worthwhile and the block saved uncompressed.
A 512-byte chunk of data in a 4096-byte block, padded with zeroes and compressed with ZLE, automatically ends up a bit (figuratively speaking, not a literal 0 or 1) over 1/8th of its original size. That's why I'm trying to use compression=zle, but for that I'll first have to find a way to pad the blocks, transforming them from 512 bytes to 4096 bytes.
 
Hmm, I was hoping that sysutils/ddrescue could do the padding I'm after. It has something called fill mode, but that's not it.
So, if someone knows of a utility, be it from base, ports, or even compiled from somewhere else, that would do this, please let me know. Right now I'm considering just writing a little tool that does exactly this, hardcoded to 512 and 4096 bytes.
In fact, I'll probably try this now.
 
Right, but when you use volblocksize=512, the block being compressed is 512 B. It always fits in the 4k (ashift) record size, so unless it can fit into 112 B after compression (embedded_data), compression can't buy you anything. (And you'll always be using 4k for each 512 B block.)

I'm not following what you are planning to achieve by padding up a 512B block to 4096, but have fun!
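(For reference, embedded_data is a pool feature you can query; the pool name here is just an example.)
Code:
zpool get feature@embedded_data pool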
 
I'm not following what you are planning to achieve by padding up a 512B block to 4096, but have fun!
Yes, I think I wasn't reading it right. I thought I could keep the same number of blocks, each 7/8 filled with zeroes and nicely compressible with ZLE. But it appears that reasoning was indeed false.

What I was really hoping for at first, though, is that I could use a 512-byte volblocksize and ZFS would be smart enough in the background to put the blocks together into 4096-byte (or larger) blocks of data. Basically I was hoping (and I still think that should be possible, not saying it _is_ possible) that somehow I could make the volblocksize a presentation thing rather than a storage thing. The padding was something I thought of in the hope that it would somehow achieve that, but as I said, that appears to have been bad reasoning.
 
LOL ... apparently diskinfo tells me that the sector size the zvol presents to FreeBSD is 512 bytes. So there should be no issue dd'ing a 512-byte disk to a ZFS volume created with the default volblocksize. FreeBSD is basically already doing what I thought I had to do manually somehow. Wow, now I feel even more stupid than usual :oops:
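For anyone else checking, something along these lines shows it (the volume path is an example):
Code:
diskinfo -v /dev/zvol/pool/volume | grep -E 'sectorsize|stripesize'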

The only thing I really needed to find was the "vfs.zfs.vol.recursive" sysctl, which I didn't know about. I went down a completely wrong train of thought based on the errors I saw.

Well, guess I can start my ZFS undestroy attempt (cf. Thread undestroy-destroyed-zfs-dataset-snapshots.89611) now. If that goes as well as this one, I might as well just give up beforehand.
 
As I mentioned over there, I think you're looking at a steep climb, with likely unsatisfactory results at the end.

But as an academic exercise if you want to learn more about how ZFS puts data on the disk, by all means.
 
I'm completely lost after reading the entire thread, so just a comment.
Next experimentation: sysctl vfs.zfs.min_auto_ashift=9
This only controls the minimum ashift that will be used when creating a pool if it's not specified explicitly; it has nothing to do with creating datasets. It was also deprecated and replaced with vfs.zfs.vdev.min_auto_ashift, so its purpose is more visible.
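If you want to see what a pool actually ended up with, reading a vdev label should show it (the device name is an example):
Code:
zdb -l /dev/md0 | grep ashift
sysctl vfs.zfs.vdev.min_auto_ashift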
 
Well, it seems I may have jumped for joy too early. I can mount that pool on a ZFS volume perfectly fine in read-only mode, read all the files, and they all look fine. But as soon as I try to mount it in normal read-write mode, things just go haywire. For some reason zpool import keeps going and going with, as far as I can see, nothing happening. It's like it hits a deadlock somewhere. I'll see if I can enable debugging somehow, but right now I don't see anything in messages or anywhere else.

As I mentioned over there, I think you're looking at a steep climb, with likely unsatisfactory results at the end.
You know what they say: no pain, no gain. Or put differently: unless you try, you'll never know the outcome.

I haven't had time to look into this further today; I was busy trying to recover hacked Intel Management Engine garbage. Harder than it seems, apparently, but I'll get there.
 
I can mount that pool on a ZFS volume perfectly fine in read-only mode, read all the files, and they all look fine. But as soon as I try to mount it in normal read-write mode, things just go haywire. For some reason zpool import keeps going and going with, as far as I can see, nothing happening.

vfs.zfs.vol.recursive: Allow zpools to use zvols as vdevs (DANGEROUS)

A deadlock in read/write mode seems like exactly the sort of thing that a DANGEROUS flag might be warning about.
 
A deadlock in read/write mode seems like exactly the sort of thing that a DANGEROUS flag might be warning about.
You're probably right, but I still have to reiterate that the Linux version doesn't need such protection. So it would presumably be something outside of ZFS that requires this? Do you or anyone else have an idea what this difference might be? Is there something in the kernel that deadlocks? When I can reboot that system without too much risk I'll try to verify, but like I said, I'm busy with the BIOS/ME side of things now.
 
I've deadlocked OpenZFS on Linux by trying to import a pool on zvols. It's just not a supported operation, and it's not specific to FreeBSD (though FreeBSD has a guardrail to prevent you from accidentally freezing up your system).

The supported way to use pools on top of ZFS is to use them in files.
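For example (the file path and pool name are made up):
Code:
truncate -s 10g /pool/images/nested.img
zpool create nested /pool/images/nested.img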
 
It's been marked 'DANGEROUS' (on FreeBSD) since the merging of FreeBSD support code into upstream OpenZFS.

It appears to suggest it may work on Linux (there are tests for the creation of such pools, although it's not clear that they test any I/O on them).

If you want answers as to why, and if it will ever be supported, I would suggest the zfs-devel mailing list.

My guess is that there is some lock contention occurring at the layers where actual I/O is scheduled to disk (which will be very different for Linux vs. FreeBSD). Think about what happens when you write to your nested (zpool-on-zvol) device. ZFS (which is heavily multithreaded) at some point needs to decide it's ready to write to disk. ZFS is very picky about writing to disk, in order to (attempt to) eliminate corruption in the event of power loss or another unexpected event. As such, it will need to lock something to gain exclusive access to the device / device write queue. This request then (continuing down the stack) hits ZFS again to write this change to the outer pool (hosting the zvol), and it again needs to lock something to say "I'm writing over here"...

My guess on FreeBSD is that these locks are related / dependent / the same, and when the inner pool tries to write, its lock prevents the outer pool from being able to actually complete the write to the physical devices.

Just a wild-*-guess, but it would fit the "read-only works, writes hang" observations.
 
Well, this was surprising. I tried running zpool import through truss (which records system calls) and this was the output. Note the O_RDONLY flag, even though the command still hangs. Apparently it's not the read/write aspect that blocks; it seems more random than that. Or maybe it's because I'm running it through truss now, I don't really know.
Code:
openat(AT_FDCWD,"/dev/zvol/<path>/disk0",O_RDONLY|O_EXCL|O_CLOEXEC,00) = 6 (0x6)
close(6)                                         = 0 (0x0)
madvise(0x849fbf000,4096,MADV_FREE)              = 0 (0x0)
madvise(0x8464de000,507904,MADV_FREE)            = 0 (0x0)
<thread 102182 exited>
<thread 102180 exited>
openat(AT_FDCWD,"/dev/zvol/<path>/disk0p3",O_RDONLY|O_EXCL|O_CLOEXEC,00) = 6 (0x6)
close(6)                                         = 0 (0x0)
ioctl(3,0xc0185a05 { IORW 0x5a('Z'), 5, 24 },0x820fb0418) ERR#2 'No such file or directory'
ioctl(3,0xc0185a06 { IORW 0x5a('Z'), 6, 24 },0x820fb04b8) = 0 (0x0)
__sysctl("kern.hostid",2,0x820fb2088,0x820fb2090,0x0,0) = 0 (0x0)

I'm trying to understand openat(2) to see what is happening, but according to errno(2) error code 6 means "device not configured". I'm going to rebuild my local sources because I didn't include debug files, but something tells me that this shouldn't be happening. I'll also try this on another machine; maybe it's a local issue.
Note that I have no idea if this is actually the very last part of the execution; I'll need to debug it to see more. Note to self: run truss -o <output-file> next time instead of a pipe.
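Something like this should keep the whole trace around (the output path is just an example, and the pool name is whatever the copied pool is called):
Code:
truss -f -o /tmp/zpool-import.truss zpool import -R /mnt/old oldpool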

Edit: same result on a different machine; openat(2) returns 6 and those ioctl(2) calls also return an "ERR#2 'No such file or directory'" error. This is probably way out of my depth, as I see it. I barely have an idea of what I'm really looking at.
Edit 2: shows how much I know about this; the 0x6 is probably not an error but a file descriptor or something. I'm guessing that part is perfectly fine as it is.
 