ZFS: Why can't I use a ZVOL as a VDEV?

Hi all,

While trying to build a mirror of stripes with ZFS (thread), I attempted to create a pool using zvols as vdevs:

Code:
# zpool create -f ZP-MIRROR-01 mirror /dev/zvol/ZP-STRIPE-01/ZP-STRIPE-01_zvol /dev/zvol/ZP-STRIPE-02/ZP-STRIPE-02_zvol
cannot create 'ZP-MIRROR-01': no such pool or dataset

where /dev/zvol/ZP-STRIPE-01/ZP-STRIPE-01_zvol and /dev/zvol/ZP-STRIPE-02/ZP-STRIPE-02_zvol are both zvols.

However, as the output shows, I get an error. Does anyone know why?

If I replace the zvols with ordinary files in the command it works (obviously), which leads me to think that a workaround might be to use a pair of normal (non-zvol) pools, create one file on each, and use those files as the vdevs.
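Something like this is what I have in mind (the sizes, file names and mountpoints below are just for illustration):

Code:
# create a backing file on each of the two existing (non-zvol) pools
truncate -s 100g /ZP-STRIPE-01/mirror-backing.img
truncate -s 100g /ZP-STRIPE-02/mirror-backing.img
# build the mirror pool on top of the two files
zpool create ZP-MIRROR-01 mirror /ZP-STRIPE-01/mirror-backing.img /ZP-STRIPE-02/mirror-backing.img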

TIA,
Scott
 
Quoting from the thread referenced above:

With regard to ZFS: I have a bunch of drives that add up to >22TB, and I have a single 22TB drive.

Am I right in understanding that you want to join the disks smaller than 22TB with it, and that you thought about doing it with a zvol? Regarding the error you refer to: if you read the thread below, you'll see I have tried the solution that Andriy provides there:

thread

vfs.zfs.vol.recursive=1
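Roughly, applying that would look something like this (using the pool and zvol names from your first post; I haven't re-tested it against your exact setup):

Code:
# allow ZFS pools to be built on top of zvols (add to /etc/sysctl.conf to persist)
sysctl vfs.zfs.vol.recursive=1
# retry creating the mirror on the two zvols
zpool create ZP-MIRROR-01 mirror /dev/zvol/ZP-STRIPE-01/ZP-STRIPE-01_zvol /dev/zvol/ZP-STRIPE-02/ZP-STRIPE-02_zvol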

My FreeBSD version is 14.0 and I've been able to create a mirror pool across two zvols with no problems. But going back to what I said at the beginning: have you tried partitioning the 22TB drive and playing with the sizes of the others?

An example:

Code:
truncate -s 1g disk0
truncate -s 500m disk1 disk2
# attach the backing files as memory disks
mdconfig -a -t vnode -f disk0 -u md0
mdconfig -a -t vnode -f disk1 -u md1
mdconfig -a -t vnode -f disk2 -u md2
# split the "large" 1GB disk into two 500MB partitions
gpart create -s gpt md0
gpart add -t freebsd-zfs -s 500m md0
gpart add -t freebsd-zfs -s 500m md0
# mirror each partition of the large disk against one of the small disks
zpool create test mirror /dev/md0p1 /dev/md1 mirror /dev/md0p2 /dev/md2

Code:
# zpool status test
  pool: test
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            md0p1   ONLINE       0     0     0
            md1     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            md0p2   ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors

To put the example in context: imagine you have a 1GB disk and two 500MB disks and want to pool them all together, but you can't mirror the 1GB disk directly against the 500MB disks because of the size mismatch. A solution is to create two partitions on the larger disk and mirror each one against one of the smaller disks, as shown above. It could be a problem, though, because two vdevs end up residing on the same disk.
 
I'd proceed with a little caution. There's a history of problems with using zvols as vdevs and with vfs.zfs.vol.recursive=1. Maybe it's fixed. Hopefully somebody with a better understanding than mine can comment.

However, if you want to make a concatenation, I would recommend using gconcat(8) to construct the provider. It works well, and it also allows you to avoid sub-optimal configurations (like multiple providers on a single physical spindle).
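For example, something along these lines (the disk names are invented; adjust for your hardware, and see gconcat(8) for the details):

Code:
# label a persistent concatenation of the small disks; this creates /dev/concat/smalljbod
# (add geom_concat_load="YES" to /boot/loader.conf so it is assembled at boot)
gconcat label -v smalljbod da1 da2 da3
# mirror the single large disk against the concatenated provider
zpool create tank mirror da0 concat/smalljbod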
 
Thanks freejlr, interesting approach. It never occurred to me to split the larger drive rather than join the smaller ones. So I'd end up with a JBOD of mirrors instead of a mirror of JBODS. Same same, nice.

My aversion to using gconcat, as gpw928 and others have suggested, was just that I can't do nice ZFS things like pre-emptively replacing a drive in the concat. Your suggestion of breaking up the large drive, on the other hand, lets me simply resilver from the large drive in case one of the smaller drives fails. The gconcat method would mean breaking the mirror, breaking the concat, replacing the drive, and then building everything back up again.
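With your test layout, for instance, I think replacing a failing small disk would be as simple as something like this (the replacement device name is just a placeholder):

Code:
# swap out md1 and resilver its half of mirror-0 from the surviving side
zpool replace test md1 /dev/<new-disk>
# watch the resilver progress
zpool status test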
 
So I'd end up with a JBOD of mirrors instead of a mirror of JBODS. Same same, nice.
There ain't no such thing as a free lunch. You have one physical spindle involved in provisioning each side of a mirror. This might be OK with solid state media, but on conventional disks the head contention would be frantic, and seek times would be stratospheric. This is exactly what I meant by "suboptimal configurations". Not at all nice.
 
Neither configuration I'm considering has a single spindle on both sides of a mirror. Unless I'm seeing it wrong.

If I mirror one large disk with one concatenation of many smaller disks it's one spindle on one side, many on the other.
If I mirror many partitions of one large disk with many smaller disks again it's one spindle on one side (of many mirrors) and many on the other.

If ZFS distributes writes to many vdevs when writing to a JBOD then the second case is awful, but if it writes to one vdev until it's full and then moves on to the next, it's not so bad. Do you know how ZFS writes to a JBOD? Does it assume multiple spindles and distribute activity hoping for parallel operations?
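I suppose I could build the md(4)-backed test pool from your example, copy some data onto it, and watch where the writes actually land:

Code:
# per-vdev I/O statistics for the test pool, refreshed every 5 seconds
zpool iostat -v test 5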
 
I got the single spindle issue wrong. Apologies.

It's usually considered sub-optimal to place two related vdevs on the same physical spindle.

You are proposing to create a stripe of several mirrors. ZFS is going to try to spread writes across all of those mirrors, causing a lot of head movement on the single large spindle (which participates in every mirror).
 
In which case the first option (mirror one spindle with a concat of many others) seems the best way forward. ZFS won't see the JBOD, and will assume a single pair of spindles.
But then of course losing a single disk in the gconcat JBOD is a giant pain... I'm trading off performance/wear with RTO.
 
But then of course losing a single disk in the gconcat JBOD is a giant pain...
The impact is not significantly different to losing one large disk (apart from having to re-make the concat).

However you will have to live with the fact that the MTBF of the concat is reduced with each extra spindle it uses.
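As a rough illustration (the failure rates are invented purely for the arithmetic): if each member disk fails independently with annual probability p, the concat as a whole fails with probability 1 - (1 - p)^n. With p = 0.03 and n = 4 disks, that's 1 - 0.97^4 ≈ 0.11, nearly four times the risk of the single large disk it mirrors.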
 
From the zpool-create(8) manual page:

Creates a new storage pool containing the virtual devices specified on the command line. The pool name must begin with a letter, and can only contain alphanumeric characters as well as the underscore ("_"), dash ("-"), colon (":"), space (" "), and period ("."). The pool names mirror, raidz, draid, spare and log are reserved, as are names beginning with mirror, raidz, draid, and spare. The vdev specification is described in the "Virtual Devices" section of zpoolconcepts(7).
 