[Solved] Strange ZFS status

This is a strange one:

I've recently reinstalled FreeBSD 14.0 on an old, non-mission critical PowerEdge R710.

Two zpools were created, zsystem and zdata.

Since the old BIOS doesn't give me a JBOD option, the pools were created with megaraid virtual devices all attached to the same controller.

zsystem is a zpool created from a single virtual device (2 disks in RAID1), seen as mfid0 (465 GB).

zdata is a zpool striped across two RAID0 virtual devices, vd1 and vd2, created from 4 physical disks, seen as mfid1 and mfid2 (3.6 TB each).
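For context, here is roughly how these pools would have been created on top of the megaraid virtual disks. This is only a minimal sketch; the device names are the ones from this box, and a real setup may partition the data disks first:

Code:
# zsystem on the ZFS partition of the RAID1 virtual disk
zpool create zsystem /dev/mfid0p3
# zdata striped across the two RAID0 virtual disks
zpool create zdata /dev/mfid1 /dev/mfid2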

One of the virtual devices is degraded, and the result is a FAULTED pool.

But this is where nothing makes sense anymore:

Code:
zdata      FAULTED  corrupted data
     mfid2p3   ONLINE
     mfid2p3   FAULTED  corrupted data

Note that it's the same device, listed as both ONLINE and FAULTED.

Trying to import the pool obviously doesn't work:
Code:
> zpool import -f zdata
cannot import 'zdata': one or more devices is currently unavailable
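For anyone else stuck at this point, these are the usual things worth trying before giving up; treat them as suggestions rather than a guaranteed fix, since -F rewinds the pool to an earlier transaction group and can drop the most recent writes:

Code:
# scan /dev explicitly and list what ZFS believes is importable
zpool import -d /dev
# attempt a read-only import to copy data off without writing to the pool
zpool import -o readonly=on -f zdata
# last resort: rewind to an earlier txg (may discard recent writes)
zpool import -F -f zdata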

gpart makes things worse:

Code:
> gpart show
=>       40  975699888  mfid0  GPT  (465G)
         40       1024      1  freebsd-boot  (512K)
       1064        984         - free -  (492K)
       2048    8388608      2  freebsd-swap  (4.0G)
    8390656  967307264      3  freebsd-zfs  (461G)
  975697920       2008         - free -  (1.0M)

=>        40  7812939696  mfid2  GPT  (3.6T)
          40        1024      1  freebsd-boot  (512K)
        1064         984         - free -  (492K)
        2048     4194304      2  freebsd-swap  (2.0G)
     4196352  7808741376      3  freebsd-zfs  (3.6T)
  7812937728        2008         - free -  (1.0M)

=>        40  7812939696  mfid1  GPT  (3.6T)
          40        1024      1  freebsd-boot  (512K)
        1064         984         - free -  (492K)
        2048     4194304      2  freebsd-swap  (2.0G)
     4196352  7808741376      3  freebsd-zfs  (3.6T)
  7812937728        2008         - free -  (1.0M)

mfid0 is where the OS lives, and its partition layout makes sense (the system boots).
But mfid1 and mfid2 don't make sense: those were wiped clean and added to the zpool, never partitioned as shown.
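One way to check whether that partition table is stale leftovers or something ZFS actually uses is to compare the on-disk ZFS labels with what gpart reports. All of these commands are read-only, and the device names are the ones from the output above:

Code:
# dump the ZFS labels (pool name, GUIDs, vdev layout) from the raw virtual disks
zdb -l /dev/mfid1
zdb -l /dev/mfid2
# and from the partition that gpart claims exists
zdb -l /dev/mfid2p3
# full GPT detail for the suspect disk
gpart list mfid1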

The real twist comes when one examines the boot log (dmesg):
Code:
mfid0 on mfi0
mfid0: 476416MB (975699968 sectors) RAID volume 'Virtual Disk0' is degraded
mfid1 on mfi0
mfid1: 3814912MB (7812939776 sectors) RAID volume 'vd1' is optimal
mfid2 on mfi0
mfid2: 3814912MB (7812939776 sectors) RAID volume 'vd0' is optimal

So according to this, the degraded virtual disk is the one where zsystem lives, and therefore I should not be able to boot the OS!

And this shouldn't be happening either, should it?

Code:
host :: / » zpool status 
  pool: zsystem
 state: ONLINE
  scan: scrub repaired 0B in 00:01:34 with 0 errors on Fri Mar  1 03:51:28 2024
config:

    NAME        STATE     READ WRITE CKSUM
    zsystem     ONLINE       0     0     0
     mfid0p3   ONLINE       0     0     0

errors: No known data errors

I know mixing megaraid and ZFS is a bad idea, but this is on another level. Does anyone have any idea what is going on?

Any help is much appreciated.
 
What do the mfiutil show commands tell you?

Caveat: I don't have any of these controllers here, but from reading the man page you should be able to show the adapter, drive, and volume status. man mfiutil should tell you more.

Looking at the devices using the LSI controller's BIOS should give you the status of the volumes and devices -- assuming the LSI controller's BIOS provides the same functions HPE controllers do.

Also what do the dev.mfi sysctls show?
 
You might try zpool status -g, which will use guids rather than current / old names for devices. Avoids some confusion in cases where some devices have moved and some appear missing.
 
You might try zpool status -g, which will use guids rather than current / old names for devices. Avoids some confusion in cases where some devices have moved and some appear missing.
I would have tried that too, but since the pool is not imported, it's no good.
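What can still be done against a non-imported pool is to list it and read the vdev GUIDs straight from the on-disk labels, something along these lines:

Code:
# list importable pools and the state of their vdevs without importing anything
zpool import
# read the ZFS label of the suspect partition; it shows pool_guid and the vdev guid
zdb -l /dev/mfid2p3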
 
What do the mfiutil show commands tell you?

Caveat: I don't have any of these controllers here, but from reading the man page you should be able to show the adapter, drive, and volume status. man mfiutil should tell you more.

Looking at the devices using the LSI controller's BIOS should give you the status of the volumes and devices -- assuming the LSI controller's BIOS provides the same functions HPE controllers do.

Also what do the dev.mfi sysctls show?

Good suggestion, thanks!

Code:
centaurus :: ~ » sudo sysctl dev.mfi.
dev.mfi.0.keep_deleted_volumes: 0
dev.mfi.0.delete_busy_volumes: 0
dev.mfi.0.%parent: pci3
dev.mfi.0.%pnpinfo: vendor=0x1000 device=0x0079 subvendor=0x1028 subdevice=0x1f17 class=0x010400
dev.mfi.0.%location: slot=0 function=0 dbsf=pci0:3:0:0
dev.mfi.0.%driver: mfi
dev.mfi.0.%desc: Dell PERC H700 Integrated
dev.mfi.%parent:
centaurus :: ~ »

Interestingly enough, mfiutil agrees with dmesg but not with ZFS:

Code:
centaurus :: ~ » mfiutil show config
/dev/mfi0 Configuration: 3 arrays, 3 volumes, 0 spares
    array 0 of 2 drives:
        drive  0 (  466G) ONLINE <SEAGATE ST3500414SS KS68 serial=9WJ1JVZK> SAS
        drive MISSING
    array 1 of 1 drives:
        drive  3 ( 3726G) ONLINE <ST4000NM0033-9ZM GA0A serial=S1Z184SH> SATA
    array 2 of 1 drives:
        drive  2 ( 3726G) ONLINE <ST4000NM0033-9ZM GA0A serial=S1Z184NZ> SATA
    volume mfid0 (465G) RAID-1 64K DEGRADED <Virtual Disk0> spans:
        array 0
    volume mfid1 (3726G) RAID-0 64K OPTIMAL <vd1> spans:
        array 1
    volume mfid2 (3726G) RAID-0 64K OPTIMAL <vd0> spans:
        array 2

Note the MISSING drive, a second SEAGATE ST3500414SS KS68 SAS. The pool zsystem should be DEGRADED because one of the drives is missing. But for some reason, ZFS is reporting something else entirely. How do I fix this?
 
Don't use nested RAIDs.

Your zsystem is built on top of mfid0, so ZFS can't detect the degraded status of mfid0 because it doesn't know that one of its drives is missing. That's why it's not degraded as far as ZFS is concerned. To fix it you need to replace the missing drive in array 0 (the mirror partner of the ONLINE drive, s/n 9WJ1JVZK) and rebuild mfid0.

Your zdata is gone if it's built as a stripe from mfid1+mfid2.
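A rough sketch of what the replace-and-rebuild step might look like with mfiutil(8) once a new disk is slotted in. The drive ID used below is an assumption (check mfiutil show drives for the real one), and it's worth reading man mfiutil before touching anything:

Code:
# identify the newly inserted disk and note its drive ID
mfiutil show drives
# add it as a hot spare dedicated to the degraded volume; the controller
# should then rebuild the RAID1 on its own (drive ID 1 is assumed here)
mfiutil add 1 mfid0
# watch the rebuild
mfiutil show progress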
 
Don't use nested RAIDs.

Your zsystem is built on top of mfid0, so ZFS can't detect the degraded status of mfid0 because it doesn't know that one of its drives is missing. That's why it's not degraded as far as ZFS is concerned. To fix it you need to replace the missing drive in array 0 (the mirror partner of the ONLINE drive, s/n 9WJ1JVZK) and rebuild mfid0.

Your zdata is gone if it's built as a stripe from mfid1+mfid2.
I will remember that.
Thanks for clarifying it.
 
Don't use nested RAIDs.

Your zsystem is built on top of mfid0, so ZFS can't detect the degraded status of mfid0 because it doesn't know that one of its drives is missing. That's why it's not degraded as far as ZFS is concerned. To fix it you need to replace the missing drive in array 0 (the mirror partner of the ONLINE drive, s/n 9WJ1JVZK) and rebuild mfid0.

Your zdata is gone if it's built as a stripe from mfid1+mfid2.

I wouldn't say don't use nested RAIDs. At $JOB we use ZFS on top of hardware RAID5 (on Solaris). As far as ZFS is concerned, the devices are simply JBOD, but beneath them is hardware RAID. It's not really nested RAID because ZFS is configured to treat the devices as JBOD while the SAN's RAID5 handles error recovery and replacement of damaged devices. The Solaris Team manages the servers while the Enterprise Storage Team manages the storage. Devices may be replaced by the storage team unbeknownst to the sysadmin teams (Solaris, Linux, VMware, Windows).
 
Since the old BIOS doesn't give me a JBOD option, the pools were created with megaraid virtual devices

Ah yes, upon reading this I knew where we were headed.

Don't do that.

Rip out the RAID controller and replace it with an HBA in IT mode. You can even get ones that are cross-flashed with the correct PCI ID so that you can use them in the storage controller slot without the machine complaining.

Source: I run R710s with this exact configuration.
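Once the disks are exposed directly (with an IT-mode HBA they typically attach as plain da devices), ZFS can own the redundancy itself. A minimal sketch, with the da device names being assumptions:

Code:
# let ZFS handle the mirroring for the system pool
zpool create zsystem mirror da0 da1
# and give zdata actual redundancy instead of two striped RAID0 volumes
zpool create zdata raidz da2 da3 da4 da5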
 