ZFS boot failures on recent kernel builds

I've been tracking -STABLE for some years, but a couple of my recent builds fail to boot from ZFS. The most recent kernel that boots for me is stable/14-n270866-3cdf03dbfeff.

Two specific examples of kernels that fail to boot (for me, at least) are stable/14-n270959-54a94356c90e and stable/14-n271006-4c07ee6a5eaf. Both give me an error I haven't seen before, pcib10: Power Fault Detected, followed by the infamous:

Code:
Solaris: NOTICE: Cannot find the pool label for 'zroot'
Mounting ... failed with error 5 ...

This thread from 2023 partly resembles my issue, as I am also running on a Dell Precision T5820. However, my system has been running flawlessly up through approximately stable/14-n270866-3cdf03dbfeff. Only newer kernels fail to boot.

When booting n271006 for example, the point at which things seem to go wrong is:

Code:
pcm3: <NVIDIA ... > at nid 5 on hdaa1
pcm4: <NVIDIA ... > at nid 6 on hdaa1
pcm5: <NVIDIA ... > at nid 7 on hdaa1

Here:

Code:
pcib10: Power Fault Detected
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
ugen0.1: <Intel XHCI root HUB> at usbus0
uhub0 numa-domain 0 on usbus0
uhub0: <Intel XHCI root HUB, class ....>

[nda0 is my boot pool device:]

Code:
nda0 at nvme0 bus 0 scbus 9 target 0 lun 1
[normal healthy entries for nda0]
[ada0 and ada1 hold a secondary pool, normal entries here]

Code:
Trying to mount root from zfs:zroot/ROOT/default []...
nvme0: failing outstanding i/o

Then there are eight pairs of:

Code:
nvme0: ASYNC_EVENT_REQUEST (0c) ....
nvme0: ABORTED - BY REQUEST (00/07) ....

followed by:

Code:
nda0 at nvme0 bus 0 scbus 9 target 0 lun 1
nda0: <model # NVMe TOSHIBA 512GB ...>

The fatal blow is:

Code:
(nda0:nvme0:0:0:1): Periph destroyed
nvme0: detached
pci10: detached
Solaris: NOTICE: Cannot find the pool label for 'zroot'
Mounting ... failed with error 5 ...

Pardon the fish-eye in these pictures, but I'll include them in case they show pertinent details that I've omitted above.

What is triggering the pcib10: Power Fault Detected and the subsequent destruction of the nda0/nvme0 device nodes?

As indicated earlier, reverting to a boot of kernel stable/14-n270866-3cdf03dbfeff results in a successful boot with no problems or pool errors.

Thanks in advance for any assistance!
 

Attachments

  • kernel-power-fault-detected.jpg
  • kernel-periph-destroyed.jpg

Welp, it looks like maybe in the process of chasing it down, I chased it away. I have a good build of n271007 and I'll keep testing that.

Bonus is, I kind of taught myself how to use git bisect.
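
For anyone wanting to repeat the hunt, the bisect can be sketched roughly as below. This is a minimal outline, not necessarily the exact steps used in this thread: the source location /usr/src, the GENERIC kernel config, and the kernel.test install name are assumptions, and the helper function names are made up for illustration. The two hashes are the good and bad kernels quoted in the first post.

```sh
#!/bin/sh
# Sketch only: bisect stable/14 between a known-good and a known-bad kernel.
# Assumptions (not from the thread): source tree in /usr/src, GENERIC config,
# test kernels installed as /boot/kernel.test.
SRC="${SRC:-/usr/src}"
KERNCONF="${KERNCONF:-GENERIC}"
GOOD=3cdf03dbfeff   # stable/14-n270866, boots fine
BAD=54a94356c90e    # stable/14-n270959, cannot find the zroot pool label

# Hypothetical helper: start the bisect; git checks out a midpoint commit.
bisect_begin() {
    git -C "$SRC" bisect start "$BAD" "$GOOD"
}

# Hypothetical helper: build the checked-out commit and install it beside
# the working kernel, so /boot/kernel stays bootable.
bisect_build() {
    make -C "$SRC" -j"$(sysctl -n hw.ncpu)" buildkernel KERNCONF="$KERNCONF" &&
    make -C "$SRC" installkernel KERNCONF="$KERNCONF" INSTKERNNAME=kernel.test
}

# Hypothetical helper: after test-booting kernel.test, record the verdict;
# git then checks out the next candidate commit.
bisect_mark() {
    case "$1" in
        good|bad) git -C "$SRC" bisect "$1" ;;
        *) echo "usage: bisect_mark good|bad" >&2; return 2 ;;
    esac
}

# Hypothetical helper: once git names the first bad commit, restore the
# original checkout.
bisect_end() {
    git -C "$SRC" bisect reset
}
```

The loop is: bisect_begin once, then bisect_build, boot the test kernel (for example by choosing it in the loader menu), and bisect_mark good or bad, repeating until git reports the first bad commit. With roughly 93 commits between n270866 and n270959, that should take on the order of seven builds.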
 
A slightly more systematic attempt at bisecting (and better testing of the resulting kernel build) shows the problematic commit to be this seemingly small patch from back in late February. Applying that as a reverse patch results in a kernel that can recognize and boot from the pool on my NVMe drive.
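
The reverse-patch test can be sketched like this. The commit hash is deliberately left as an argument, since it isn't quoted in the text here; /usr/src and the function name are assumptions for illustration only.

```sh
#!/bin/sh
# Sketch only: back out one suspect commit in the working tree.
SRC="${SRC:-/usr/src}"   # assumption: location of the stable/14 checkout

# Hypothetical helper. $1 is the hash of the suspect commit (a placeholder;
# the actual hash is not given in the text of this thread).
reverse_apply() {
    if [ -z "$1" ]; then
        echo "usage: reverse_apply <commit-hash>" >&2
        return 2
    fi
    # Dry run first: --check reports whether the reverse patch applies
    # cleanly without touching any files. Only then apply it for real.
    git -C "$SRC" show "$1" | git -C "$SRC" apply -R --check &&
    git -C "$SRC" show "$1" | git -C "$SRC" apply -R
}
```

After reverse_apply succeeds, the kernel is rebuilt and installed as usual. The change lives only in the working tree, so git diff shows exactly what was backed out.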
 