ZFS multiple new SAS drives faulting on raidz3 pool

We have a set of 12 disks in a raidz3 pool that will not stay online. The pool degrades with multiple drives faulting, and the disks throw sector errors in the kernel log at the same time. These are relatively new Seagate 18TB Exos X18 SAS drives (a few months old) connected to an LSI SAS9300-8e. Firmware is up-to-date on the SAS card, and SMART shows no problems on these drives. We have several other pools connected to the same SAS card in the same JBOD with no issues.

To try to eliminate a backplane/cabling/SAS card issue, we've moved the pool of disks to a new headnode that has a different SAS card and a different JBOD. We've also tried a 3rd headnode/SAS card/JBOD and hit the same issue: the disks still fault with sector errors. We have another headnode with an LSI SAS9305-16e card that we could try moving this pool of disks to.

Any ideas here? It could obviously just be a problem with the disks, but it's odd that they all throw sector errors and fault at the same time.



Code:
[1845440.353235] blk_update_request: I/O error, dev sdbt, sector 35156637696 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845440.354582] blk_update_request: I/O error, dev sdbt, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[1845440.355782] blk_update_request: I/O error, dev sdbt, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845440.356967] blk_update_request: I/O error, dev sdbv, sector 35156637696 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845440.358121] blk_update_request: I/O error, dev sdbv, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
[1845440.359243] blk_update_request: I/O error, dev sdbv, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845470.353142] blk_update_request: I/O error, dev sdbn, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845500.354534] blk_update_request: I/O error, dev sdbn, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845530.355940] blk_update_request: I/O error, dev sdbn, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845560.357254] blk_update_request: I/O error, dev sdbn, sector 35156637696 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845591.366774] blk_update_request: I/O error, dev sdbr, sector 0 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
[1845621.363131] blk_update_request: I/O error, dev sdbr, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845651.364488] blk_update_request: I/O error, dev sdbr, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845681.365922] blk_update_request: I/O error, dev sdbr, sector 35156637696 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845715.366022] blk_update_request: I/O error, dev sdbt, sector 0 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
[1845745.367435] blk_update_request: I/O error, dev sdbt, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845775.368848] blk_update_request: I/O error, dev sdbt, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845805.370227] blk_update_request: I/O error, dev sdbt, sector 35156637696 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845835.373079] blk_update_request: I/O error, dev sdbv, sector 0 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[1845865.374526] blk_update_request: I/O error, dev sdbv, sector 0 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[1845895.375920] blk_update_request: I/O error, dev sdbv, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845925.377237] blk_update_request: I/O error, dev sdbv, sector 35156637696 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845959.374128] blk_update_request: I/O error, dev sdbx, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1845989.375519] blk_update_request: I/O error, dev sdbx, sector 0 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[1846019.376916] blk_update_request: I/O error, dev sdbx, sector 2048 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[1846049.378375] blk_update_request: I/O error, dev sdbx, sector 35156637696 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
 
Multiple errors at the same time? Probably not the individual drives.

In the logs (dmesg and /var/log/messages), look for the more detailed SCSI error messages, which contain the ASC/ASCQ cause of the error; then we can decode that.
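Something along these lines should pull the full sense blocks out of the logs (adjust the device names to the disks that are faulting; sdbt is just taken from your output above):

Code:
# full SCSI sense blocks (FAILED Result / Sense Key / Add. Sense / CDB) from the kernel ring buffer
dmesg -T | grep -B1 -A2 'Sense Key'
# or from the persistent log, for one specific disk
grep -A3 'sdbt.*FAILED Result' /var/log/messages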
 
When mentioning JBOD units, I'm assuming you are referring to separate (4U or similar) disk shelf units that are connected via an external SAS cable (one, or even more?) to the LSI SAS9300-8e (or an equivalent other SAS card). In the shelves there must be port expanders. Crazy idea perhaps: are those port expanders compatible with those new (big) Exos* drives? Perhaps a SAS-2 versus SAS-3 incompatibility, though that should normally mix and match into a working ensemble.

Edit: given that you mention moving the new disks to other JBODs and connecting them to other external SAS cards, it looks like you are testing a new installation for approval towards production. If you have the option of connecting the new drives internally and directly (i.e., no expander) to a SAS card, do the same errors occur? If you're trying to confirm that the new disks themselves are at fault (I'm presuming they were ordered as one batch, and that batch might possibly be bad), they should exhibit the same errors there as in the external setups; probably even a small set of the new drives in a RAIDZ2/3 setting would already be enough to provoke the same errors (see the sketch below).
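A throwaway pool for that test could be as simple as the following (just a sketch; the by-id names are placeholders for a few of the new drives cabled directly to the HBA):

Code:
# small raidz2 test pool on four of the new drives, no expander in the path
zpool create -f testpool raidz2 \
    /dev/disk/by-id/scsi-35000c500dXXXXXX1 \
    /dev/disk/by-id/scsi-35000c500dXXXXXX2 \
    /dev/disk/by-id/scsi-35000c500dXXXXXX3 \
    /dev/disk/by-id/scsi-35000c500dXXXXXX4
# load it with data for a while, then scrub and watch for the same read/write errors
zpool scrub testpool
zpool status -v testpool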

* Looking at the Exos X18 specs, there are 5 different SKUs; the three SAS SKUs are 12Gb/s, which would be SAS-3. I also noticed footnote 2 there (for two of the three SAS SKUs):
2 Self-Encrypting Drives (SED) and FIPS 140-3 Validated drives available through franchised authorized distributors. May require TCG-compliant host or controller support.
Unfortunately, if you have those SKUs, I have no idea how to assess that information or relate it to your set-up.
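The exact model number/SKU should at least be visible in the drives' inquiry data, e.g. something like this (assuming smartmontools or sg3_utils is installed):

Code:
# vendor, product/model number and firmware revision from the drive's inquiry data
smartctl -i /dev/sdbt
# or the raw SCSI INQUIRY
sg_inq /dev/sdbt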
 
In theory, any expander should work with any drive and with any HBA. In practice, slight incompatibilities can kill that quickly. That happens in particular during error handling. So one plausible theory is that one of the JBODs has a slight hardware problem (could be as boring as a bad contact on the power connection), leading to communication errors, which are then not retried correctly but escalated to the host.

About SED: I think those ship preconfigured with a single unlocked zone and should just work. It's very unlikely that a normal person has SED or FIPS drives, as they are only available through special channels, and hopefully they don't show up on the used market.

But the important thing: Without better error information, it's hard to debug this. The kernel logs usually have much more detail than what was shown here.
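If the default verbosity isn't enough, the mpt3sas driver has a logging_level parameter that can usually be raised at runtime (a sketch; the individual debug bits are defined in the driver's mpt3sas_debug.h, the mask below is just an example, and the setting reverts on reboot):

Code:
# current mpt3sas debug mask
cat /sys/module/mpt3sas/parameters/logging_level
# raise it to log more detail about failing commands (example mask; pick bits from mpt3sas_debug.h)
echo 0x3f8 > /sys/module/mpt3sas/parameters/logging_level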
 
I really appreciate all the help, ralphbsz and Erichans.
I added more verbose dmesg logs from the 1st system at the bottom of this post. The previous system was using external SAS to an external disk shelf unit. We moved the pool of drives to a different system that has both an internal SAS card and an external SAS card. The pool degraded after a bit of time, with only 1 of the drives faulting this time. After the weekend, I'll head back to the datacenter to check whether this faulted drive is connected to the external SAS.
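In case it saves a trip: the host number in the kernel message ("sd 7:0:90:0") should map back to one of the two HBAs from software, something like:

Code:
# which PCI device (and therefore which HBA) is SCSI host 7?
readlink -f /sys/class/scsi_host/host7
# compare the PCI address in that path against the lspci output below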

New system:
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
Subsystem: Broadcom / LSI 9207-8i SAS2.1 HBA
04:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Subsystem: Broadcom / LSI SAS9300-8e


Code:
[6712084.070357] sd 7:0:90:0: [sdat] tag#5697 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[6712084.070368] sd 7:0:90:0: [sdat] tag#5697 Sense Key : Aborted Command [current] [descriptor]
[6712084.070374] sd 7:0:90:0: [sdat] tag#5697 Add. Sense: Nak received
[6712084.070380] sd 7:0:90:0: [sdat] tag#5697 CDB: Read(16) 88 00 00 00 00 04 e4 19 a9 b0 00 00 02 60 00 00
[6712084.070383] blk_update_request: I/O error, dev sdat, sector 21006756272 op 0x0:(READ) flags 0x700 phys_seg 19 prio class 0
[6712084.071943] zio pool=backup2 vdev=/dev/disk/by-id/scsi-35000c500da0eca23-part1 error=5 type=1 offset=10755458162688 size=311296 flags=40080ca8


Code:
  pool: backup2
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 31.9T in 1 days 17:50:35 with 0 errors on Fri Oct 13 09:14:17 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        backup2                     DEGRADED     0     0     0
          raidz3-0                  DEGRADED     0     0     0
            scsi-35000c500d9e34b17  ONLINE       0     0    19
            scsi-35000c500da0ebcc3  ONLINE       0     0   104
            scsi-35000c500d9e39bab  ONLINE       0     0    19
            scsi-35000c500d9e2fc5b  ONLINE       0     0    19
            scsi-35000c500d9e3e70f  ONLINE       0     0    19
            scsi-35000c500d9e338cb  ONLINE       0     0   156
            scsi-35000c500d9e3b6e3  ONLINE       0     0    19
            scsi-35000c500d9fea02f  ONLINE       0     0    41
            scsi-35000c500da0eca23  DEGRADED    20     0   148  too many errors
            scsi-35000c500d9e383bb  ONLINE       0     0    99
            scsi-35000c500da0ec6cb  ONLINE       0     0    22
            scsi-35000c500d9ce2d6f  ONLINE       0     0    21



-- Verbose dmesg logs from the 1st system below.

Code:
[1706609.473813] zio pool=backup vdev=/dev/sdbv1 error=5 type=2 offset=13342811217920 size=49152 flags=40080c80
[1713182.744652] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[1713182.744675] sd 0:0:75:0: [sdbx] tag#6215 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[1713182.744681] sd 0:0:75:0: [sdbx] tag#6215 CDB: Write(16) 8a 00 00 00 00 06 67 a0 15 60 00 00 00 60 00 00
[1713182.744683] blk_update_request: I/O error, dev sdbx, sector 27508348256 op 0x1:(WRITE) flags 0x700 phys_seg 11 prio class 0
[1713182.745755] zio pool=backup vdev=/dev/sdbx1 error=5 type=2 offset=14084273258496 size=49152 flags=40080c80
[1713495.287302] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[1713495.287328] sd 0:0:73:0: [sdbv] tag#6236 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[1713495.287336] sd 0:0:73:0: [sdbv] tag#6236 CDB: Write(16) 8a 00 00 00 00 06 68 a1 07 50 00 00 00 40 00 00
[1713495.287339] blk_update_request: I/O error, dev sdbv, sector 27525187408 op 0x1:(WRITE) flags 0x700 phys_seg 6 prio class 0
[1713495.288803] zio pool=backup vdev=/dev/sdbv1 error=5 type=2 offset=14092894904320 size=32768 flags=40080c80
[1714276.126299] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[1714276.126342] sd 0:0:65:0: [sdbn] tag#6353 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[1714276.126347] sd 0:0:65:0: [sdbn] tag#6353 CDB: Write(16) 8a 00 00 00 00 06 74 f5 3a 00 00 00 03 20 00 00
[1714276.126348] blk_update_request: I/O error, dev sdbn, sector 27732032000 op 0x1:(WRITE) flags 0x700 phys_seg 88 prio class 0
[1714276.127645] zio pool=backup vdev=/dev/sdbn1 error=5 type=2 offset=14198799335424 size=409600 flags=40080c80
[1714401.392674] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[1714401.392697] sd 0:0:75:0: [sdbx] tag#6428 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[1714401.392705] sd 0:0:75:0: [sdbx] tag#6428 CDB: Write(16) 8a 00 00 00 00 06 73 da 0b 70 00 00 02 20 00 00
[1714401.392707] blk_update_request: I/O error, dev sdbx, sector 27713473392 op 0x1:(WRITE) flags 0x700 phys_seg 62 prio class 0
[1714401.394071] zio pool=backup vdev=/dev/sdbx1 error=5 type=2 offset=14189297328128 size=278528 flags=40080c80
[1714527.724254] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[1714527.724275] sd 0:0:55:0: [sdbd] tag#6483 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[1714527.724282] sd 0:0:55:0: [sdbd] tag#6483 CDB: Write(16) 8a 00 00 00 00 06 77 43 54 a0 00 00 02 20 00 00
[1714527.724285] blk_update_request: I/O error, dev sdbd, sector 27770705056 op 0x1:(WRITE) flags 0x700 phys_seg 66 prio class 0
[1714527.725905] zio pool=backup vdev=/dev/sdbd1 error=5 type=2 offset=14218599940096 size=278528 flags=40080c80
[1714645.122291] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[1714645.122311] sd 0:0:59:0: [sdbh] tag#6567 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[1714645.122320] sd 0:0:59:0: [sdbh] tag#6567 CDB: Write(16) 8a 00 00 00 00 06 79 18 10 b0 00 00 03 20 00 00
[1714645.122322] blk_update_request: I/O error, dev sdbh, sector 27801424048 op 0x1:(WRITE) flags 0x700 phys_seg 99 prio class 0
[1714645.123771] zio pool=backup vdev=/dev/sdbh1 error=5 type=2 offset=14234328064000 size=409600 flags=40080c80
[1719537.040475] mpt3sas_cm0: log_info(0x31120303): originator(PL), code(0x12), sub_code(0x0303)
[1719537.040490] sd 0:0:75:0: [sdbx] tag#7323 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=0s
[1719537.040494] sd 0:0:75:0: [sdbx] tag#7323 CDB: Write(16) 8a 00 00 00 00 06 bc 71 ea a0 00 00 04 60 00 00
 
Hmm. Hard to see what the root cause is here. The only hint I get is that the SCSI sense key is "aborted command", so the disk thinks something aborted the I/O it was executing. The ASC (= additional sense code) seems to be "NAK", which means negative acknowledge; that happens on the SAS interface. Sadly, this error message format has no ASCQ, although I'm not sure it would help much.

So for some reason the drive (or the HBA?) got a NAK on the communications link, which it then interpreted to mean "abort the command". Clearly, some communications problem. My first suggestion would be: Check all cables. After that, it gets really hard to debug.
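One thing that can narrow it down before reseating everything: both the drives and the HBA/expander phys keep link error counters, and whichever side is racking them up is usually closest to the bad connection. A rough sketch (assumes smartmontools and the SAS transport class sysfs attributes are available):

Code:
# per-drive SAS phy error counters (invalid dwords, disparity errors, lost dword sync)
smartctl -l sasphy /dev/sdat
# the same class of counters for every phy the kernel knows about (HBA and expander side)
grep . /sys/class/sas_phy/*/invalid_dword_count \
       /sys/class/sas_phy/*/running_disparity_error_count \
       /sys/class/sas_phy/*/loss_of_dword_sync_count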
 
Sorry to bring up an old thread, but I'm having exactly this issue.

Exos drives in a Supermicro 4U external chassis.
In my case I have 36 of these drives inside the Supermicro server (SuperServer), which have no issues, and 36 in a 4U expansion chassis, which is not happy.

I've spotted there's a jumper in the chassis to switch between SAS2 and SAS3, which I'm going to check out as soon as I can power down the drives.
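Before powering down, it might also be worth checking what link rate each phy actually negotiated on the unhappy chassis; a mismatch, or a link that keeps dropping and renegotiating, should show up here (a sketch, assuming the SAS transport class sysfs attributes are present):

Code:
# negotiated vs. maximum link rate for every SAS phy (HBA and expander ports)
grep . /sys/class/sas_phy/*/negotiated_linkrate /sys/class/sas_phy/*/maximum_linkrate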

Similar errors.

How did you sort this, shorton?

Code:
Feb 12 13:36:50 gss-zfs02 kernel: sd 5:0:2:0: [sdan] tag#8576 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 12 13:36:50 gss-zfs02 kernel: sd 5:0:2:0: [sdan] tag#8576 Sense Key : Aborted Command [current] [descriptor]
Feb 12 13:36:50 gss-zfs02 kernel: sd 5:0:2:0: [sdan] tag#8576 Add. Sense: Nak received
Feb 12 13:36:50 gss-zfs02 kernel: sd 5:0:2:0: [sdan] tag#8576 CDB: Read(16) 88 00 00 00 00 00 74 e7 25 48 00 00 01 08 00 00
Feb 12 13:37:05 gss-zfs02 zed[837741]: eid=39283 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6eec7af-part1 vdev_state=FAULTED
Feb 12 13:37:05 gss-zfs02 zed[837824]: eid=39284 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6eec7af-part1 vdev_state=DEGRADED
Feb 12 13:37:05 gss-zfs02 zed[837834]: eid=39285 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d745a793-part1 vdev_state=FAULTED
Feb 12 13:37:05 gss-zfs02 zed[837854]: eid=39286 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6eb364f-part1 vdev_state=FAULTED
Feb 12 18:31:00 gss-zfs02 kernel: sd 5:0:7:0: [sdas] tag#8720 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 12 18:31:00 gss-zfs02 kernel: sd 5:0:7:0: [sdas] tag#8720 Sense Key : Aborted Command [current] [descriptor]
Feb 12 18:31:00 gss-zfs02 kernel: sd 5:0:7:0: [sdas] tag#8720 Add. Sense: Nak received
Feb 12 18:31:00 gss-zfs02 kernel: sd 5:0:7:0: [sdas] tag#8720 CDB: Read(16) 88 00 00 00 00 03 93 b3 0d c8 00 00 00 30 00 00
Feb 12 21:00:39 gss-zfs02 kernel: sd 5:0:1:0: [sdam] tag#2200 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 12 21:00:39 gss-zfs02 kernel: sd 5:0:1:0: [sdam] tag#2200 Sense Key : Aborted Command [current] [descriptor]
Feb 12 21:00:39 gss-zfs02 kernel: sd 5:0:1:0: [sdam] tag#2200 Add. Sense: Nak received
Feb 12 21:00:39 gss-zfs02 kernel: sd 5:0:1:0: [sdam] tag#2200 CDB: Read(16) 88 00 00 00 00 06 4b 65 07 f0 00 00 02 a0 00 00
Feb 12 21:00:54 gss-zfs02 zed[3595061]: eid=40474 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6ec88b3-part1 vdev_state=FAULTED
Feb 12 21:00:54 gss-zfs02 zed[3595133]: eid=40475 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6ec88b3-part1 vdev_state=DEGRADED
Feb 12 21:00:54 gss-zfs02 zed[3595155]: eid=40476 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d745a793-part1 vdev_state=FAULTED
Feb 12 21:00:54 gss-zfs02 zed[3595170]: eid=40477 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6eb364f-part1 vdev_state=FAULTED
Feb 13 05:09:03 gss-zfs02 kernel: sd 5:0:33:0: [sdbr] tag#9287 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 13 05:09:03 gss-zfs02 kernel: sd 5:0:33:0: [sdbr] tag#9287 Sense Key : Aborted Command [current] [descriptor]
Feb 13 05:09:03 gss-zfs02 kernel: sd 5:0:33:0: [sdbr] tag#9287 Add. Sense: Nak received
Feb 13 05:09:03 gss-zfs02 kernel: sd 5:0:33:0: [sdbr] tag#9287 CDB: Read(16) 88 00 00 00 00 03 a0 a8 d7 10 00 00 01 b8 00 00
Feb 13 05:09:22 gss-zfs02 zed[3126012]: eid=41026 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d775bfe3-part1 vdev_state=FAULTED
Feb 13 05:09:22 gss-zfs02 zed[3126069]: eid=41027 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d745a793-part1 vdev_state=FAULTED
Feb 13 05:09:22 gss-zfs02 zed[3126082]: eid=41028 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6eb364f-part1 vdev_state=FAULTED
Feb 13 05:09:22 gss-zfs02 zed[3126100]: eid=41029 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d775bfe3-part1 vdev_state=DEGRADED
Feb 13 09:29:15 gss-zfs02 kernel: sd 5:0:31:0: [sdbp] tag#5943 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 13 09:29:15 gss-zfs02 kernel: sd 5:0:31:0: [sdbp] tag#5943 Sense Key : Aborted Command [current] [descriptor]
Feb 13 09:29:15 gss-zfs02 kernel: sd 5:0:31:0: [sdbp] tag#5943 Add. Sense: Nak received
Feb 13 09:29:15 gss-zfs02 kernel: sd 5:0:31:0: [sdbp] tag#5943 CDB: Read(16) 88 00 00 00 00 06 d5 bd 4c 58 00 00 00 c8 00 00
Feb 13 09:29:30 gss-zfs02 zed[972797]: eid=41327 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d78fd723-part1 vdev_state=FAULTED
Feb 13 09:29:30 gss-zfs02 zed[972912]: eid=41328 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d745a793-part1 vdev_state=FAULTED
Feb 13 09:29:30 gss-zfs02 zed[972942]: eid=41329 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d6eb364f-part1 vdev_state=FAULTED
Feb 13 09:29:30 gss-zfs02 zed[972970]: eid=41330 class=statechange pool='supermicro-c1' vdev=wwn-0x5000c500d78fd723-part1 vdev_state=DEGRADED
Feb 13 11:54:55 gss-zfs02 kernel: sd 5:0:18:0: [sdbd] tag#4298 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 13 11:54:55 gss-zfs02 kernel: sd 5:0:18:0: [sdbd] tag#4298 Sense Key : Aborted Command [current] [descriptor]
Feb 13 11:54:55 gss-zfs02 kernel: sd 5:0:18:0: [sdbd] tag#4298 Add. Sense: Nak received
Feb 13 11:54:55 gss-zfs02 kernel: sd 5:0:18:0: [sdbd] tag#4298 CDB: Read(16) 88 00 00 00 00 04 4a cd 8c c8 00 00 04 70 00 00
 