Hi,
I have an issue with Seagate ST10000NM0016 drives sporadically refusing to work. Eight of them are attached to a 9207-8i controller and assembled into a RAIDZ2. At unpredictable intervals, the drives throw errors like this:
Code:
(da6:mps2:0:45:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 365 command timeout cm 0xfffffe0001006f10 ccb 0xfffff801668ce800
(noperiph:mps2:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001006f10
mps2: Sending reset from mpssas_send_abort for target ID 45
(da6:mps2:0:45:0): WRITE(16). CDB: 8a 00 00 00 00 01 8c 17 f6 68 00 00 00 08 00 00 length 4096 SMID 783 terminated ioc 804b scsi 0 state c xfer 0
mps2: Unfreezing devq for target ID 45
(da6:mps2:0:45:0): WRITE(16). CDB: 8a 00 00 00 00 01 8c 17 f6 68 00 00 00 08 00 00
(da6:mps2:0:45:0): CAM status: CCB request completed with an error
(da6:mps2:0:45:0): Retrying command
(da6:mps2:0:45:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps2:0:45:0): CAM status: Command timeout
(da6:mps2:0:45:0): Retrying command
(da6:mps2:0:45:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps2:0:45:0): CAM status: SCSI Status Error
(da6:mps2:0:45:0): SCSI status: Check Condition
(da6:mps2:0:45:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps2:0:45:0): Error 6, Retries exhausted
(da6:mps2:0:45:0): Invalidating pack
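To make the failing WRITE(16) easier to read, here is a minimal sketch that decodes its CDB, assuming the standard 16-byte READ(16)/WRITE(16) layout (opcode in byte 0, big-endian LBA in bytes 2-9, transfer length in blocks in bytes 10-13). The helper name `decode_rw16` is made up for illustration, and the 512-byte block size is taken from the logical sector size the drive reports.

```python
def decode_rw16(cdb_hex):
    """Split a READ(16)/WRITE(16) CDB into opcode, LBA and block count."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex skips the spaces between bytes
    if len(cdb) != 16 or cdb[0] not in (0x88, 0x8A):
        raise ValueError("not a READ(16)/WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length in blocks
    return cdb[0], lba, blocks


# CDB copied from the WRITE(16) line in the log above.
opcode, lba, blocks = decode_rw16("8a 00 00 00 00 01 8c 17 f6 68 00 00 00 08 00 00")
# Assumption: 512-byte logical sectors, as reported by camcontrol identify.
print(hex(opcode), hex(lba), blocks, blocks * 512)
```

For this CDB the transfer length is 8 blocks, which at 512 bytes per block matches the "length 4096" in the log, and the LBA is 0x18c17f668. So this looks like an ordinary 4 KiB write rather than anything unusual in size or alignment.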
The SCSI opcodes I have seen fail are WRITE(16), READ(16), and SYNCHRONIZE CACHE(10). The problem occurs with every drive I have and does not seem to correlate with load: a 3 TB resilver typically completes without issues, while a light sequential read may trigger an error.
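To put numbers on which drives and opcodes are affected, something like the following could tally the CAM lines from the kernel log. It is only a sketch: the regular expression is written against the message format shown above, the function name `tally_opcodes` is made up, and it counts every logged line that carries a CDB (retries included), not distinct failures.

```python
import re
from collections import Counter

# Matches CAM lines that carry a CDB, for example:
# "(da6:mps2:0:45:0): WRITE(16). CDB: 8a 00 ..."
CDB_LINE = re.compile(r"\((da\d+):mps\d+:\d+:\d+:\d+\): ([A-Z ]+\(\d+\))\. CDB:")


def tally_opcodes(lines):
    """Count (device, opcode) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CDB_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed log location): python tally.py < /var/log/messages
    for (device, opcode), n in sorted(tally_opcodes(sys.stdin).items()):
        print(f"{device:6} {opcode:24} {n}")
```

Run over a few weeks of logs, a table like this should show whether the errors cluster on particular drives, ports, or opcodes, or are spread evenly across all eight disks.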
The same setup worked for several years with WD disks with nary a hiccup.
Can someone recommend next steps for me to try for debugging?
Further details about my system:
Code:
# uname -a
FreeBSD ... 11.0-STABLE FreeBSD 11.0-STABLE #0 r321665+c0805687fec(freenas/11.0-stable): Tue Sep 5 16:07:24 UTC 2017 root@gauntlet:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/_BE/os/sys/FreeNAS.amd64 amd64
Code:
# camcontrol identify da3
pass3: <ST10000NM0016-1TT101 SNB0> ACS-3 ATA SATA 3.x device
pass3: 600.000MB/s transfers, Command Queueing Enabled
protocol ATA/ATAPI-10 SATA 3.x
device model ST10000NM0016-1TT101
firmware revision SNB0
serial number ...
WWN ...
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 19532873728 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM 7200
Code:
# lspci -vvv
....
05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
Subsystem: LSI Logic / Symbios Logic 9207-8i SAS2.1 HBA
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 16
Region 0: I/O ports at b000
Region 1: Memory at fb2b0000 (64-bit, non-prefetchable)
Region 3: Memory at fb2c0000 (64-bit, non-prefetchable)
Expansion ROM at fb300000 [disabled]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [d0] Vital Product Data
Not readable
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f000