ZFS Scrubbing gave errors for one disk but smartctl long test is OK

nxjoseph · Saturday at 1:08 PM

Hi. Yesterday I ran a zpool scrub for the first time and got a "too many errors" fault on one disk. The pool is called tank and used to be a 3-disk mirror. I removed the disk that gave errors, so the pool now has 2 disks and sufficient replicas. Today I ran a long self-test with smartctl and got no errors, so why did the scrub report errors? Yesterday I was not even able to run smartctl -a on the disk (it failed with an input/output error), but after a reboot the problem was gone. Thanks in advance.

Code:

# smartctl -l selftest /dev/ada2
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.2-RELEASE amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5557         -
# 2  Extended offline    Aborted by host               90%      5555         -
# 3  Short offline       Completed without error       00%      5555         -
# 4  Short offline       Completed without error       00%      3863         -
# 5  Short offline       Completed without error       00%      2284         -
# 6  Short offline       Completed without error       00%         0         -

adorno · Saturday at 2:04 PM

nxjoseph said:
Yesterday I was not even able to run smartctl -a on the disk (it failed with an input/output error), but after a reboot the problem was gone. Thanks in advance.

In hindsight, it would have been interesting to keep the faulting disk attached to the pool and see what a scrub reports now, after the reboot.

Given the relatively young age of the disk (at least in operating hours), the problem might actually be related to the cable or connectors. Depending on system load and the resulting temperature changes, an improperly seated (or otherwise faulty) connector can cause intermittent errors. So for now I'd check all connectors and maybe replace the cable to the Toshiba drive. After some additional checks, I might actually feel lucky enough to reattach the drive to the pool.
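
One way to look for evidence of a cable or connector problem is the drive's interface CRC error counter. The snippet below is a minimal sketch, assuming the drive reports SMART attribute ID 199 (usually named UDMA_CRC_Error_Count; the exact name varies by vendor) and that `smartctl -A` prints the raw value in the tenth column. The function name `crc_error_count` is made up for illustration.

```shell
#!/bin/sh
# Print the raw value of SMART attribute 199 (interface CRC errors) from
# `smartctl -A` output read on stdin. Prints nothing if the drive does not
# report that attribute. Assumption: the raw value is the 10th column.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Intended use on the live system (device name taken from the thread):
#   smartctl -A /dev/ada2 | crc_error_count
```

A non-zero value that keeps growing between checks would point at the link rather than the platters, while a value of 0 does not rule out a loose connector.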

freejlr · Saturday at 3:33 PM

Judging by your scrub, nothing was repaired. It would have helped to post the output of:

zpool status tank -v

On the other hand, your disk shows up as FAULTED, not DEGRADED. It seems it was inaccessible to your pool.

It might be worth running a zpool clear to verify that this is not a stale, earlier error.

ralphbsz · Sunday at 3:10 AM

Could it be that the problem is not the disk hardware, nor the communication with it, but the content? Perhaps some things on the disk were overwritten by something else, and scrub had to fix incorrect content?

Look in /var/log/messages for disk IO error messages.

I have no idea how to ask ZFS scrub what it found. Is there a way?

nxjoseph · Sunday at 5:13 PM

/var/log/messages:

Code:

Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 d8 b8 47 40 19 00 00 08 00 00
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 c8 d8 c0 47 40 19 00 00 07 00 00
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a0 c8 47 40 19 00 00 08 00 00
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 82 00 40 00 00 00 00 00 00
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 7c 42 40 25 00 00 00 00 00
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 7e 42 40 25 00 00 00 00 00
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 00 a0 d0 47 40 19 00 00 08 00 00
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:29 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:29 hale ZFS[1563]: vdev I/O failure, zpool=tank path=/dev/gpt/toshiba0 offset=270336 size=8192 error=5
Jan  3 22:34:29 hale ZFS[2955]: vdev I/O failure, zpool=tank path=/dev/gpt/toshiba0 offset=320041656320 size=8192 error=5
Jan  3 22:34:29 hale ZFS[3995]: vdev I/O failure, zpool=tank path=/dev/gpt/toshiba0 offset=320041918464 size=8192 error=5
Jan  3 22:34:29 hale ZFS[4404]: vdev probe failure, zpool=tank path=/dev/gpt/toshiba0
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): CAM status: ATA Status Error
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 04 (ABRT )
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): RES: 41 04 00 00 00 40 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): Retrying command, 0 more tries remain
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): CAM status: ATA Status Error
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): ATA status: 41 (DRDY ERR), error: 04 (ABRT )
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): RES: 41 04 00 00 00 40 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Retries exhausted
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): Synchronize cache failed
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 af ea 42 40 25 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 af ea 42 40 25 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 ea 42 40 25 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae ea 42 40 25 00 00 00 00 00
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): CAM status: Auto-Sense Retrieval Failed
Jan  3 22:34:30 hale kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
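
For a long log like the one above, it can help to tally the failures per device instead of reading every line. This is a rough sketch, assuming the kernel lines keep the `(device:bus:...): Error 5, ...` shape shown here; `cam_error_summary` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Count "Error 5" kernel lines per CAM device in syslog text read on stdin.
# Matches lines such as:
#   ... kernel: (ada2:ahcich4:0:0:0): Error 5, Unretryable error
cam_error_summary() {
    awk '/kernel: \(/ && /\): Error 5,/ {
             dev = $0
             sub(/^.*kernel: \(/, "", dev)   # drop everything up to "("
             sub(/:.*$/, "", dev)            # keep the device name, e.g. ada2
             count[dev]++
         }
         END { for (d in count) print d, count[d] }' | sort
}

# Intended use on the live system:
#   cam_error_summary < /var/log/messages
```

If only one device shows up, the fault is more likely local to that disk, its cable, or its port than to the controller as a whole.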

I attached /dev/ada2 (a.k.a. gpt/toshiba0) back to the tank pool. I wonder if I should run # zpool scrub again.

Code:

# zpool status -v tank
  pool: tank
 state: ONLINE
  scan: resilvered 257G in 01:09:34 with 0 errors on Sun Jan  5 15:15:40 2025
remove: Removal of vdev 2 copied 1.63G in 0h0m, completed on Sun Sep 15 20:54:16 2024
        5.53K memory used for removed device mappings
config:

        NAME              STATE     READ WRITE CKSUM
        tank              ONLINE       0     0     0
          mirror-0        ONLINE       0     0     0
            gpt/wdc0      ONLINE       0     0     0
            gpt/hgst0     ONLINE       0     0     0
            gpt/toshiba0  ONLINE       0     0     0

errors: No known data errors
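
A fresh scrub after the resilver is a reasonable way to confirm that the reattached disk reads back cleanly. The sketch below assumes the installed OpenZFS supports `zpool scrub -w` (wait for completion) and `zpool events`; the function names `status_is_clean` and `rescrub_and_check` are invented for illustration.

```shell
#!/bin/sh
# Succeed if `zpool status` output on stdin reports no known data errors and
# every READ/WRITE/CKSUM counter in the config table is 0.
status_is_clean() {
    awk '
        /^errors: No known data errors/ { no_data_errors = 1 }
        # Config rows look like: NAME STATE READ WRITE CKSUM [notes]
        NF >= 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ {
            if ($3 != "0" || $4 != "0" || $5 != "0") dirty = 1
        }
        END { exit !(no_data_errors && !dirty) }
    '
}

# Scrub the given pool, wait for it to finish, then check the result.
rescrub_and_check() {
    pool=$1
    zpool scrub -w "$pool" || return 1
    zpool status -v "$pool" | status_is_clean
}

# Intended use on the live system (pool name taken from the thread):
#   rescrub_and_check tank && echo "scrub clean" || zpool events -v tank
```

`zpool events -v` lists the individual I/O and checksum events ZFS recorded since boot, which is the closest thing to asking the scrub what it found.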

cracauer@ · Sunday at 5:17 PM

Bad cable or connection.

fmc000 · Sunday at 5:26 PM

Also check for dust and/or metal flakes inside the connectors (don't ask me why I'm suggesting this...)

Phishfry · Sunday at 6:58 PM

Either cosmic debris or a solar flare flipped a bit in RAM. Are you using ECC memory?

www.spaceweather.live
