ZFS VMware VM lots of CKSUM errors and sometimes even 'unknown error 122'

gnoma · Oct 27, 2016

Hello,

I've been fighting this issue for a while and haven't found the root cause or a solution/workaround. If nobody on this forum can help me, I will open a bug ticket. I can also provide any troubleshooting data.

My setup:

Storage FreeBSD server with ZFS ZVOL -> iSCSI (istgt) -> ESXi host -> VMFS filesystem -> VMDK file (VM HDD) -> FreeBSD VM installed over ZFS.

The issue happens only on FreeBSD VMs with very high disk load (sometimes more than 10 file reads per second). One of them is a Samba file server; the other is web content storage for a website with more than 50,000 visits per day. One system runs FreeBSD 10.1, the other FreeBSD 10.3.

Current status:

The physical FreeBSD storage server shows no errors of any kind:

Code:

  pool: iscsi
 state: ONLINE
  scan: scrub repaired 0 in 14h39m with 0 errors on Fri Apr 22 13:20:10 2016
config:

   NAME                             STATE     READ WRITE CKSUM
   iscsi                            ONLINE       0     0     0
     raidz1-0                       ONLINE       0     0     0
       ada1                         ONLINE       0     0     0
       ada2                         ONLINE       0     0     0
       ada3                         ONLINE       0     0     0
       ada4                         ONLINE       0     0     0

On one of the troubled VMs with CKSUM and 122 errors (counters cleared this morning with 'zpool clear'):

Code:

root@web_storage1:~ # zpool status -xv web_pool
  pool: web_pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
   corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
   entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 52.7G in 2h44m with 3 errors on Thu Oct 27 13:09:43 2016
config:

   NAME               STATE     READ WRITE CKSUM
   web_pool           ONLINE       0     0   713
     mirror-0         ONLINE       0     0 2.79K
       da1            ONLINE       0     0 2.79K
       da2            ONLINE       0     0 2.79K  # (This mirror disk was attached to the pool for investigation purposes; see below for its story)

errors: Permanent errors have been detected in the following files:

        web_pool/trud:<0x0>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-16-00h14:<0x0>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-02-00h14:<0x0>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-02-00h14:<0x18e96>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-02-00h14:<0x18e99>

Apart from the errors reported in the snapshots, there are more than 10 files marked with error 122; they are completely inaccessible and impossible to delete, move, replace, or do anything else with.

Investigation performed:

I have been replacing the components of the storage stack one at a time while monitoring whether the errors reappear. The troubleshooting steps so far:

1. Migrate the VM to local ESXi storage (RAID controller with a VMFS filesystem on the volume). This bypasses iSCSI, the network, and the physical FreeBSD storage server with its ZFS and disks.
Result: Still CKSUM errors

2. Create a ZVOL on the physical FreeBSD storage system, create a LUN, and attach it to the VM using ESXi RDM (Raw Device Mapping). This bypasses the ESXi VMFS and caches.
Result: CKSUM error counter still growing

3. Detach the RDM disk, configure the FreeBSD software iSCSI initiator (iscsid), and connect to the same LUN, this time through the FreeBSD software iSCSI initiator rather than the VMware disk controller. This bypasses the VM's virtual storage controller.
Result: CKSUM errors still growing
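
For anyone who wants to repeat the "watch the CKSUM counter" part of the tests above, here is a minimal sketch of how the counters could be read from `zpool status` output and compared between two runs. The function names (`parse_count`, `cksum_counters`, `grown`) are made up for illustration, and the sketch assumes the abbreviated counters such as `2.79K` use powers of 1024, so suffixed values are only approximate.

```python
# Multipliers for abbreviated counters such as "2.79K".
# Assumption: the suffixes are powers of 1024.
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a counter like '713' or '2.79K' to an int (approximate if suffixed)."""
    text = text.strip()
    if text and text[-1].upper() in _SUFFIXES:
        return int(round(float(text[:-1]) * _SUFFIXES[text[-1].upper()]))
    return int(text)


def cksum_counters(status_output):
    """Return {device_name: cksum_count} from the config table of `zpool status`."""
    counters = {}
    in_config = False
    for line in status_output.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # header row of the config table
            continue
        if not in_config:
            continue
        if len(fields) < 5:
            if counters:
                break  # blank line after the table ends it
            continue
        # Columns: NAME STATE READ WRITE CKSUM (anything after is ignored)
        counters[fields[0]] = parse_count(fields[4])
    return counters


def grown(before, after):
    """Return the devices whose CKSUM counter increased between two samples."""
    return {name: after[name] - before.get(name, 0)
            for name in after if after[name] > before.get(name, 0)}
```

In practice the text would come from running `zpool status web_pool` before and after a period of load, and `grown(before, after)` would then list the devices whose counters moved.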

I have never experienced this issue on physical storage systems, but I don't think I have ever had a physical system under such high disk load. Maybe some of my physical storage systems do see such load, but there I use ZVOLs instead of ZFS filesystems.

If anybody has an idea how to troubleshoot this, please share it.
If not, I will open a bug ticket, because I'm out of options.

Thank you in advance.

Edit: bug ticket submitted: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213915
