Hello,
I've been fighting this issue for a while and haven't found the root cause or a solution/workaround. If nobody on this forum can help, I will open a bug ticket. I can also provide any troubleshooting data needed.
My setup:
FreeBSD storage server with ZFS ZVOL -> iSCSI (istgt) -> ESXi host -> VMFS filesystem -> VMDK file (VM HDD) -> FreeBSD VM installed on ZFS.
The issue occurs only on FreeBSD VMs with very high disk load (sometimes more than 10 file reads per second). One is a Samba file server; the other is web content storage for a website with more than 50,000 visits per day. One system runs FreeBSD 10.1, the other FreeBSD 10.3.
Current status:
The physical FreeBSD storage server shows no errors of any kind:
Code:
  pool: iscsi
 state: ONLINE
  scan: scrub repaired 0 in 14h39m with 0 errors on Fri Apr 22 13:20:10 2016
config:
        NAME        STATE     READ WRITE CKSUM
        iscsi       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
One of the affected VMs shows CKSUM and error 122 errors (the counters were cleared this morning with 'zpool clear'):
Code:
root@web_storage1:~ # zpool status -xv web_pool
  pool: web_pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 52.7G in 2h44m with 3 errors on Thu Oct 27 13:09:43 2016
config:
        NAME        STATE     READ WRITE CKSUM
        web_pool    ONLINE       0     0   713
          mirror-0  ONLINE       0     0 2.79K
            da1     ONLINE       0     0 2.79K
            da2     ONLINE       0     0 2.79K  # (This mirror disk was attached to the pool for investigation purposes; please read ahead for its story)
errors: Permanent errors have been detected in the following files:
        web_pool/trud:<0x0>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-16-00h14:<0x0>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-02-00h14:<0x0>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-02-00h14:<0x18e96>
        web_pool/trud@zfs-auto-snap_weekly-2016-10-02-00h14:<0x18e99>
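The CKSUM column above mixes plain integers (713) with abbreviated values (2.79K), which makes it awkward to compare readings taken at different times. Below is a minimal Python sketch for turning such a table into numbers and computing growth between two readings. The function names are illustrative, and treating "K" as a multiple of 1024 is an assumption about how the tool abbreviates counters (it is exposed as the `base` parameter so it can be changed).

```python
import re

# Exponent applied to `base` for each suffix; no suffix means a plain integer.
SUFFIX_EXPONENT = {"": 0, "K": 1, "M": 2, "G": 3, "T": 4}


def parse_count(text, base=1024):
    """Convert a counter such as '713' or '2.79K' to an approximate integer.

    ASSUMPTION: abbreviated values are multiples of powers of `base`.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text)
    if match is None:
        raise ValueError(f"unrecognised counter: {text!r}")
    number, suffix = match.groups()
    return round(float(number) * base ** SUFFIX_EXPONENT[suffix])


def parse_vdev_counters(status_text):
    """Return {device: {'read': n, 'write': n, 'cksum': n}} from the config table."""
    counters = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True
            continue
        if not in_table:
            continue
        # A blank line or a new 'label:' section (e.g. 'errors:') ends the table.
        if len(fields) < 5 or fields[0].endswith(":"):
            break
        name, _state, read, write, cksum = fields[:5]  # trailing notes are ignored
        counters[name] = {
            "read": parse_count(read),
            "write": parse_count(write),
            "cksum": parse_count(cksum),
        }
    return counters


def cksum_growth(before, after):
    """Per-device increase in CKSUM errors between two parsed readings."""
    return {
        name: values["cksum"] - before.get(name, {"cksum": 0})["cksum"]
        for name, values in after.items()
    }


# Config table copied from the output in this post.
SAMPLE = """\
config:
NAME STATE READ WRITE CKSUM
web_pool ONLINE 0 0 713
mirror-0 ONLINE 0 0 2.79K
da1 ONLINE 0 0 2.79K
da2 ONLINE 0 0 2.79K
errors: Permanent errors have been detected in the following files:
"""

if __name__ == "__main__":
    for device, values in parse_vdev_counters(SAMPLE).items():
        print(device, values["cksum"])
```

Because abbreviated values are rounded by the tool, small increases on a device that already shows "2.79K" may not be visible; clearing the counters before each test keeps them in the exact range.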
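Besides the errors reported in the snapshots, more than 10 files are marked with err 122; they are completely inaccessible and cannot be deleted, moved, replaced, or otherwise touched.
Investigation performed: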
I replaced each component in the storage path one at a time and watched whether the errors reappeared. The troubleshooting steps were as follows:
1. Migrate the VM to local ESXi storage (a RAID controller with a VMFS filesystem on the volume). This bypasses iSCSI, the network, and the physical FreeBSD storage server with its ZFS and disks.
Result: Still CKSUM errors.
2. Create a ZVOL on the physical FreeBSD storage system, create a LUN, and attach it to the VM using ESXi RDM (Raw Device Mapping). This bypasses the ESXi VMFS and caches.
Result: CKSUM error counter still growing.
3. Detach the RDM disk and configure the FreeBSD software iSCSI initiator (iscsid) to connect to the same LUN, this time through the FreeBSD software iSCSI initiator instead of the VMware disk controller. This bypasses the VM's hardware storage controller.
Result: CKSUM errors still growing.
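The elimination logic behind the three tests can be written down as set intersection: whatever appears in every configuration that still produced errors remains a suspect. The sketch below is only an illustration in Python. The component names come from this post, but how each test's I/O path is listed is an interpretation (for example, that the VM still runs on the ESXi host in every test), not something measured.

```python
# Each entry lists the components assumed to be in the I/O path of a
# configuration that still showed CKSUM errors. These groupings are an
# interpretation of the post, not verified facts.
PATHS_WITH_ERRORS = {
    "original setup": {
        "storage server ZFS and disks", "iSCSI target (istgt)", "network",
        "ESXi host", "VMFS", "VMDK", "VM storage controller", "guest ZFS",
    },
    "1. local ESXi storage": {
        "local RAID controller", "ESXi host", "VMFS", "VMDK",
        "VM storage controller", "guest ZFS",
    },
    "2. RDM": {
        "storage server ZFS and disks", "iSCSI target (istgt)", "network",
        "ESXi host", "RDM", "VM storage controller", "guest ZFS",
    },
    "3. guest iSCSI initiator": {
        "storage server ZFS and disks", "iSCSI target (istgt)", "network",
        "ESXi host", "guest iSCSI initiator (iscsid)", "guest ZFS",
    },
}


def common_components(paths):
    """Components present in every configuration that showed errors."""
    return set.intersection(*paths.values())


def bypassed_components(paths):
    """Components that were absent from at least one failing configuration."""
    return set.union(*paths.values()) - common_components(paths)


if __name__ == "__main__":
    print("in every failing configuration:",
          sorted(common_components(PATHS_WITH_ERRORS)))
    print("bypassed at least once:",
          sorted(bypassed_components(PATHS_WITH_ERRORS)))
```

Under these assumptions only the ESXi host and the ZFS layer inside the guest are common to all failing configurations. A component that was bypassed once is shown not to be required for the errors; that is weaker than proving it is fault-free.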
I have never experienced this issue on physical storage systems, but I don't think I have ever had one under such high disk load. Some of my physical storage systems may see comparable load, but there I use ZVOLs rather than ZFS filesystems.
If anybody has an idea how to troubleshoot this, please share it.
If not, I will open a bug ticket, because I'm out of options.
Thank you in advance.
Edit: bug ticket submitted: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=213915