ZFS zpool keeps rebooting the server

I used the same cable but moved ada4 to a free (unused) port and moved ada5 to where ada3 was; in the ZFS output this looks like a swap between ada2 and ada4. Fortunately there was no data loss and I can copy/move files across without problems (so far it looks good).

I recently moved the machine to a closed rack that is reaching ~40°C (probably too hot).
A bad cable can cause a hard hang or even a crash (panic): arithmetic performed on faulty metadata in memory (corrupted by a bad cable or hardware) may produce a pointer into non-existent memory, resulting in a kernel page fault.

Crash (panic) when the bogus dereference points at memory that is not there. Hard hang when the bogus dereference points at memory that does exist, fooling the kernel into thinking a lock is held when it is not. (I have had that happen before, on FreeBSD and on a mainframe. It is hard to debug because a reset destroys all diagnostic information in memory.)
 
Now I can see:


Code:
# zpool status -x
all pools are healthy

But I had to open the case and guess what to plug/unplug since unfortunately I couldn't get any log/message. I'm wondering what can be done to get a dump when having root on ZFS.
 
My opinion only:
"getting a dump when having root on ZFS"
I always create a swap partition ("gpart add -t freebsd-swap"). Given current drive sizes (cost per byte) I don't worry about carving out a 4G swap partition on a 1T device.

One can always swap to a file, but with memory pressure and a swap file on ZFS you get a positive feedback loop that spirals into a reboot.
Yes, you typically can't retrofit a swap partition, but if you can add a new device, even USB, and make it swap, you may be able to get that information (see the sketch below).
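
A minimal sketch of that, assuming the spare device shows up as da1 and using a 4G partition labelled dumpswap (the device name, size and label are assumptions, adjust to your hardware):

Code:
# gpart create -s gpt da1                          # one-time: put a GPT on the spare disk
# gpart add -t freebsd-swap -s 4G -l dumpswap da1  # 4G swap partition with a stable label
# swapon /dev/gpt/dumpswap                         # start using it as swap
# dumpon /dev/gpt/dumpswap                         # direct kernel crash dumps to it

To make it persistent, add dumpdev="/dev/gpt/dumpswap" to /etc/rc.conf and a line like "/dev/gpt/dumpswap none swap sw 0 0" to /etc/fstab; after a panic, savecore(8) copies the dump into /var/crash on the next boot.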
 
My memory is 32GB, and my swap partition is the same size. [Just in case it's needed]
It's been a long time since I've looked at how much swap space needs to be configured based on RAM, but old time recommendations were 1-2X physical RAM which is in line with this. But does that hold true with current kernels/versions? I'm asking because I honestly don't know. But 32G on a 1T drive is still "in the noise", so maybe it really doesn't matter.
 
Back in the day the swap partition recommendation for most UNIX systems was 4x physical RAM. About 15-20 years ago it was updated to 1x-1.5x-2x depending on the application, because more UNIX systems served apps that were sensitive to waiting for pages to be brought back from disk (web servers, etc.) and fewer UNIX systems were used for time sharing like they were in the past. For systems running a DBMS, 1x or no swap whatsoever. Any paging will kill database performance.

Machines with faster processors today can suffer significant performance hits due to paging because disks (including SSD and NVMe) haven't enjoyed the same performance improvement that CPUs and RAM have over the years. Ideally one wants to keep active paging to a minimum.

Sending pages to disk and never referencing them again, for instance MBs or GBs of swap used but little or no paging I/O, is acceptable. When I was an IBM mainframe systems programmer, IBM's rule of thumb was that anything over 5% of system resources spent servicing page-ins and page-outs is too much and RAM needs to be upgraded. On Solaris and FreeBSD I like to keep the scan rate below 200 pages per second. Avoid page-outs; five or ten per second sustained over a long period should be the maximum. Multiple swap partitions across multiple disks across multiple controllers will help performance on memory-constrained systems (a sample fstab layout is sketched below).
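
A minimal sketch of that in /etc/fstab; the device names (ada0p2, ada1p2) are assumptions, the point is that the partitions sit on different disks/controllers so the kernel can interleave paging I/O across them:

Code:
# /etc/fstab -- two swap partitions on separate disks (device names are assumptions)
/dev/ada0p2   none   swap   sw   0   0
/dev/ada1p2   none   swap   sw   0   0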

Page-ins are another matter, since the system uses paging I/O to read program binaries into memory via mmap(). If you see page reclaims (re in vmstat), the machine is short of RAM, but before the pages could be written out to swap they were referenced again by an application, making them active; the machine is on the cusp of needing a RAM upgrade -- order additional RAM now.

Never worry about free pages (fre in vmstat). The free pool has a range the kernel is happy with. When fre drops below a threshold the scan rate increases, looking for pages to place on the page-out queue, and a scan rate greater than 200 pages per second suggests you may need to install more RAM.
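
A quick way to watch those counters, as a sketch (the 5-second interval is arbitrary):

Code:
$ vmstat -w 5     # watch the re, pi, po, sr and fre columns over 5-second intervals
$ swapinfo -h     # how much of each swap device is actually in use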

Higher scan rates mean younger pages in memory (IBM reported the oldest page age, the inverse of the scan rate). Lower scan rates mean old pages will reside in RAM longer -- there is plenty of RAM. The ZFS ARC and the UFS UBC compete with processes for pages. Memory not used by applications will be used by the UFS UBC and the ZFS ARC as cache to reduce re-reads of the same data off disk, greatly improving performance. Paging and its underlying hardware (dynamic address translation) is not necessarily a straightforward subject. All operating systems attempt to manage pages similarly using an LRU (least recently used) algorithm, each implemented differently by their developers.

Adding RAM is probably the best investment a person can make for any computer system.
 
but old time recommendations were 1-2X physical RAM which is in line with this. But does that hold true with current kernels/versions?
Same issue here. I took twice the size of RAM too in the past, but 32GB of swap on a 128GB SSD is tricky when working with large audio files. It is one of the reasons I'm doing a fresh install of 13.2 instead of a freebsd-update. I let the installer decide on the size of the swap partition this time; it came up with 4GB. I suppose we had better forget that twice-RAM recommendation.
 
cy@ Thanks for that info. I've always maxed out RAM on my systems, but having a swap partition available for kernel dumps is the main reason I keep one, so that's my use case for swap sizing.
Tieks 2-4G seems to be the range the installer creates, so I'm guessing that is roughly the size needed for kernel dumps.
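
For what it's worth, that fits the default behaviour: FreeBSD writes minidumps containing only kernel pages, which is normally far smaller than physical RAM, so a few GB of swap is usually enough to hold one. A quick check, as a sketch:

Code:
$ sysctl debug.minidump     # 1 = minidumps enabled (kernel pages only)
# dumpon -l                 # list the device(s) currently configured for crash dumps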
 
Thanks all for your time and comments, excellent learnings.

After >12 hours the machine is back in the rack and there is no issue so far. On a bhyve VM (where I run poudriere) I had to run:
Code:
# zpool scrub zroot
# zpool clear zroot
# zpool scrub zroot
# zpool status -x
all pools are healthy
and all is working

This is the layout for my current disk:
Code:
$ gpart show
=>        40  1953525088  nvd0  GPT  (932G)
          40      532480     1  efi  (260M)
      532520        1024     2  freebsd-boot  (512K)
      533544         984        - free -  (492K)
      534528  1952989184     3  freebsd-zfs  (931G)
  1953523712        1416        - free -  (708K)

I'm wondering if there is a way to "shrink" the freebsd-zfs partition to add a 4GB swap partition without reinstalling the OS. On another note, how could this be monitored/forecast from metrics? Any advice on using telegraf/prometheus (or similar) for this?
 
After some days I started to have the same problem. I reinstalled (latest 13.2), this time using a swap size of 128GB, and this time I got a crash dump:


panic: Solaris(panic): zfs: attempting to increase fill beyond max; probable double add in segment [0:787f7000]


The content of info.0:

Code:
Dump header from device: /dev/nvd0p3
  Architecture: amd64
  Architecture Version: 2
  Dump Length: 4681056256
  Blocksize: 512
  Compression: none
  Dumptime: 2023-06-11 11:44:47 +0000
  Hostname: home
  Magic: FreeBSD Kernel Dump
  Version String: FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC
  Panic String: Solaris(panic): zfs: attempting to increase fill beyond max; probable double add in segment [0:787f7000]
  Dump Parity: 2582495081
  Bounds: 0
  Dump Status: good

I can import the pool in read-only mode and so far there are no crashes:

Code:
zpool import -o readonly=on tank

It seems related to https://github.com/openzfs/zfs/issues/13483

How could I share the dump (4681056256 bytes, Jun 11 13:47, vmcore.0, ~4.4GB), cy@, cracauer@?
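
One way to make it easier to hand over, as a sketch (not something confirmed in this thread): generate the much smaller text summary with crashinfo(8), which is often enough for a first look, and compress the vmcore with zstd from base before uploading. Keep in mind a vmcore contains raw kernel memory, so treat it as sensitive.

Code:
# crashinfo -n 0                     # writes /var/crash/core.txt.0, a text summary of dump 0
# zstd -19 -T0 /var/crash/vmcore.0   # produces vmcore.0.zst, much smaller to transfer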

The output of: kgdb /boot/kernel/kernel /var/crash/vmcore.0


Code:
panic: Solaris(panic): zfs: attempting to increase fill beyond max; probable double add in segment [0:787f7000]
cpuid = 20
time = 1686483887
KDB: stack backtrace:
#0 0xffffffff80c53dc5 at kdb_backtrace+0x65
#1 0xffffffff80c06741 at vpanic+0x151
#2 0xffffffff80c065e3 at panic+0x43
#3 0xffffffff82164bcb at vcmn_err+0xeb
#4 0xffffffff8224f549 at zfs_panic_recover+0x59
#5 0xffffffff8222b10a at range_tree_adjust_fill+0x29a
#6 0xffffffff8222b5a4 at range_tree_add_impl+0x204
#7 0xffffffff82216517 at scan_io_queue_insert_impl+0xa7
#8 0xffffffff82215b43 at dsl_scan_scrub_cb+0xa63
#9 0xffffffff82217f3c at dsl_scan_visitbp+0x4cc
#10 0xffffffff82217e96 at dsl_scan_visitbp+0x426
#11 0xffffffff822180f3 at dsl_scan_visitbp+0x683
#12 0xffffffff82217e96 at dsl_scan_visitbp+0x426
#13 0xffffffff82217e96 at dsl_scan_visitbp+0x426
#14 0xffffffff82217e96 at dsl_scan_visitbp+0x426
#15 0xffffffff82217e96 at dsl_scan_visitbp+0x426
#16 0xffffffff82217e96 at dsl_scan_visitbp+0x426
#17 0xffffffff82217dae at dsl_scan_visitbp+0x33e
Uptime: 5m11s
Dumping 4464 out of 130941 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55      /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
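
A few kgdb commands that help pull details out of the dump for the bug report, as a sketch (these are standard gdb/kgdb commands, not specific to this crash):

Code:
(kgdb) bt              # backtrace of the panicking thread, with source lines if available
(kgdb) info threads    # list all kernel threads captured in the dump
(kgdb) frame 8         # select a frame from the backtrace, e.g. dsl_scan_scrub_cb
(kgdb) info locals     # inspect local variables in that frame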

I created this issue: https://github.com/openzfs/zfs/issues/14973
 