Random system freezes

Hi,

I just upgraded from 10.3-RELEASE to 11.0-RELEASE using freebsd-update a couple of days ago.
All was fine after the upgrade process, but about 3 hours later the system froze. No messages in the logs or on the console.

Since then it keeps freezing randomly, sometimes right after boot, sometimes later (it never stays up more than 2.5 or 3 hours).

This system is a file server with a couple of jails. It runs ZFS on root and little more. dmesg is attached.

I've tried running watchdogd, but it seems to be completely ignored. The system keeps hanging randomly.

As this is pretty obscure, I would like to ask the forum what the next steps are to get more information in order to solve the problem.

Cheers.
 

First, see if there is a newer BIOS available for your system. There are a couple warnings in your attachment that might be fixed by a BIOS upgrade (but which may not be related to the hang you're experiencing).

The next thing to do is to see if the system is stuck in an I/O wait (usually disk), hung with interrupts enabled, or hung with interrupts disabled.

Once the system boots, immediately start a task running on the console which continuously displays screen output, such as top(1) or systat(1). If that display keeps on updating when your system freezes, you're likely experiencing an I/O wait that isn't completing for some reason. A wait for disk I/O will prevent new processes from starting and possibly wedge existing processes if there's swapping going on.

If the program stops updating its screen output, use the Alt-F2 (or Alt-other-F-key) sequence to try to switch to a different console. If that works, interrupt processing is still working.

If the Alt key sequence has no effect, then the system is hung with (at least some) interrupts disabled.

Once you know which of those 3 states the system is in, it should be possible to suggest further debugging steps. The last is the hardest to deal with, as it is nearly impossible to get the system's attention in order to get into the debugger.
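
To pin down when a hang happens, a timestamped heartbeat written to disk can complement watching top(1). The following is only a minimal sketch: the function names and the log path are made up for illustration, and it assumes nothing beyond a POSIX sh with date(1) and sync(8).

```sh
#!/bin/sh
# Hypothetical helper: append one timestamped line to the given file and
# ask the system to flush it to disk, so the line survives a hard freeze.
heartbeat_once() {
    log_file=$1
    printf '%s alive\n' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$log_file"
    sync
}

# Hypothetical helper: write a heartbeat every $2 seconds (default 5)
# until interrupted with Ctrl-C.
heartbeat_loop() {
    log_file=$1
    interval=${2:-5}
    while :; do
        heartbeat_once "$log_file"
        sleep "$interval"
    done
}
```

Running something like `heartbeat_loop /var/log/heartbeat.log 5` from a root shell right after boot leaves a trail: after the next reset, the last line in the file gives the approximate time of the freeze. If the lines stop appearing while console switching still works, that would hint that the sync itself is blocked, which fits the I/O wait case.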

An alternative method is to build and boot a kernel with debugging options like WITNESS and INVARIANTS enabled and see if you get some sort of useful message. In my experience, you'll get a bunch of "red herring" messages that don't bear on the problem and the underlying problem usually turns into a Heisenbug when you are running a debug kernel.
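
For the debug-kernel route, a custom kernel config could look roughly like the sketch below. It assumes an amd64 machine with the matching sources in /usr/src. The config name DEBUGHANG and the helper function are hypothetical; the options themselves are standard ones documented in sys/conf/NOTES.

```sh
#!/bin/sh
# Hypothetical helper: write a debug kernel config to the given path.
# It layers the debugging options on top of the stock GENERIC config.
write_debug_kernconf() {
    conf_file=$1
    cat > "$conf_file" <<'EOF'
# Start from the stock kernel and add debugging options.
include GENERIC
ident DEBUGHANG
# Run-time consistency checks in the kernel.
options INVARIANTS
options INVARIANT_SUPPORT
# Lock order verification (skip spin mutexes to reduce overhead).
options WITNESS
options WITNESS_SKIPSPIN
# In-kernel debugger.
options DDB
EOF
}
```

A possible way to use it: `write_debug_kernconf /usr/src/sys/amd64/conf/DEBUGHANG`, then `cd /usr/src && make buildkernel KERNCONF=DEBUGHANG && make installkernel KERNCONF=DEBUGHANG` and reboot. Expect the system to run noticeably slower with WITNESS enabled.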
 
> First, see if there is a newer BIOS available for your system. There are a couple warnings in your attachment that might be fixed by a BIOS upgrade (but which may not be related to the hang you're experiencing).

Done. dmesg does not change much, but I have the latest BIOS firmware now.

> The next thing to do is to see if the system is stuck in an I/O wait (usually disk), hung with interrupts enabled, or hung with interrupts disabled.
>
> Once the system boots, immediately start a task running on the console which continuously displays screen output, such as top(1) or systat(1). If that display keeps on updating when your system freezes, you're likely experiencing an I/O wait that isn't completing for some reason. A wait for disk I/O will prevent new processes from starting and possibly wedge existing processes if there's swapping going on.
>
> If the program stops updating its screen output, use the Alt-F2 (or Alt-other-F-key) sequence to try to switch to a different console. If that works, interrupt processing is still working.
>
> If the Alt key sequence has no effect, then the system is hung with (at least some) interrupts disabled.
>
> Once you know which of those 3 states the system is in, it should be possible to suggest further debugging steps. The last is the hardest to deal with, as it is nearly impossible to get the system's attention in order to get into the debugger.

It's the last one :(
It's totally frozen.

I've tried something. I disabled all third-party daemons and started them one by one (waiting at least 3 hours between them) to see if one of them is responsible.
I've found that if net/syncthing is up, the system freezes randomly, but it works correctly without it. None of the other daemons (smb and a couple more) cause problems.

It's weird, because this software (the very same version) was working with 10.3-RELEASE. I've been using it for months now.
I'm not sure this is syncthing's fault, so I have to investigate further. I'll also let the port maintainer (and probably the authors) know about this.

Anyway, now I have a stable system again, but with fewer features than before :confused:
 
> Is the last one :(
> It's totally frozen.

Ok.

> I've tried something. I disabled all third party daemons and started them one by one (keeping at least 3 hours between them) to isolate and see if one of them is responsible.
> I've found that if net/syncthing is up, the system freezes randomly. But it works correctly without it. All the other daemons give no problems (smb, and a couple more).

I'm not familiar with that port (other than viewing the project's web page briefly). Is it possible to start it but not have it perform any tasks (for example, by eliminating parts of the configuration file or however it determines what to do)? It would be interesting to see if the problem arises from simply having the daemon started. It might be possible to further examine this if the daemon has various "tell me what you're going to do, but don't do it" type options, to see what part of its processing causes the issue.

> It's weird, because this software was working (the very same version) with 10.3-RELEASE. I've been using it for months now.
> I'm not sure this is syncthing's fault, but I have to investigate further. I'll also let the port maintainer (and probably the authors) know about this.

I see you filed PR 213953 on this. I added some commentary pointing people to this discussion and suggesting the PR may need to be re-categorized as base/kern if nothing obvious shows up in the port itself. I also subscribed to the PR, so I'll see followups from either place.
 
Some more info on this.
I've run memtest86 on this machine (as suggested by a member of the freebsd-es mailing list). Memory seems to be in perfect condition.
Also, I've tried to replicate the error on a VM with a fresh 11.0-RELEASE install, without success. On the VM everything seems to work fine.
That makes me suspect some change between 10 and 11 that affects the hardware I'm using.

I'm trying to make the server crash while the syncthing daemon is in verbose mode (at least as verbose as I can make it), but so far no useful data has turned up.

I'll post new data if I find something.
 
I'm experiencing similar symptoms on 2x HP DL120 G7 servers. I'll rule out RAM since they have RAM from different makers and in different quantities. BIOS version has been updated as well.

I cannot make the systems crash when idle. But if I start running jails, the hang can come right as a jail is started during system boot, or it may come 90 minutes later. If I stop the jails, I'm fine.

FreeBSD 11.0-Release-p8
ZFS root on internal SATA drives
py-iocage for the jail management.
 
Just chiming in to say that I'm experiencing a similar issue on a vpc running FreeBSD 11.0-Release-p8.

The hang seems to happen only during intense disk I/O; the latest was while applying patches with freebsd-update.

I'm now trying to disable `net/syncthing`, which is the only additional service I've added to the server since this started happening.
 
We have had the same experience. On our four servers, unexpected reboots occur after a day or two following an upgrade or clean install (FreeBSD 11.0-Release-p10). All servers are Supermicro machines with Xeon processors, Intel igb network cards, and ZFS root. The reboot frequency seems to depend on network traffic: the heavier the traffic, the more frequent the reboots. Each reboot looks as if the reset button had been pressed: no messages, the machine simply switches off and then reboots.
Very strange ...
 
We have the same problem. 11.2 #4
Using ezjail & zfs on root with another zpool for mass storage.
Crash happens about every 4 days.
Nothing relevant in the logs.
Server communicates constantly with IoT devices and records into databases on server.
Plenty of disk operations going on.
Thought it was ZFS eating all the RAM, so I capped it and still I get this crash.

It has also crashed when our software isn't running! Just while moving a file! Which is why we think ZFS is up to no good. Before upgrading from 11.1 we never had this problem.
 
I had a freezing problem on 11.2 using a mix of ZFS (root) and UFS (all other data disks). I did not find the culprit. I'm now running 12 (all UFS) on the same computer and 2 others, with no problems so far.
 
Tankist, have you watched how ZFS, by default, will eat all the RAM? Some have hailed it as a feature; pragmatic folks, myself included, just want stability.

Below are links on how to limit this behavior.
I really thought this was my solution, but I've had another crash since.
I'm very concerned that such a basic feature, the filesystem, can be so unstable by default.
https://forums.freebsd.org/threads/zfs-arc-max-depeding-on-available-ressources.63247/

https://forums.freebsd.org/threads/...-zfs-memory-usage-to-reasonable-limits.64445/

https://wiki.freebsd.org/ZFSTuningGuide
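
As a rough illustration of what those guides describe, the ARC cap is set with the `vfs.zfs.arc_max` tunable in /boot/loader.conf. The sketch below only computes such a line from the amount of physical memory; the helper name and the 25 percent figure are assumptions for the example, not a recommendation.

```sh
#!/bin/sh
# Hypothetical helper: print a loader.conf line that caps the ZFS ARC at
# the given percentage of physical memory.
#   $1 = physical memory in bytes (on FreeBSD: sysctl -n hw.physmem)
#   $2 = percentage of that memory the ARC may use
arc_max_line() {
    physmem_bytes=$1
    percent=$2
    arc_max=$(( physmem_bytes * percent / 100 ))
    printf 'vfs.zfs.arc_max="%s"\n' "$arc_max"
}
```

For example, `arc_max_line 17179869184 25` (16 GiB of RAM, 25 percent) prints `vfs.zfs.arc_max="4294967296"`. On a real system one might run `arc_max_line "$(sysctl -n hw.physmem)" 25`, add the resulting line to /boot/loader.conf, and reboot. Note that, as reported above, capping the ARC did not stop the crashes in this case.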
 
Yes, when I used ZFS I always had to limit the max ARC size. These days I've decided I don’t need the ZFS overhead to run a desktop at home.
 