Complete System Freeze

I have a production FreeBSD 11 server that runs basic services including a file server. Over the weekend the whole system just froze. I have looked at all logs and can't find any problems. I plugged in a monitor and keyboard directly to the system and was not offered a login prompt. I power cycled the system and everything returned to normal.

I have read another thread on this site about a similar problem, but it did not solve the problem for me. I am quite new to FreeBSD, so if there is some obvious place I should be looking, please mention it. The system is very new and very lightly taxed. I do not think I am anywhere near running out of memory, but if there is some way to see what the memory status was before the freeze, I don't know what it is.
 
SirDice. Nope. Not a virtual machine. Just a plain old headless server. It has 4 SD discs under ZFS, plenty of RAM and is not worked hard. This is the second time it has happened. The first was about a month ago. There are no signs of trouble prior to the freeze.
 
What kind of hardware does the machine have? Could it possibly be heat related?
 
The system is in a server closet along with several other computers. I have never had a system fail like this in the past, but anything is possible. That being said, the closet is not uncomfortably warm when you walk into it. Is that sort of information logged anywhere? I know that on certain machines the BIOS shows the computer's temperature. I don't think the BIOS on that machine does. Are there any typical errors or log messages that occur if you have a temperature problem?
 
This does not always apply, but a full system freeze often points to a memory problem somewhere. My advice would be to grab some kind of memory tester and let it run for a while. It'll be a drag because it takes its time, but it's a sure way to rule out these issues.
 
Next time this happens, look at the messages on the console. I've had boxes crash, and from the outside they are mostly indistinguishable from hung, but if you look at the console, you see error messages (for example a kernel panic). Sometimes when this happens, the console's "page up" key still works, and you can page back and see the whole problem develop (that by the way means that some piece of the code is still running). Sometimes you can still ping the machine, but nothing else functions on the network (no login, doesn't serve new connection requests on other protocols). Finding out what still works may help diagnose the root cause.
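A rough sketch of how "finding out what still works" could be checked from a second machine, in Python. This is only an illustration under assumptions: the function names and result labels are made up here, and it relies on the system `ping` command accepting `-c 1`. Note that a successful TCP connect only shows that the kernel's network stack answered, not that the service behind the port is healthy.

```python
import socket
import subprocess


def answers_ping(host, timeout=5.0):
    """Return True if the system `ping` gets a reply to one echo request."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No reply in time, or no usable `ping` binary on this machine.
        return False
    return result.returncode == 0


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes in time.

    The handshake is done by the remote kernel, so True means the network
    stack is alive; it does not prove the daemon can still serve requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(ping_ok, tcp_ok):
    """Turn the two probe results into a short label (labels are made up)."""
    if ping_ok and tcp_ok:
        return "up"
    if ping_ok:
        return "ping only"
    if tcp_ok:
        return "tcp only"
    return "silent"
```

During a freeze, something like `classify(answers_ping(host), tcp_port_open(host, 22))` could be run from another box, where `host` is the frozen server's address. "ping only" would match the case described above, where the machine still pings but accepts no logins.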
 
I will pay more attention to those log messages on the screen next time. Another idea I had, to see whether the system is totally hung, would be to start a cron job that writes a message to a file on a remote server every half hour or so. If something were still alive in there, the cron messages might continue. Of course, this only happens maybe once a month, so I would be waiting a while for results, if they ever come at all.
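For what it's worth, the heartbeat idea could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: the line format, the function names, and the 30-minute interval with 5 minutes of slack. How the line reaches the remote file (ssh, syslog, an NFS mount) is left open.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed format for the heartbeat lines


def heartbeat_line(now):
    """Build the line the cron job would append to the remote file."""
    return now.strftime(STAMP_FORMAT) + " alive"


def parse_heartbeats(lines):
    """Extract the timestamps from heartbeat lines, skipping blank lines."""
    stamps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        stamps.append(datetime.strptime(line.split()[0], STAMP_FORMAT))
    return stamps


def find_gaps(stamps, interval=timedelta(minutes=30), slack=timedelta(minutes=5)):
    """Return (last_seen, next_seen) pairs where heartbeats went missing."""
    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        if later - earlier > interval + slack:
            gaps.append((earlier, later))
    return gaps
```

After a freeze, the first element of a gap is the last time cron was known to run, which narrows down when the hang started. If no gap shows up even though the console was dead, then userland was still running and only part of the system was stuck.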
Is there no possibility that ZFS is related in any way? I don't have any reason to suspect it, other than that it is the only major thing that is different from a "typical" traditional setup.
 
I have 16 GiB of RAM on my system. I have found a few people on the Internet mentioning that ZFS can cause the system to freeze if the ZFS ARC size is not appropriate. I am reading what I can on this topic now and learning how to potentially adjust it, but if this means anything to anyone, please chime in.


I found this page that seems to be on target, assuming I even have a ZFS problem. The test values described at the bottom of the page assume 1 GiB of memory. I could just scale up their numbers, I guess.
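If I did scale the numbers, the arithmetic would be something like this Python sketch. The helper names are made up, and the 512 MiB input in the usage note is a placeholder, not a value taken from that page. `vfs.zfs.arc_max` is the loader tunable that caps the ARC size; whether scaling linearly with RAM is even a sensible policy is an open question.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def scale_for_ram(value_bytes, reference_ram_bytes, actual_ram_bytes):
    """Scale a tunable sized for one amount of RAM up to another amount.

    Plain proportional scaling with integer arithmetic; this is an assumed
    policy for illustration, not a recommendation.
    """
    return value_bytes * actual_ram_bytes // reference_ram_bytes


def loader_conf_line(name, value_bytes):
    """Format a tunable the way it would appear in /boot/loader.conf."""
    return '{}="{}"'.format(name, value_bytes)
```

For example, a placeholder value of 512 MiB on a 1 GiB reference machine becomes `scale_for_ram(512 * MIB, 1 * GIB, 16 * GIB)`, which is 8 GiB, and `loader_conf_line("vfs.zfs.arc_max", 8 * GIB)` gives the line to try in /boot/loader.conf.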

If this was in fact a ZFS problem how could I prove it? Are these sorts of errors logged anywhere?
 
I very much doubt it's ZFS. Not unless you enabled dedup. I have an 11 TB RAID-Z pool on an 8GB server. Never had any memory issues. Never tweaked anything either. The only time I had weird stalls was when I had dedup enabled. But the stalling was limited to I/O. Network was still functional, I could ping the box but not login remotely.
 
Long overdue follow-up: I ended up taking the computer to a repair shop to have them look at it. Given that the problem sometimes took a month to present itself, it was difficult to fix. I left the system in the shop for five months. They could not figure it out either, since it continued to happen even after each piece of hardware in the system was replaced one by one, except for the motherboard and processor. Eventually, they ended up sending it back to the manufacturer (of the motherboard), who kept it for a few days and then declared it to be working (since nothing went wrong over the course of three days... but given the nature of the problem this proved nothing, of course). The guy from the repair shop says that he was never able to get anyone from the manufacturer to say exactly what they did, but he believes that they updated some of the firmware. It has been working fine ever since. A pretty anti-climactic ending, but that is how it went.
 
Was it worth it? I mean, did you end up spending more than buying a new computer by going to a repair shop and/or sending it back?

I had a Clevo laptop (Haswell i7 46XX with 4 cores; I've forgotten the details) that was having random reboots (on any OS: I tested FreeBSD 10, Windows, and Linux) and went through a similar hassle of replacing RAM and hard disks. It was the first (and last) laptop I bought brand new. I suspect there is some damage to the CPU/mainboard because I found a huge puddle of thermal compound on the board, which probably shorted and fried something. I cleaned it up as well as I could; now the system runs much cooler but it is still flaky. /var/log/whatever shows nothing unusual. Linux's mcelog never logged anything. Some processor testing utility from Intel said it was good.

I didn't even try a repair shop. Most of the ones I have passed by look like they specialize in phones and just, well... seem dodgy.

The retailer I got the laptop from wants 300-400 USD to fix it. Getting a replacement CPU to swap in is also well over 300 USD. So I just got a new laptop, and the old Clevo is sitting in the closet waiting for a drop in Haswell CPU prices.
 
Similar story: Many years ago, our research group had custom motherboards manufactured. Logically, they were pretty standard AMD K6 motherboards, with CPU socket, DIMMs, peripheral connectors, power distribution. Should be boring, right?

Most worked excellently. Some didn't work at all, and we couldn't bring them up. Some were flaky. After a while we discovered that the flaky ones would change behavior (crash, or start working again) when you mechanically flexed the boards. Since we didn't have time to debug that kind of nonsense, we simply screwed them down tight so they wouldn't move, but they still occasionally crashed. Probably mechanical stress induced by thermal variations, based on both environmental factors, and workload changes.

Eventually, we ran out of motherboards, and manufacturing another full run of them was too expensive, but we needed a few more. Being in Silicon Valley, we asked a local consulting company whether they could diagnose what was wrong with the completely defective boards, and whether they could revive them. After X-raying the boards, they discovered that the soldering had been done badly: Some of the BGA chips and sockets had not actually fully melted solder connections, but unmelted solder balls stuck between the boards and the pads. They ended up repairing a few boards by locally reheating them, until the X-rays looked good. This wasn't cheap (thousands of $ per board to repair), but manufacturing a new batch of boards would have been hundreds of thousands, so we got a dozen working boards out of it. We figured out the root cause of the problem very quickly: the boards had been "assembled" (meaning placing the chips and soldering them into place, which today is a very high-tech process) by an in-house group, which had fundamentally no process control, no quality mindset, and no experience with boards as big and complex as a whole motherboard.

What do we learn from this? Electronics can be unreliable, if it is manufactured sloppily. Fixing an assembled PC board is very hard and expensive, but possible. Only buy from high-quality suppliers: the few $$$ extra you spend up front saves you lots of aggravation later. And even good and expensive suppliers can have such problems occasionally; a while ago, Apple had issues with the graphics chips on some laptops having solder joint problems.
 
We had a not-so-similar story but I think still worth sharing here: We had some desktops (running various operating systems) that, after some time without any glitches, started showing inexplicable console freezes, from time to time.

When those freezes happened, the machines remained perfectly accessible from network but didn't react to either keyboard or mouse commands.

After a lot of tests we discovered that a certain mouse brand (whose name I'll not say in public), after some wear, had a cord that sometimes briefly short-circuited the power and signal wires of the USB port where it was plugged in.

That brief short-circuit was enough to put the chip controlling the USB bus out of service until the next power cycle. Without the USB bus, both mouse and keyboard were dead in the water, and that was what made us think the console was frozen.

We got rid of those crappy mice and the freezing never happened again.
 
We once had a system that only worked in winter, not summer, and was flaky in between. The customer assured us that it was a tried and tested design and that it was our problem. It turned out that some pull-down was missing, and the conductivity of humid summer air was enough to fix it, but dry winter air would not. Tried and tested, maybe. But to what result?
 