Debugging system hang-ups

I'm running into a problem where my system (running FreeBSD 11.0-RELEASE-p9) becomes unresponsive, and I'm looking for some advice on how to debug it.

When it happens, the system does not respond to any inputs. I am always in X when it happens, but I can't say whether X is actually the problem. No keyboard or mouse input seems to be recognized, and I can't ssh into the system (it always times out).

All that I can tell is that the fans on my system kick in, which suggests that CPU usage is very high. Again, I have no idea what is actually using it.

I do not know how to trigger the problem. It is an intermittent issue. I've checked the logs and nothing of note is listed. dmesg doesn't report anything out of the ordinary. (I find it very odd that seemingly nothing gets logged.) I've let the system run in this state for up to 20 minutes and nothing changes.

I'm not asking here for help on debugging exactly why my system is doing this (which is why I have deliberately left out what I'm running on the system). I'm looking for advice on what I can use or do to debug it. Is there an option I can pass to the kernel to get more logging or debug info? If so, how much logging output will it produce and can I force it to be written to a file? And so forth.

In short, what's an accepted/effective way to collect information on a system that hangs intermittently, and does not allow introspection when the hanging occurs?
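One low-tech way to approach this question is a "flight recorder": snapshot some state to disk at a short interval, so the last entry written before a hang survives the reset. The sketch below is only an illustration. The function names `snapshot` and `record`, the log path, and the choice of commands are assumptions, not something taken from this thread.

```shell
#!/bin/sh
# Hypothetical flight recorder: append timestamped command output to a log
# and flush it to disk, so the last snapshot before a hang can be read later.

# snapshot LOGFILE CMD...: run each command once and append its output.
snapshot() {
    log=$1
    shift
    {
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        for cmd in "$@"; do
            printf '%s %s\n' '---' "$cmd"
            # A failing command must not stop the recorder.
            sh -c "$cmd" 2>&1 || true
        done
    } >> "$log"
    sync    # ask the kernel to write buffered data out to disk
}

# record LOGFILE INTERVAL CMD...: take a snapshot every INTERVAL seconds.
record() {
    log=$1
    interval=$2
    shift 2
    while :; do
        snapshot "$log" "$@"
        sleep "$interval"
    done
}
```

Assumed usage, run in the background before the hang is expected: `record /var/log/hang-recorder.log 10 'ps auxww' 'vmstat -i' 'swapinfo -k' &`. After a reset, the tail of the log shows the state from at most one interval before the system stopped responding.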

For the record, here's the output of uname -a. I'm happy to provide more info if anyone thinks it will help.
Code:
FreeBSD freebsd.local 11.0-RELEASE-p9 FreeBSD 11.0-RELEASE-p9 #0: Tue Apr 11 08:48:40 UTC 2017     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
 
All that I can tell is that the fans on my system kick in, which suggests that CPU usage is very high.
This might actually be a symptom, not a cause. When the system hangs, the CPU fans can default to full speed to prevent damage: with nothing in control any more, there is no way to adjust the speed dynamically to the temperature.

Try setting the sysctl(8) debug.kdb.enter to 1. Then, when it hangs, try hitting Ctrl-Alt-Esc. As far as I know this should be possible with a GENERIC kernel, but I'm not 100% sure. Try it when the system still runs normally. The key combination should drop you into the kernel debugger. If it works normally but not when the system hangs, then the system is hung hard and I would suspect a hardware issue.
 
Thanks, I'll give that a try and see what I find out. It's happened again since I started the topic and I managed to rule out some guesses. I also had a process listing viewable when it happened, and didn't see anything strange. I'm expecting something hardware/driver related if my embedded experience is any indication.
 
For the record, kernel debugging isn't enabled on GENERIC (well, the one I have, anyway).

Code:
root@freebsd:~ # sysctl debug.kdb.enter=1
debug.kdb.enter: 0 -> 0
root@freebsd:~ # sysctl debug.kdb
debug.kdb.alt_break_to_debugger: 0
debug.kdb.break_to_debugger: 0
debug.kdb.trap_code: 0
debug.kdb.trap: 0
debug.kdb.panic: 0
debug.kdb.enter: 0
debug.kdb.current:
debug.kdb.available:

I'll go about building a kernel with debugging support and do as was recommended.
 
kernel debugging isn't enabled on GENERIC (well, the one I have, anyway)
Yeah, I rechecked this, it's enabled on GENERIC for -STABLE but not on -RELEASE.

I'll go about building a kernel with debugging support and do as was recommended.
Yes, as far as I know this is the only way to find out why (or where) it's hanging. Also make sure it's not some hardware related issue, like memory errors. I also recommend checking for a BIOS/UEFI update if it's available.
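
For the debug kernel mentioned above, here is a minimal sketch of a config that inherits from GENERIC and adds the debugger options. The config name DEBUGKERN and the helper `write_debug_kernconf` are made up for illustration, and the build steps in the comments assume an amd64 system with the matching sources installed in /usr/src.

```shell
#!/bin/sh
# write_debug_kernconf DIR: create a kernel config named DEBUGKERN in DIR.
# DEBUGKERN is a hypothetical name; any identifier would do.
write_debug_kernconf() {
    conf_dir=$1
    cat > "$conf_dir/DEBUGKERN" <<'EOF'
# Inherit everything from GENERIC, then add debugger support.
include GENERIC
ident   DEBUGKERN

options KDB    # kernel debugger framework
options DDB    # interactive in-kernel debugger backend
EOF
}

# Assumed usage, as root:
#   write_debug_kernconf /usr/src/sys/amd64/conf
#   cd /usr/src
#   make buildkernel KERNCONF=DEBUGKERN
#   make installkernel KERNCONF=DEBUGKERN
#   shutdown -r now
```

After rebooting into the new kernel, `sysctl debug.kdb.available` should list at least one backend instead of the empty value shown earlier in the thread.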
 
It showed up again and didn't jump into the debugger, so I'm guessing that means whatever it is, it doesn't trigger a kernel panic.

Seems like a hardware issue that the kernel is not able to handle. The BIOS is up-to-date. I'm running memory tests just to be comprehensive, but if I had to guess, it's something to do with the display. I have an LG 34UM95 with the i915 driver. I think the driver is fine, but I wouldn't be surprised if the monitor is the problem. It's been flaky on Windows too.

I'll run headless for a while and see if that helps matters.

Thanks for telling me how to get kdb working.
 
1. What's your i915 GPU model?
2. Periodically check swap status (swapinfo). For example, does swap usage grow after starting Java-based apps, or after extended web browser use? Which browser?
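
A rough sketch of how the periodic swap check could be automated, so there is a record to look at after a hang. The function names `swap_used_kb` and `log_swap` and the log path are assumptions; the parser expects the usual `swapinfo -k` column layout (Device, 1K-blocks, Used, Avail, Capacity).

```shell
#!/bin/sh
# swap_used_kb: read `swapinfo -k` output on stdin and print the used
# swap in KiB, summed over all devices. The header line and the "Total"
# line (printed when there are several devices) are skipped.
swap_used_kb() {
    awk 'NR > 1 && $1 != "Total" { used += $3 } END { print used + 0 }'
}

# log_swap LOGFILE: append one "timestamp used_kb" line to LOGFILE.
log_swap() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(swapinfo -k | swap_used_kb)" >> "$1"
}
```

Assumed usage: call `log_swap /var/log/swap-usage.log` once a minute from cron or a loop, then look for steady growth in the second column after starting the browser or a Java application.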

It's most likely NOT the monitor that locks up the system. If a hardware failure is the issue, it will be the GPU before the monitor.

If you suspect your hardware, run a stress-test Linux distro from CD or USB.
 
It's most likely NOT the monitor that locks up the system. If a hardware failure is the issue, it will be the GPU before the monitor.

The reason I suspect it has something to do with the monitor is that if I attempt to use the audio device on it (it connects through DisplayPort), I get a screen that says "Out of range" and the display goes away. On Windows, the audio has also stopped working, even with the latest drivers. Windows sometimes doesn't detect that the monitor is there without a reboot. Sure, that could be Windows, but there are enough little things that seem to happen with it that it's making the "spidey sense" tingle.

Also, I've had it repaired once already after it failed to recognise inputs of any kind. After the repair I only ran it with Windows. It seemed fine for a while, then the little things started showing up. I put it back on my FreeBSD system (which was running without a monitor -- and without problems -- the whole time) and it wasn't long after that the lock-up problems started.

Granted, that is not a precise account but if I had to pick a prime suspect, it would be the monitor and the peripherals on it. It's the one thing that has changed on the system, aside from whatever is delivered through regular updates.

I do agree that it seems odd.

When it comes to memory usage, I don't think I've hit swap in a while. I run Firefox, Emacs, and urxvt, and not much else. I tend to kill Firefox about once a day, just as a habit. Whenever I've checked memory usage, I rarely see it getting past 8GB out of 16GB.

The GPU is the onboard GPU on an Intel i7-4790 Haswell.

I'll give the stress-test a try. Thanks for the tip.
 
The reason I asked about memory use is that the next-generation (drm-next) graphics kernel has a very bad memory leak. Although you're probably not on the drm-next kernel, I was curious whether something had occurred to cause a similar symptom in your case.

As for the monitor, you can always test your hypothesis by switching to a spare or, in the case of a laptop, connecting an external monitor and seeing how that performs.
 
As for the monitor, you can always test your hypothesis by switching to a spare or, in the case of a laptop, connecting an external monitor and seeing how that performs.

That's the plan.

I did run a stress test using StressLinux and I didn't hit any problems. Thanks for the advice.
 
Just a follow up to close this off:

I ran the system for a week or so without the monitor attached with no problems. I now have a different monitor and have been using it for a week, also without problems.

It looks like the monitor was causing my system hangups, for whatever reason. I don't have the time to debug it, nor the inclination, so I'm not going to delve any further into this.
 