FreeBSD intermittent crash!

Hi,

We're new to FreeBSD as well as this forum, so please pardon any mistakes here.

We've switched to FreeBSD recently because of its improved ARC caching and asynchronous performance, but so far our experience with it has not been very good. It crashes every 2-3 days and we're unable to track down the problem. The server specs are pretty high:


Supermicro X5690 (12 cores, 24 threads - 2u)
96GB RAM
12x3TB RAID-10 (HBA-LSI9211)
X8DT3 Board
Supermicro PS- 902-1R 900W

Here is a screenshot of the most recent crash:

http://prntscr.com/9er3pk

One thing worth mentioning: before it goes down there's no load on the server, and free RAM is usually around 12GB.

Please help us resolve this issue. It's really killing us. :(
 
Thanks for the quick response. Here it is:
Code:
FreeBSD 10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014     root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
 
I'll let the more experienced folks suggest what could be causing the crash, but it looks like you are running a base, unpatched version of 10.1-RELEASE. Consider applying the patches. One of my machines, for example, is running 10.1-RELEASE-p24, and it's likely there is another, higher patch level available. It's possible the cause of your issue has already been addressed, and keeping the system patched and up to date is good practice.

Start reading this whole document as time permits, but patching and updating are covered in Section 23.2.

http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/
 
Thanks a lot. I am about to apply patches. I'll update here soon. :)

Meanwhile looking forward to more expert advice. :)
 
You may want to check for hardware errors, the mca_intr call refers to Machine Check Architecture. But I'm not sure if it's the cause of the panic(9).

You said it crashes every 2-3 days, but the picture shows it has been up almost 12 days.
 
Right, thanks. Regarding updating, I've found tons of patches and am about to update now, but one point is very important before the upgrade takes place: is there any chance of zpool corruption after the upgrade? We have around 16TB of data in the zpool. Sorry for the newbie question, but I am new to FreeBSD.
 
There shouldn't be a risk to your existing pools, but it's always a good idea to have proper backups of course.
 
You may be in for a minor shock if all of the following apply:
  • The system boots from the pool in question
  • The ZFS version changed between the release you had before and the one you upgraded to
  • You run # zpool upgrade and forget to also upgrade the boot code

This may render your system unbootable, because the old boot code would not be able to read the ZFS filesystem the system has to boot from. However, no data will be lost, and you can fix it by booting a live system new enough to support the same ZFS version as your upgraded pool and running # gpart bootcode <required params>.
 
I am trying to update the system using freebsd-update(8) install, but the output is really insane:

Code:
Installing updates...install: ///usr/src/contrib/file/magic/Magdir/kerberos: No such file or directory
install: ///usr/src/contrib/file/magic/Magdir/meteorological: No such file or directory
install: ///usr/src/contrib/file/magic/Magdir/qt: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver40-ja.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver45.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/driver46.html: No such file or directory
install: ///usr/src/contrib/ntp/html/drivers/mx4200data.html: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/accopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/audio.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/authopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/clockopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/command.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/config.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/confopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/external.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/hand.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/install.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/manual.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/misc.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/miscopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/monopt.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/refclock.txt: No such file or directory
install: ///usr/src/contrib/ntp/html/scripts/special.txt: No such file or directory
install: ///usr/src/contrib/ntp/include/declcond.h: No such file or directory
install: ///usr/src/contrib/ntp/include/intreswork.h: No such file or directory
install: ///usr/src/contrib/ntp/include/lib_strbuf.h: No such file or directory
install: ///usr/src/contrib/ntp/include/libntp.h: No such file or directory
install: ///usr/src/contrib/ntp/include/ntp_assert.h: No such file or directory
 
I haven't seen a zpool upgrade happen within security/errata patches. If he upgraded from 10.1-RELEASE to 10.2-RELEASE this might be the case.

All of my research on MCA panics reported to the mailing lists so far seems to indicate a hardware issue -- usually the processor.

As a workaround, just create the missing directories and run the fetch and install commands again.

You can just remove src from Components in /etc/freebsd-update.conf to make those messages go away with future updates.
 
Thanks guys for the workaround. I created the missing directories, updated, and rebooted the OS.

Code:
[root@cw001 ~]# uname -a
FreeBSD 10.1-RELEASE-p24 FreeBSD 10.1-RELEASE-p24 #0: Mon Nov  2 12:17:28 UTC 2015     root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC  amd64
 
Thanks for the tips. I'll monitor this server to see if it crashes again.
 
Of course, and don't forget that SirDice asked you to check for hardware errors. Regardless of whether you have crashes or not, updating to the latest patch release is good practice.
 
You're missing the important information about the crash in the picture: the panic message. Only the backtrace is shown.
Do you have crash dumps configured? If so, you can find the text info in /var/crash/core.txt.$N by default, where N is the number of the last crash.

If you don't have it set up, look at dumpdev in /etc/rc.conf.

Does it crash regularly? (A 12-day uptime doesn't fit the "crashes every two days" description.)
Is some heavy job scheduled to run during that period? You said no, but were you logged in and monitoring just before it crashed?

When it comes to FreeBSD, I'd push for version 10.2, as the developers improve performance with every release.

As SirDice mentioned already, do check for HW issues, especially with non-ECC RAM. Running Memtest86+ for a few hours, for example, could reveal memory problems.
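
For anyone who has a core.txt.N but isn't sure what to pull out of it, here is a minimal Python sketch, assuming the file contains lines that start with "panic:" and, for machine checks, lines that start with "MCA:". The function names are made up for this post and are not part of any FreeBSD tool; a plain grep for those prefixes would do the same job.

```python
def summarize_crash_text(text):
    """Return (unique panic strings, MCA lines) found in crash-summary text."""
    panics = []
    mca_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("panic:"):
            # The same panic line can be printed more than once; keep one copy.
            if line not in panics:
                panics.append(line)
        elif line.startswith("MCA:"):
            mca_lines.append(line)
    return panics, mca_lines


def summarize_crash_file(path):
    """Read a crash summary such as /var/crash/core.txt.0 and summarize it."""
    # errors="replace" because the file can contain stray non-UTF-8 bytes.
    with open(path, encoding="utf-8", errors="replace") as fh:
        return summarize_crash_text(fh.read())
```

Calling summarize_crash_file("/var/crash/core.txt.0") as root returns just the panic string and any MCA lines, which is usually all that needs to be posted. That avoids attaching the whole file.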
 
Thanks for the detailed answer. Yes, dump is configured and I can find a big core.txt.0 text file. I don't know how to debug it to find the cause of the crash, so I am attaching it here.
 

I was looking for the panic string only. The information in core.txt is confidential to some degree; nowadays I'd rather be more paranoid than not.
Remove it from the attachment.

The interesting part is:

Code:
panic: Unrecoverable machine check exception

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 5
MCA: CPU 5 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 4
MCA: CPU 4 UNCOR PCC internal timer error
MCA: Address 0x802bf6e59
MCA: Misc 0x0
panic: Unrecoverable machine check exception
cpuid = 7

You can search the forums here for this MCA string. At first glance I would focus on the CPU and memory.

Is it a false alarm, or is it really a HW problem? I don't know right now; I would need to google around too. Some search results suggest an issue with the motherboard firmware (BIOS). I'd check that too (compare the board's firmware/BIOS version to the vendor's latest update, etc.).
 
My first assumption would be that there really is something wrong with the hardware. I would take the issue to the manufacturer of the motherboard and look at their documentation and web support.
 
Guys, the same server rebooted on its own again, and the zpool didn't even mount itself although ZFS is enabled in rc.conf and loaded in loader.conf. Here is the panic log:

Code:
panic: Unrecoverable machine check exception

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
MCA: CPU 2 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
panic: Unrecoverable machine check exception
cpuid = 8
KDB: stack backtrace:
#0 0xffffffff80962d90 at kdb_backtrace+0x60
#1 0xffffffff80927eb5 at panic+0x155
#2 0xffffffff80e3bfeb at mca_intr+0x6b
#3 0xffffffff80d24c09 at trap+0x99
#4 0xffffffff80d0aec2 at calltrap+0x8
#5 0xffffffff80361eea at acpi_cpu_idle+0x13a
#6 0xffffffff80d0f89f at cpu_idle_acpi+0x3f
#7 0xffffffff80d0f940 at cpu_idle+0x90
#8 0xffffffff80953585 at sched_idletd+0x1d5
#9 0xffffffff808f88fa at fork_exit+0x9a
#10 0xffffffff80d0b3fe at fork_trampoline+0xe
----------------------

Where should I look? :( Some people suggest disabling the MCA panic by adding hw.mca.enabled=0 to /boot/loader.conf.

Please help :(
 
Don't disable MCA, it's reporting hardware errors. You can use sysutils/mcelog to translate those MCA messages:
Code:
dice@test:~ % mcelog --ascii --no-syslog
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
MCA: CPU 2 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5
MISC 0 ADDR 802bf6a69
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0
CPUID Vendor Intel Family 6 Model 44
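
To show where those mcelog lines come from, here is a rough Python sketch that decodes the Status value by hand. It assumes the architectural bit layout of IA32_MCi_STATUS as described in the Intel SDM machine-check chapter: flag bits 63 down to 57, the MCA error code in bits 15:0, and a model-specific code in bits 31:16. The function and table names are made up for illustration; mcelog remains the tool to trust.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
STATUS_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Simple MCA error code 0000 0100 0000 0000 means "internal timer error".
INTERNAL_TIMER_ERROR = 0x0400


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status(0xBE00000000800400)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]),
      decoded["mca_error_code"] == INTERNAL_TIMER_ERROR)
```

For the status in this thread, the set flags are UC, EN, MISCV, ADDRV and PCC (plus VAL), and the error code is the internal timer one. This matches what mcelog printed above.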
 
Thanks, here is the output:

Code:
[root@cw001 /var/crash]# mcelog --no-dmi --ascii --file core.txt.1 
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 3 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 5 
MISC 0 ADDR 802bf6a69 
MCG status:MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
MCGCAP 1c09 APICID 2 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 44
 
You have hardware errors. No amount of fiddling with software settings is going to change that fact.
 
Thanks SirDice, is there a way I can find out which specific hardware component is causing this panic?
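
One thing that may help narrow it down is tallying which CPUs and banks the MCA records name across all the crashes. Below is a minimal Python sketch, assuming the "MCA:" line format shown in the kernel message buffer above; the function name and regular expressions are my own, not from any tool. It cannot identify a physical part by itself. It only shows whether the reports cluster on one CPU, core, or bank.

```python
import re
from collections import Counter

BANK_RE = re.compile(r"^MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")
APIC_RE = re.compile(r'^MCA: Vendor "[^"]*", ID 0x[0-9a-fA-F]+, APIC ID (\d+)')
CPU_RE = re.compile(r"^MCA: CPU (\d+) (.+)$")


def tally_mca_records(text):
    """Count MCA records per (cpu, apic_id, bank, description)."""
    counts = Counter()
    bank = apic = None
    for raw in text.splitlines():
        line = raw.strip()
        m = BANK_RE.match(line)
        if m:
            # A "Bank" line starts a new record.
            bank, apic = int(m.group(1)), None
            continue
        m = APIC_RE.match(line)
        if m:
            apic = int(m.group(1))
            continue
        m = CPU_RE.match(line)
        if m and bank is not None:
            counts[(int(m.group(1)), apic, bank, m.group(2))] += 1
    return counts


# Records copied from the second panic log in this thread.
sample = """\
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
MCA: Bank 5, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x206c2, APIC ID 2
MCA: CPU 2 UNCOR PCC internal timer error
MCA: Address 0x802bf6a69
MCA: Misc 0x0
"""

for key, n in sorted(tally_mca_records(sample).items()):
    print(key, n)
```

In the two crashes posted so far, every record is bank 5 with the same internal timer error, on CPUs 4 and 5 the first time and 2 and 3 the second, and mcelog reports SOCKETID 0. If that pattern holds over more crashes, it might point at the processor in that socket or at something feeding it (board, firmware, power). Only swapping parts or a vendor diagnosis can separate those.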
 