ProLiant DL360 G5 + FreeBSD 11.0-RELEASE-p1 + MPD reboots exactly after 24 hours

Vladimir Vlasov · Apr 16, 2017

Hello

I have a ProLiant DL360 G5 running FreeBSD 11.0-RELEASE-p1 with MPD and OpenBGP.
Server config:
CPU: Intel(R) Xeon(R) CPU 5160 @ 3.00GHz - 2x2 Core
RAM: 8GB
Network: HP NC373i Multifunction Gigabit Server Adapter (bce driver)

The MPD service handles 550-580 incoming connections, and the server also runs BGP.
Peak CPU load: less than 40%
Peak network load: about 600 Mbit/s (bce0 - uplink, bce1 - VLANs, MPD incoming)
Peak PPS: about 120k

The system reboots exactly 24 hours after it starts.
Messages right before the reboot:

Code:

Apr 14 14:10:37 mpd-bgp kernel: bce0: /usr/src/sys/dev/bce/if_bce.c(7886): Watchdog timeout occurred, resetting!
Apr 14 14:10:37 mpd-bgp kernel: bce0: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan3212: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: bce1: /usr/src/sys/dev/bce/if_bce.c(7886): Watchdog timeout occurred, resetting!
Apr 14 14:10:37 mpd-bgp kernel: bce1: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2010: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2008: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2009: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan1999: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan4: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2680: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2015: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2152: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan7: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2153: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2013: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2543: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2320: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2002: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2542: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2151: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2003: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2541: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan10: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2540: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2520: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2021: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2006: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2521: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2020: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2007: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2102: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2760: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2101: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2004: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2005: link state changed to DOWN
Apr 14 14:10:38 mpd-bgp kernel: bce0: discard frame w/o leading ethernet header (len 0 pkt len 0)

After that, the normal system boot messages follow.
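
For anyone comparing similar logs, here is a minimal sketch (not part of the original setup) that summarizes which interfaces logged a driver watchdog timeout and how many links went down. The sample lines are copied from the log above with the wrapped "Watchdog" line re-joined; the function name `summarize` and the regular expression are my own assumptions about the syslog layout.

```python
import re
from collections import Counter

# Sample lines copied from the log above (wrapped lines re-joined).
LOG = """\
Apr 14 14:10:37 mpd-bgp kernel: bce0: /usr/src/sys/dev/bce/if_bce.c(7886): Watchdog timeout occurred, resetting!
Apr 14 14:10:37 mpd-bgp kernel: bce0: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan3212: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: bce1: /usr/src/sys/dev/bce/if_bce.c(7886): Watchdog timeout occurred, resetting!
Apr 14 14:10:37 mpd-bgp kernel: bce1: link state changed to DOWN
Apr 14 14:10:37 mpd-bgp kernel: vlan2010: link state changed to DOWN
"""

# Assumed layout: "<Mon DD HH:MM:SS> <host> kernel: <ifname>: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (\w+): (.*)$")


def summarize(text):
    """Return (watchdog events, count of link-down events by interface kind)."""
    watchdog = []      # list of (timestamp, interface) tuples
    down = Counter()   # "physical" / "vlan" -> number of link-down lines
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        stamp, ifname, msg = m.groups()
        if "Watchdog timeout occurred" in msg:
            watchdog.append((stamp, ifname))
        elif msg == "link state changed to DOWN":
            kind = "vlan" if ifname.startswith("vlan") else "physical"
            down[kind] += 1
    return watchdog, down


if __name__ == "__main__":
    events, down = summarize(LOG)
    for stamp, ifname in events:
        print(f"{stamp}: watchdog timeout on {ifname}")
    print(dict(down))
```

On the sample it reports bce0 and bce1 timing out within the same second, which matches what the full log shows.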

The kernel is built without IPv6. I've tried many variations of sysctl parameters.
Current configuration:
/boot/loader.conf

Code:

geom_mirror_load="YES"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
zfs_load="YES"

net.link.ifqmaxlen=2048

net.isr.maxthreads=2
net.isr.bindthreads=1

kern.maxusers=1024
net.graph.maxdata=65536
net.graph.maxalloc=65536

net.inet.tcp.soreceive_stream=1

hw.bce.verbose=1
hw.bce.tso_enable=0
hw.pci.enable_msix=0

/etc/sysctl.conf

Code:

vfs.zfs.arc_max=4294967296

net.inet.tcp.sendspace=131072
net.inet.tcp.recvspace=131072

net.inet.icmp.drop_redirect=1
kern.ipc.somaxconn=32768
net.inet.tcp.sendbuf_inc=16384
kern.ipc.maxsockbuf=2621440
net.graph.recvspace=1024000
net.graph.maxdgram=1024000

net.inet.ip.portrange.first=1024
net.inet.ip.portrange.last=65535
kern.ipc.nmbclusters=262144
kern.maxvnodes=1000000
net.inet.tcp.maxtcptw=280960
net.inet.tcp.nolocaltimewait=1
net.inet.icmp.icmplim=2000

security.bsd.see_other_uids=0
security.bsd.unprivileged_read_msgbuf=0
security.bsd.unprivileged_proc_debug=0

There's absolutely no relation between the reboots and system load. 14:00 is a time with less than a quarter of peak system load: CPU ~10%, network ~140 Mbit/s, RAM ~2 GB (of 8 GB total).
There is no visible reason for it, yet the system reboots exactly after 24 hours.
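
To check how close to 24 hours the interval really is, the uptime can be computed from the boot timestamp and the timestamp of the last message before the reset. This is only a sketch: the reset time is taken from the log above, but the boot time below is a hypothetical placeholder, and the year has to be supplied because syslog timestamps do not carry one.

```python
from datetime import datetime, timedelta


def parse_syslog_stamp(stamp, year=2017):
    """Parse a syslog-style 'Mon DD HH:MM:SS' stamp; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def uptime_before_reset(boot_stamp, reset_stamp, year=2017):
    """Time elapsed between boot and the last message before the reset."""
    return parse_syslog_stamp(reset_stamp, year) - parse_syslog_stamp(boot_stamp, year)


if __name__ == "__main__":
    boot = "Apr 13 14:10:30"    # hypothetical boot time, not from the log
    reset = "Apr 14 14:10:37"   # watchdog message time from the log
    delta = uptime_before_reset(boot, reset)
    within_a_minute = abs(delta - timedelta(hours=24)) < timedelta(minutes=1)
    print(delta, within_a_minute)
```

Running this over several consecutive reboots would show whether the interval is fixed to the second or only roughly one day.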

I have seen setups with three times as many MPD connections and double the traffic, but those used an Intel 82576 (igb driver).

I can't see any cause other than the network, but why 24 hours?!
It breaks my brain.

On lists.freebsd.org I found a thread about the bce watchdog timeout:
https://lists.freebsd.org/pipermail/freebsd-stable/2015-April/082268.html

Code:

This may be caused by DMA alignment problems.
See
https://docs.freebsd.org/cgi/getmsg.cgi?fetch=145859+0+archive/2015/freebsd-stable/20150419.freebsd-stable
for a recent thread about the msk driver.  The msk maintainer Yonghyeon
Pyun has opted for super safe options of 32K alignment!

It's a long shot, but you could try increasing BCE_DMA_ALIGN and/or
BCE_RX_BUF_ALIGN in the include file if_bcereg.h, say up to 4096, to see
whether it makes any difference.

Could this help? And if I make these changes to if_bcereg.h, won't that break the network subsystem?
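
For context on what an alignment value such as 4096 means: a buffer is aligned to N when its start address is a multiple of N. The sketch below only illustrates that arithmetic; the helper names are my own, and it says nothing about how the bce driver actually uses these constants.

```python
def is_aligned(addr, align):
    """True if addr is a multiple of align (align must be a power of two)."""
    if align <= 0 or align & (align - 1):
        raise ValueError("alignment must be a positive power of two")
    return addr & (align - 1) == 0


def align_up(addr, align):
    """Round addr up to the next multiple of align (a power of two)."""
    if align <= 0 or align & (align - 1):
        raise ValueError("alignment must be a positive power of two")
    return (addr + align - 1) & ~(align - 1)


if __name__ == "__main__":
    addr = 0x1001
    print(hex(align_up(addr, 4096)), is_aligned(addr, 4096))
```

A larger alignment is stricter, so buffers that satisfy 4096-byte alignment also satisfy any smaller power-of-two alignment.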

Has anyone seen something similar?

I really love FreeBSD and don't want to migrate to Linux ((((

please F1

k.jacker · Apr 17, 2017

Hey Vladimir,
to me it looks like the reboot after 24h is not related to system load; rather, the watchdog times out because your mpd/bgp does not seem to function properly.
The watchdog timer is something that gets enabled in the BIOS, and a timeout can be chosen; yours may be set to 24 hours. The watchdog timer counts up all the time while being reset by the OS from time to time. When the given timeout value is reached (because the computer or a process hangs and the timer is not reset in time), a reboot is initiated; if that doesn't work, a hard reset will occur.
I'm not a watchdog expert, but as long as your server gets rebooted normally, I would go into the BIOS, disable the watchdog, and then see if you can track down the problem with your mpd/bgp service. It looks like that is the cause of the watchdog timeout.

I am not at home at the moment so I'm not sure... but I think there might be a /dev/watchdog where you could check the current value and see if it gets reset from time to time.

Matthias

Vladimir Vlasov · Apr 17, 2017

Thank you for your attention.

I've checked /dev/watchdog, but it doesn't exist.
The watchdogd service isn't running either.

Unfortunately, it seems to be the Broadcom internal watchdog ((

I can't check the BIOS settings because the server is full of users and is in active use right now ((
