FreeBSD 14.2. Server halting? Crontab in queue.

Hi!

I have a production backup server that runs a backup shell script every 5 minutes from root's crontab. The script uses scp to pull backup files from different servers, depending on the time of day. I installed this backup server over a year ago with no problems. I upgraded it to 14.2 on December 21, and since then the server appears to be "halting/stalling" around 0300-0430 almost every night, and a few times around 1700, which messes up the backups running at that time.
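The crontab entry itself is nothing special, roughly this (in root's crontab):
Code:
# pull backup files from the other servers every 5 minutes
*/5 * * * * /home/backup253/get.sh >/dev/null 2>&1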

I reinstalled the backup server from scratch a couple of days ago. Same halting/stalling behavior.

It's FreeBSD 14.2 on a vm-bhyve host (also 14.2, updated December 21). Everything has plenty of resources; it runs on an HPE DL380 G10 and the server is mostly idle. Other VMs run normally, but they don't run cron jobs every 5 minutes, so that may not mean much. I don't know if the backup server is actually halting for 10-15 minutes or what, but it puts everything into a queue. I do know that Zabbix gets no connections from the server and can't connect to it during that window; there is simply nothing in the charts for that period. And there is nothing in any log on any server, except one thing: the queueing (see below). I checked them all, including the VM host, and found nothing abnormal in any log. I don't see the problem on the other vm-bhyve host or its VMs so far.


Parts of /var/log/cron on the backup server (the only abnormal thing I see, apart from the message further below). Today, something happened after 0405. But what? There are no errors or strange logs on the vm-bhyve host, only on the backup server below. My script is /home/backup253/get.sh >/dev/null 2>&1 in the log and runs every 5 minutes.

As you can see, CMD (/usr/libexec/save-entropy) runs at minutes 44 and 55 (03:44 and 03:55). It should run again at minute 11, but it ran at 04:15:45; it should run at minute 22, but it ran at 04:24:57. My script is also queuing; everything is put in a queue. Zabbix lost the connection at 04:06:11 and the backup server reconnected at 04:26:16. This happens once almost every night, somewhere between 0300 and 0430 at varying times. My system logs are not big at all (logs are rotated). For reference, the stock /etc/crontab schedule for these jobs is shown after the log excerpt.

Code:
Jan  2 03:40:00 EDA-253-BackUp02 /usr/sbin/cron[65984]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 03:44:00 EDA-253-BackUp02 /usr/sbin/cron[89793]: (operator) CMD (/usr/libexec/save-entropy)
Jan  2 03:45:00 EDA-253-BackUp02 /usr/sbin/cron[97133]: (root) CMD (/usr/libexec/atrun)
Jan  2 03:45:00 EDA-253-BackUp02 /usr/sbin/cron[97976]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 03:50:00 EDA-253-BackUp02 /usr/sbin/cron[61123]: (root) CMD (/usr/libexec/atrun)
Jan  2 03:50:00 EDA-253-BackUp02 /usr/sbin/cron[61294]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 03:55:00 EDA-253-BackUp02 /usr/sbin/cron[18595]: (operator) CMD (/usr/libexec/save-entropy)
Jan  2 03:55:00 EDA-253-BackUp02 /usr/sbin/cron[19122]: (root) CMD (/usr/libexec/atrun)
Jan  2 03:55:00 EDA-253-BackUp02 /usr/sbin/cron[19142]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:00:00 EDA-253-BackUp02 /usr/sbin/cron[59380]: (root) CMD (/usr/libexec/atrun)
Jan  2 04:00:00 EDA-253-BackUp02 /usr/sbin/cron[59485]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:01:00 EDA-253-BackUp02 /usr/sbin/cron[58425]: (root) CMD (adjkerntz -a)
Jan  2 04:05:00 EDA-253-BackUp02 /usr/sbin/cron[42438]: (root) CMD (/usr/libexec/atrun)
Jan  2 04:05:00 EDA-253-BackUp02 /usr/sbin/cron[42735]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:12:36 EDA-253-BackUp02 /usr/sbin/cron[93476]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:15:45 EDA-253-BackUp02 /usr/sbin/cron[95036]: (root) CMD (/usr/libexec/atrun)
Jan  2 04:15:45 EDA-253-BackUp02 /usr/sbin/cron[95360]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:15:45 EDA-253-BackUp02 /usr/sbin/cron[95883]: (operator) CMD (/usr/libexec/save-entropy)
Jan  2 04:24:57 EDA-253-BackUp02 /usr/sbin/cron[99537]: (root) CMD (/usr/libexec/atrun)
Jan  2 04:24:57 EDA-253-BackUp02 /usr/sbin/cron[98962]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:24:57 EDA-253-BackUp02 /usr/sbin/cron[99894]: (operator) CMD (/usr/libexec/save-entropy)
Jan  2 04:24:58 EDA-253-BackUp02 /usr/sbin/cron[46]: (root) CMD (/usr/libexec/atrun)
Jan  2 04:25:00 EDA-253-BackUp02 /usr/sbin/cron[26536]: (root) CMD (/usr/libexec/atrun)
Jan  2 04:25:00 EDA-253-BackUp02 /usr/sbin/cron[26845]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:30:00 EDA-253-BackUp02 /usr/sbin/cron[36717]: (root) CMD (/usr/libexec/atrun)
Jan  2 04:30:00 EDA-253-BackUp02 /usr/sbin/cron[38612]: (root) CMD (/home/backup253/get.sh >/dev/null 2>&1)
Jan  2 04:31:00 EDA-253-BackUp02 /usr/sbin/cron[69126]: (root) CMD (adjkerntz -a)
Jan  2 04:33:00 EDA-253-BackUp02 /usr/sbin/cron[9016]: (operator) CMD (/usr/libexec/save-entropy)
Jan  2 04:35:00 EDA-253-BackUp02 /usr/sbin/cron[54416]: (root) CMD (/usr/libexec/atrun)
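For reference, the delayed jobs above come from the stock /etc/crontab; on a default 14.x install the relevant lines look roughly like this (your /etc/crontab may differ):
Code:
# /etc/crontab (system crontab) - stock schedule for the jobs seen above
*/5     *       *       *       *       root     /usr/libexec/atrun
*/11    *       *       *       *       operator /usr/libexec/save-entropy
1,31    0-5     *       *       *       root     adjkerntz -a
1       3       *       *       *       root     periodic daily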

The only "error" since December 21 (the 14.2 update) that I see is this in /var/log/messages:
Code:
Jan  2 04:24:57 EDA-253-BackUp02 sshd[40124]: error: beginning MaxStartups throttling
Which I interpret as my scripts being queued up. This is the first and only error I see. I could change MaxStartups, but that is not the underlying problem.
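If I ever do want to loosen the throttle, it is a single knob in /etc/ssh/sshd_config (the numbers here are only an example of the start:rate:full format, not a recommendation), followed by a service sshd reload:
Code:
# /etc/ssh/sshd_config - OpenSSH default is 10:30:100
MaxStartups 50:30:200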

I will install test servers today and run cron jobs every 5 and 1 minutes to stress this host and another one, to see if I can reproduce anything. I will also add new cron scripts on the server that write info to local files.

14.1 worked just fine (and 13.x before that), but 14.2 does not. Same configs all year; nothing changed except the 14.2 update. I will dig through every log on the other servers and hosts in case I missed something on the servers updated to 14.2. I have only updated 19 servers so far (normal servers, not running cron every 5 minutes; this was the first one). I will not update any other production servers to 14.2 right now! :/

Could it be vm-bhyve on 14.2? It was updated on December 21. Another host is also updated and I can't see the problem there, but it is not under load. My other vm-bhyve host is still running 14.1, with heavy CPU load (VMs with up to 30 vCPUs) and many VMs with 5-minute cron jobs. No problem there.

But I'm reaching out here in case anyone has the same queue problem and some hints for debugging this. It's a production server, so I can't do anything crazy on it, but I can install test servers (which I will do).

Any hint?
 
Did a very fast check on periodic, but I need to dig into that, as periodic is not my strength ;)
I saw that everything is ="NO" in the config, and there is no periodic background process running (right now).
But I need to go deeper here.
Thanks, will check it.
 
Periodic typically runs in the middle of the night; by default the daily run is scheduled around 3 AM. Check /etc/crontab, you will see the entries there.

Given the "zabbix lost connection" symptom, my guess is that something is ridiculously overloaded. Suggestion: If this happens while you are awake, and you can notice right away: Keep a few windows open on the server in which you are running top, and perhaps vmstat, iostat, or some netstat flavor. If it happens at odd time, then keep "iostat 1 >> /log.me.somewhere" and similarly vmstat 1 and netstat -w 1 -I <interface>. To get date stamps into the files, have another little shell script that does "date >> /log/me/somewhere" every minute or so. Save those files, and then see whether the content shows some interesting changes around the time of the problem.
 
Short version: it may be the network bridge in vm-bhyve on 14.2, since it worked in 14.1.

Long version: it halted again today at 0300. BUT! I finally got some logs.
Code:
Jan 3 03:30:33 EDA-253-BackUp02 kernel: nfsrv_cache_session: no session IPaddr=10.135.18.130, check NFS clients for unique /etc/hostid's
Jan 3 03:30:34 EDA-253-BackUp02 syslogd: last message repeated 2 times


I have a bridge (in rc.conf) on the host (also used as a vm-bhyve switch) connected to a secondary interface on the backup server, which shares NFS (internal) to another server that fetches the files. The backup server halted at 0305 and there are no logs other than the above for 25 minutes. But I also noticed that the other server halted at the same time, so it must be something with the bridge going down, or something that puts these two servers into a halted state. I don't think it's NFS itself, despite what the log says; if the interface is gone, NFS simply isn't working, and the backup server should keep running without NFS rather than halt. There are no other logs, not on the vm-bhyve host either, so I don't think it's the host bridge, but maybe the vm switch bridge (network) in vm-bhyve. The servers halt before any messages reach the logs.
This is strange... It worked on 14.1, so something changed in 14.2.
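As a side note, the nfsrv_cache_session message itself says to check that the NFS clients have unique /etc/hostid's (cloned VMs sometimes share one), so I will verify that on each VM with something like:
Code:
# Should differ between the machines
cat /etc/hostid
sysctl kern.hostid kern.hostuuid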

About date stamps: I already do that inside the backup shell script, so I know when and where it halted. It stops at a random point in the middle, builds up a queue for 25 minutes, and then runs everything afterwards. No indications before or after of what might have caused it. I will set up scripts as you recommend, but I suspect they will halt like everything else. Worth a try. And I don't think I can connect to the server during the halt, as both interfaces are gone, but I may be able to connect from the host via the vm console... at 0300, sigh, I don't want to...

I made a test server on the host with 30 CPUs and 128 GB RAM, an OS disk and a big vdev HDD, like the other big VM servers. Made a cron job every 5 minutes with:
Code:
#!/bin/sh
# Date-stamped file for today's results
File=$(date +%Y%m%d)

sleep 5
# Load CPU, I/O and memory hard for 60 seconds
stress-ng --cpu 30 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 60s
sleep 60
# Filesystem benchmark; append results to today's log
blogbench -i 10 -d /netlogs | tee -a /home/amilis/$File.txt

This loaded all the CPUs to 100% and generated a lot of disk activity. It worked perfectly all night while the other server halted, and the difference here is the bridge.
I found blogbench yesterday; nice test program for loading a server. I like it!

Well... I have to dig into how vm-bhyve handles the bridge on 14.2. It doesn't seem to be the same as in 14.1.
 
An update:

In vm switch list there were 5 bridges, all connected to tap1 and tap3, which go to the two servers. No idea where bridge1|2|3|4 came from, and I don't know if that is the problem. We will see.

Code:
NAME     TYPE      IFACE       ADDRESS  PRIVATE  MTU  VLAN  PORTS
bridge0  standard  vm-bridge0  -        no       -    -     bridge0
bridge1  standard  -           -        no       -    -     -
bridge2  standard  -           -        no       -    -     -
bridge3  standard  -           -        no       -    -     -
bridge4  standard  -           -        no       -    -     -


I tried to destroy the bridges, but it didn't work, so I had to edit /vm/.config/system.conf manually and remove them there. We'll see at 0300 whether that helped.
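For anyone following along, the usual way to remove a vm-bhyve switch is shown below; in my case I ended up removing the stale entries from /vm/.config/system.conf by hand instead.
Code:
# Usual vm-bhyve way to remove a switch (repeat per name)
vm switch destroy bridge1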
 
Ok, have to do an update here.

No problem today so far! So I took the morning to check the logs; nothing special.
I was looking for something bridge-related and remembered the "daily run output" mail, since it contains a "Network interface status" section which lists every interface, including the bridge (connected to the vm-bhyve switch). As these halts come randomly between 0300 and 0430, I checked the timestamp of when this mail arrived at the mail server each day. And it's a match... What a coincidence.

When the servers halted, it was at the same time each day as this daily run output ran. So my suspicion so far is that this script does something with the interfaces (new in 14.2? It worked in 14.1), which causes some problem with the bridge. But the root problem may lie in vm-bhyve (the 14.2 update?), given those bridge1 to bridge4 switches (see above).
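If that suspicion holds, the piece of the daily run producing the "Network interface status" section should be /etc/periodic/daily/420.status-network, so one test (best done in a maintenance window, since it may trigger the hang) is to run just that script by hand with tracing and watch the bridge from another session:
Code:
# Run only the network-status step of the daily run, with shell tracing
sh -x /etc/periodic/daily/420.status-network
# ...and from another session, keep an eye on the bridge
ifconfig bridge0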

So if you have the same problem or similar, here is a hint.
We will see in a few days if it's stable.
 
Ok.. this is annoying!

I disabled the network status check:
Code:
daily_status_network_enable="NO"     
daily_status_network_usedns="NO"    
#daily_status_network_netstat_flags="-d -W"

Then I ran periodic daily manually. At first everything looked good; I got the mail without the network statistics. But directly after that, the server crashed/halted. The SSH connection froze and I was locked out. Tried the vm console; it didn't work. I had to power the VM off and on.

I then disabled everything in periodic daily and ran periodic daily manually: no problem. I removed the # from one row at a time and ran periodic daily manually after each change, all the way to the end of the file... no problem! I changed the cron time from 0300 to 0130, and the first day everything worked. But today I see a halt again.

Question 1: Is there any way I can see/track/log what's happening in periodic (what is running), so I can see where it crashes/halts and remove that part? There are no logs.
Question 2: What can I disable in periodic and still have a stable server?
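One thing I will try for Question 1 (a sketch; the daily_output and daily_show_* knobs are from periodic.conf(5), the log path is my own choice):
Code:
# /etc/periodic.conf: write the daily output to a file instead of mail,
# and show successes, so the last line in the file points at whatever
# script was running when the hang started.
daily_output="/var/log/daily.log"
daily_show_success="YES"
daily_show_info="YES"
Alternatively, since periodic(8) is itself a shell script, running sh -x /usr/sbin/periodic daily > /var/log/periodic-trace.log 2>&1 from a console should show exactly which step it hangs on.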

It's a VM in vm-bhyve with ZFS, as I need full disk encryption on the VM itself as well.
It’s really hard to troubleshoot this one. :(
 