Solved: Excessive disk i/o consumption (pkg)

tl;dr: On a heavily loaded system, the daily run of 110.neggrpperm by periodic(8) in multiple jails generated excessive disk i/o and brought the host system to a crawl. Disabling neggrpperm in the jails and restricting its run on the host to monthly resolved the issue.


OP

Recently (the last two or three months) one of our server hosts has been reporting excessive disk i/o on a recurring basis:
Code:
# gstat -I5s | sort -rn -k9 | head  ### show disc i/o busy
   12     97      9     36   88.7     80   1526   48.0   95.9  ada3p3
   12     97      9     36   88.7     80   1526   48.0   95.9  ada3
   11     94      6     25   77.6     80   1576   41.9   93.0  ada2p3
   11     94      6     25   77.6     80   1576   41.9   93.0  ada2
    7     96      7     28   61.7     81   1613   38.3   83.9  ada1p3
    7     96      7     28   61.7     81   1613   38.3   83.9  ada1
    4     97      8     30   65.6     82   1626   38.9   83.7  ada0p3
    4     97      8     30   65.6     82   1626   38.9   83.7  ada0
dT: 5.019s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
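For reference, here is what each stage of that pipeline does (the header rows sort to the bottom because sort(1) compares their non-numeric fields as zero):
Code:
# -I5s            sample GEOM statistics over a 5 second interval
# sort -rn -k9    reverse numeric sort on field 9, the %busy column;
#                 the non-numeric header lines compare as zero, so
#                 they land at the bottom of the listing
# head            keep the ten busiest lines
gstat -I5s | sort -rn -k9 | head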
This seems to be related to this process:
Code:
root       40688   0.0  0.0   18400    8592  -  DJ   08:09         0:18.56 find -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( ( ! -perm +010 -and -perm +001 ) -or ( ! -perm +020 -and -perm +002 ) -or ( ! -perm +040 -and -perm +004 ) ) -exec ls -liTd {} +

This is running in a FreeBSD-13.2 jail that hosts our Samba AD-DC.
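For context, an annotated reading of that find(1) invocation (my paraphrase, not the literal script text):
Code:
# ( ! -fstype local ) -prune     skip non-local filesystems
# -type f                        regular files only
# ! -perm +010 -and -perm +001   other may execute, group may not
# ! -perm +020 -and -perm +002   other may write,   group may not
# ! -perm +040 -and -perm +004   other may read,    group may not
# -exec ls -liTd {} +            list every match
#
# In short: it stats every regular file on the local filesystems
# looking for "negative group permissions" -- files whose "other"
# bits grant something the "group" bits deny.  That full-tree walk
# is what makes the run so read-heavy.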

Searching on this, the references I found suggest that the underlying issue lies with the periodic scripts related to pkg. Searching for the parent shows this:
Code:
# ps -o ppid= -p 40688
40328
# ps -auwwx | grep 40328
root       40328   0.0  0.0   13580    2440  -  IJ   08:09         0:00.00 /bin/sh - /etc/periodic/security/110.neggrpperm
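A small loop along these lines will climb the whole parent chain in one go (a sketch; 40688 is just the PID observed above):
Code:
#!/bin/sh
# Walk up the parent chain from a PID until we reach init (PID 1).
pid=40688                        # the find process observed above
while [ "${pid:-0}" -gt 1 ]; do
    ps -o pid=,ppid=,command= -p "$pid"
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
done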

Finding and killing every running 110.neggrpperm dropped the disk i/o load somewhat, but not by much.

However, various pkg processes respawn thereafter. It seems that the issue lies with the number of jails running on that host. We are in the process of migrating to 14.1, and while we resolve issues with various packages in the converted jails we are keeping the services active on the host experiencing the i/o issue. My question is really: how much of the pkg activity in the jails is actually necessary?

man periodic says that periodic jobs are scheduled in the root crontab, but I cannot find any such entries there. Where are the schedules for the periodic runs kept?
 
Hmm... On 14.1 there is another issue with the periodic pkg job: while it verifies checksums, a server with jails can become unavailable. Putting security_status_pkg_checksum_enable="NO" into /usr/local/etc/periodic.conf solves the problem. /etc/defaults/periodic.conf lists all the available parameters.
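That is, something like this (using the variable named above; the defaults file documents the rest):
Code:
# /usr/local/etc/periodic.conf -- local overrides; the full list of
# knobs and their defaults is in /etc/defaults/periodic.conf.
# Skip the periodic pkg checksum verification, which reads back the
# files of every installed package:
security_status_pkg_checksum_enable="NO"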
 
A server with jails can become unavailable.
If you have a bunch of jails, stagger the exact timing of periodic(8) a bit, so they don't all run the periodic scripts at the exact same time and overwhelm the host.

man periodic says that periodic jobs are scheduled in the root crontab, but I cannot find any such entries there.
/etc/crontab:
Code:
# Perform daily/weekly/monthly maintenance.
1   3   *   *   *   root    periodic daily
15  4   *   *   6   root    periodic weekly
30  5   1   *   *   root    periodic monthly
 
Actually, that has already been done, whether by myself at some point so long ago that I forget, or by the system itself. In any case, /etc/crontab in the jails looks similar to this (the days and hours also vary between jails):
Code:
# Perform daily/weekly/monthly maintenance.
1   3   * * *     root  sleep $(jot -r 1 30 300) ; periodic daily
15  23  * * 5     root  sleep $(jot -r 1 30 300) ; periodic weekly
30  5  15 * *     root  sleep $(jot -r 1 30 300) ; periodic monthly
#
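For anyone unfamiliar with the idiom, the jot(1) call is what supplies the random delay:
Code:
# Print one (-r: random) integer in the range 30..300:
jot -r 1 30 300
# Prefixing periodic(8) with "sleep $(jot -r 1 30 300)" therefore
# delays each run by a random 30 seconds to 5 minutes, so the jails
# do not all hit the disks at once.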

The real problem is that I had to move too much stuff off of one host onto the other to accommodate the extreme increase in upgrade time when moving a jail from 13.2 to 14.1. I am trying to get this balanced again. Everything is supposed to move onto new hardware, but the vendor shipped units with half the ordered memory and CPUs, so that is delayed.
 
You may use idprio(1) to run periodic from /etc/crontab
This won't "slow down" periodic to make it more "friendly" to disks. It only puts periodic at the bottom of the dispatch queue. If nothing else is running the idprio processes will take whatever resources they need.
 
This won't "slow down" periodic to make it more "friendly" to disks. It only puts periodic at the bottom of the dispatch queue. If nothing else is running the idprio processes will take whatever resources they need.
I didn't try it with periodic, but it was effective when I had to run poudriere(8) on my desktop computer. The desktop stalled for moments during intensive disk i/o; idprio solved this. The filesystem used was UFS.
 
Should /etc/crontab look somewhat like this? What values other than 31 are appropriate?
Code:
# Perform daily/weekly/monthly maintenance.
1   3   * * *     root  sleep $(jot -r 1 30 300) ; idprio 31 periodic daily
15  23  * * 5     root  sleep $(jot -r 1 30 300) ; idprio 31 periodic weekly
30  5  15 * *     root  sleep $(jot -r 1 30 300) ; idprio 31 periodic monthly
#
 
In each of my jails' /etc/rc.conf I have cron_flags="-J 60", but your /etc/crontab seems OK.

idprio(1) says the priority can be between 0 and RTP_PRIO_MAX (usually 31). I usually use 31, but you can decrease it. Start with 31 to see whether you obtain immediate results.
Is your system running with a UFS filesystem? It works for me with UFS but, if I remember correctly, not so well with ZFS.
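For reference, that flag is cron(8)'s jitter option for root jobs; the rc.conf entry looks like this:
Code:
# /etc/rc.conf in each jail: -J makes cron(8) wait a random 0-60
# seconds before starting root's jobs, so several jails don't fire
# their crontab entries at the same instant.
cron_flags="-J 60"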
 
Everything here is root on ZFS.

In addition to the other changes, I simply moved neggrpperm from daily to monthly. Getting it out of the daily run has reduced peak i/o from ~98% busy to ~52%. This is enough of an improvement that I can get back to the task of moving things off that host.

None of the jails on that host nor the host itself have user accounts so the risk resulting from reducing the frequency is minimal.

Actually, the value of running neggrpperm at all in the jails seems suspect: the script that runs on the host checks the same directories in every jail anyway.
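For completeness, the knobs involved live in periodic.conf; a sketch, assuming the security_status_neggrpperm_* variable names from /etc/defaults/periodic.conf:
Code:
# In each jail's /usr/local/etc/periodic.conf: skip the check
# entirely, since the host's run traverses the jail trees anyway.
security_status_neggrpperm_enable="NO"

# In the host's /usr/local/etc/periodic.conf: keep the check but
# run it monthly rather than daily.
security_status_neggrpperm_period="monthly"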
 
The i/o load on that host now looks like this:
Code:
# gstat -I5s | sort -rn -k9 | head  ### show disc i/o busy
    0     39      0      0    0.0     35    776    1.8    6.1  ada1p3
    0     39      0      0    0.0     35    776    1.8    6.1  ada1
    0     40      0      0    0.0     36    793    1.5    5.9  ada3p3
    0     40      0      0    0.0     36    793    1.5    5.9  ada3
    0     41      0      0    0.0     37    786    1.7    4.7  ada0p3
    0     41      0      0    0.0     37    786    1.7    4.7  ada0
    0     39      0      0    0.0     35    777    1.5    4.7  ada2p3
    0     39      0      0    0.0     35    777    1.5    4.7  ada2
dT: 5.062s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name

Evidently 110.neggrpperm is a very, very i/o-hungry script.
 