Solved: Excessive disk i/o consumption (pkg)

tl;dr: On a heavily loaded system, the daily run of 110.neggrpperm by periodic(8) in multiple jails generated excessive disk i/o and brought the host system to a crawl. Disabling neggrpperm in the jails and restricting its run on the host to monthly resolved the issue.


OP

Recently (the last two or three months) one of our server hosts has been reporting excessive disk i/o on a recurring basis:
Code:
# gstat -I5s | sort -rn -k9 | head  ### show disc i/o busy
   12     97      9     36   88.7     80   1526   48.0   95.9  ada3p3
   12     97      9     36   88.7     80   1526   48.0   95.9  ada3
   11     94      6     25   77.6     80   1576   41.9   93.0  ada2p3
   11     94      6     25   77.6     80   1576   41.9   93.0  ada2
    7     96      7     28   61.7     81   1613   38.3   83.9  ada1p3
    7     96      7     28   61.7     81   1613   38.3   83.9  ada1
    4     97      8     30   65.6     82   1626   38.9   83.7  ada0p3
    4     97      8     30   65.6     82   1626   38.9   83.7  ada0
dT: 5.019s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
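For reference, here is what each stage of that pipeline does (the header rows sort to the bottom because sort(1) compares their non-numeric fields as zero):
Code:
# -I5s            sample GEOM statistics over a 5 second interval
# sort -rn -k9    reverse numeric sort on field 9, the %busy column;
#                 the non-numeric header lines compare as zero, so
#                 they land at the bottom of the listing
# head            keep the ten busiest lines
gstat -I5s | sort -rn -k9 | head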
This seems to be related to this process:
Code:
root       40688   0.0  0.0   18400    8592  -  DJ   08:09         0:18.56 find -sx / /dev/null ( ! -fstype local ) -prune -o -type f ( ( ! -perm +010 -and -perm +001 ) -or ( ! -perm +020 -and -perm +002 ) -or ( ! -perm +040 -and -perm +004 ) ) -exec ls -liTd {} +

This is running in a FreeBSD-13.2 jail that hosts our Samba AD-DC.
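For context, an annotated reading of that find(1) invocation (my paraphrase, not the literal script text):
Code:
# ( ! -fstype local ) -prune     skip non-local filesystems
# -type f                        regular files only
# ! -perm +010 -and -perm +001   other may execute, group may not
# ! -perm +020 -and -perm +002   other may write,   group may not
# ! -perm +040 -and -perm +004   other may read,    group may not
# -exec ls -liTd {} +            list every match
#
# In short: it stats every regular file on the local filesystems
# looking for "negative group permissions" -- files whose "other"
# bits grant something the "group" bits deny.  That full-tree walk
# is what makes the run so read-heavy.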

Searching on this, the references I found suggest that the underlying issue lies with the periodic scripts related to pkg. Searching for the parent shows this:
Code:
# ps -o ppid= -p 40688
40328
# ps -auwwx | grep 40328
root       40328   0.0  0.0   13580    2440  -  IJ   08:09         0:00.00 /bin/sh - /etc/periodic/security/110.neggrpperm
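A small loop along these lines will climb the whole parent chain in one go (a sketch; 40688 is just the PID observed above):
Code:
#!/bin/sh
# Walk up the parent chain from a PID until we reach init (PID 1).
pid=40688                        # the find process observed above
while [ "${pid:-0}" -gt 1 ]; do
    ps -o pid=,ppid=,command= -p "$pid"
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
done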

Finding and killing every running 110.neggrpperm dropped the disk i/o load somewhat, but not by much.

However, various pkg processes respawn thereafter. It seems that the issue lies with the number of jails running on that host. We are in the process of migrating to 14.1, and while we resolve issues with various packages in the converted jails we are keeping the services active on the host experiencing the i/o issue. My question is really: how much of the pkg activity in the jails is actually necessary?

man periodic says that periodic jobs are scheduled in the root crontab, but I cannot find any such entries there. Where are the schedules for the periodic runs kept?
 
Hmm... On 14.1 there is another issue with the periodic pkg job: while it verifies checksums, a server with jails can become unavailable. Putting security_status_pkg_checksum_enable="NO" into /usr/local/etc/periodic.conf solves the problem. /etc/defaults/periodic.conf lists all the available parameters.
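That is, something like this (using the variable named above; the defaults file documents the rest):
Code:
# /usr/local/etc/periodic.conf -- local overrides; the full list of
# knobs and their defaults is in /etc/defaults/periodic.conf.
# Skip the periodic pkg checksum verification, which reads back the
# files of every installed package:
security_status_pkg_checksum_enable="NO"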
 
A server with jails can become unavailable.
If you have a bunch of jails, stagger the exact timing of periodic(8) a bit, so they don't all run the periodic scripts at the exact same time and overwhelm the host.

man periodic says that periodic jobs are scheduled in the root crontab, but I cannot find any such entries there.
/etc/crontab:
Code:
# Perform daily/weekly/monthly maintenance.
1   3   *   *   *   root    periodic daily
15  4   *   *   6   root    periodic weekly
30  5   1   *   *   root    periodic monthly
 
Actually, that has already been done, whether by myself at some point so long ago that I forget, or by the system itself. In any case, /etc/crontab in the jails looks similar to this (the days and hours also vary between jails):
Code:
# Perform daily/weekly/monthly maintenance.
1   3   * * *     root  sleep $(jot -r 1 30 300) ; periodic daily
15  23  * * 5     root  sleep $(jot -r 1 30 300) ; periodic weekly
30  5  15 * *     root  sleep $(jot -r 1 30 300) ; periodic monthly
#
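For anyone unfamiliar with the idiom, the jot(1) call is what supplies the random delay:
Code:
# Print one (-r: random) integer in the range 30..300:
jot -r 1 30 300
# Prefixing periodic(8) with "sleep $(jot -r 1 30 300)" therefore
# delays each run by a random 30 seconds to 5 minutes, so the jails
# do not all hit the disks at once.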

The real problem is that I had to move too much stuff off of one host onto the other to accommodate the extreme increase in upgrade time when moving a jail from 13.2 to 14.1. I am trying to get this balanced again. Everything is supposed to move onto new hardware, but the vendor shipped units with half the ordered memory and CPUs, so that is delayed.
 
You may use idprio(1) to run periodic from /etc/crontab
This won't "slow down" periodic to make it more "friendly" to disks. It only puts periodic at the bottom of the dispatch queue. If nothing else is running the idprio processes will take whatever resources they need.
 
This won't "slow down" periodic to make it more "friendly" to disks. It only puts periodic at the bottom of the dispatch queue. If nothing else is running the idprio processes will take whatever resources they need.
I didn't try it with periodic, but it was effective when I had to run poudriere(8) on my desktop computer. The desktop stalled for moments during intensive disk i/o; idprio solved this. The filesystem used was UFS.
 
Should /etc/crontab look somewhat like this? What values other than 31 are appropriate?
Code:
# Perform daily/weekly/monthly maintenance.
1   3   * * *     root  sleep $(jot -r 1 30 300) ; idprio 31 periodic daily
15  23  * * 5     root  sleep $(jot -r 1 30 300) ; idprio 31 periodic weekly
30  5  15 * *     root  sleep $(jot -r 1 30 300) ; idprio 31 periodic monthly
#
 
In each of my jails' /etc/rc.conf I have cron_flags="-J 60", but your /etc/crontab seems OK.

idprio(1) says the priority can be between 0 and RTP_PRIO_MAX (usually 31). I usually use 31, but you can decrease it. Start with 31 to see whether you obtain immediate results.
Is your system running with a UFS filesystem? It works for me with UFS but, if I remember correctly, not so well with ZFS.
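For reference, that flag is cron(8)'s jitter option for root jobs; the rc.conf entry looks like this:
Code:
# /etc/rc.conf in each jail: -J makes cron(8) wait a random 0-60
# seconds before starting root's jobs, so several jails don't fire
# their crontab entries at the same instant.
cron_flags="-J 60"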
 
Everything here is root on ZFS.

In addition to the other changes, I simply moved neggrpperm from daily to monthly. Getting it out of the daily run has reduced peak i/o from ~98% busy to ~52%. This is enough of an improvement that I can get back to the task of moving things off that host.

None of the jails on that host nor the host itself have user accounts so the risk resulting from reducing the frequency is minimal.

Actually, the value of running neggrpperm at all in the jails seems suspect: the script that runs on the host checks the same directories in every jail anyway.
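For completeness, the knobs involved live in periodic.conf; a sketch, assuming the security_status_neggrpperm_* variable names from /etc/defaults/periodic.conf:
Code:
# In each jail's /usr/local/etc/periodic.conf: skip the check
# entirely, since the host's run traverses the jail trees anyway.
security_status_neggrpperm_enable="NO"

# In the host's /usr/local/etc/periodic.conf: keep the check but
# run it monthly rather than daily.
security_status_neggrpperm_period="monthly"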
 
The i/o load on that host now looks like this:
Code:
# gstat -I5s | sort -rn -k9 | head  ### show disc i/o busy
    0     39      0      0    0.0     35    776    1.8    6.1  ada1p3
    0     39      0      0    0.0     35    776    1.8    6.1  ada1
    0     40      0      0    0.0     36    793    1.5    5.9  ada3p3
    0     40      0      0    0.0     36    793    1.5    5.9  ada3
    0     41      0      0    0.0     37    786    1.7    4.7  ada0p3
    0     41      0      0    0.0     37    786    1.7    4.7  ada0
    0     39      0      0    0.0     35    777    1.5    4.7  ada2p3
    0     39      0      0    0.0     35    777    1.5    4.7  ada2
dT: 5.062s  w: 5.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name

Evidently 110.neggrpperm is a very, very i/o-hungry script.
 