Looking for the reason for sudden very high CPU utilization on a network/ZFS-related process

First a bit of background:

This is on a relatively old machine running FreeBSD 14 (CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz (2593.86-MHz K8-class CPU)), with a 4-port Gbit ethernet interface (bge1: <HP Ethernet 1Gb 4-port 331FLR Adapter>) and 384GB of RAM.
A few ZFS pools sit on 4x 8TB SAS drives in a raidz vdev, hooked to an LSI controller (mps0: <Avago Technologies (LSI) SAS2008>) not running any hardware RAID, direct JBOD mode.

I run the client for the Storj network within a jail. The jail has one of the interfaces dedicated to it. Gateway mode is not enabled, so there is no routing between interfaces.

On the same machine, a Zabbix server runs in its own jail, on its own dedicated interface, with a MySQL instance.
A Zabbix agent also runs on the host. That agent runs on a pool with a mirror of two 300GB drives, hanging off an HP P420i HBA controller set to JBOD mode (no hardware RAID; I describe the process here).

Yes, I know some things are not optimal; I should run the Zabbix server on a separate machine. At least right now it uses the zroot pool, so it should not interfere with the storagenode process.

How it was running until a week ago:

Very low CPU use: top would consistently show 2-6% WCPU on the Storj process (called storagenode), with CPU around 1-2% system, 1-2% user, mostly idle, even when the interface showed 150Mb/s incoming bandwidth and ZFS showed about 15MB/s of writes. Perfectly normal. This went on for weeks without problems.
I could copy files locally with rsync and hit 130MB/s of writes on the (tank) ZFS pool and see no performance hit at all.

How it runs now:

Spikes of CPU load lasting 3 hours or more, during which the storagenode process goes to 600%-1200% WCPU (according to top), the overall load goes up a lot as reported in Zabbix, and the machine feels sluggish overall (terminal interaction over ssh is slow; man pages take a couple of seconds to show any text).
[attached screenshot: Zabbix graph of the CPU load spike]


During these high-load periods, network bandwidth actually goes down (I attribute that to the CPU load making the storagenode too slow to upload/download data in time, so it aborts many transfers, as seen in its logs), and disk activity goes down accordingly.


[attached screenshot: network bandwidth and disk activity graphs during the spike]


Of course, it could be badly designed or unoptimized software; however, what puzzles me and makes me think it might be system-related is this output from top while hitting those high CPU loads:

CPU: 1.2% user, 0.0% nice, 76.9% system, 0.0% interrupt, 21.9% idle
Mem: 1324M Active, 301G Inact, 1395M Laundry, 62G Wired, 225M Buf, 8219M Free
ARC: 36G Total, 13G MFU, 9566M MRU, 37M Anon, 526M Header, 14G Other
     8871M Compressed, 46G Uncompressed, 5.27:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME     WCPU COMMAND
 7390 storagenod  615  68    0  2975M  1138M uwait    9 131.4H 1331.80% storagenode
63355 88           79  20    0  3673M   956M select  11  63.8H  104.52% mysqld
73892 root          1 124    0    51M    35M CPU9     9  42.4H   99.43% find
12195 storagenod   26  99    9  1270M    80M uwait   22  71:46   94.05% storagenode
92235 storagenod   27  98    9  1272M    31M uwait    7 720:18   72.07% storagenode
25940 hostd        35  68    0    54G  3007M uwait    8  34.1H   47.85% hostd
12947 zabbix        1  20    0   132M    17M kqread   7  29:11    1.28% zabbix_server
66512 bruno         1  28    0    14M  3692K CPU0     0   0:01    0.27% top
 7389 root          1  20    0    13M  1656K kqread  22   4:42    0.17% daemon
12993 zabbix        2  20    0   192M    13M kqread  22   3:44    0.04% zabbix_server
12952 zabbix        4  20    0   131M    18M kqread  22   9:04    0.03% zabbix_server
12965 zabbix        1  20    0   128M    20M sbwait  14   2:01    0.02% zabbix_server
11722 zabbix        1   4    0    30M  5460K RUN     22   1:07    0.02% zabbix_agentd
26509 zabbix        1  20    0    27M  6232K nanslp  18   2:18    0.01% zabbix_agentd
11720 zabbix        1  20    0    27M  5472K nanslp  10   1:54    0.01% zabbix_agentd

76.9% system? So whatever is using up all the CPU is actually spending most of its time in the kernel rather than in userland?
Here is top in I/O mode (top -m io):
last pid: 90993;  load averages: 25.82, 20.82, 18.69    up 12+07:54:16  17:56:26
195 processes: 2 running, 193 sleeping
CPU: 2.8% user, 0.0% nice, 61.9% system, 0.0% interrupt, 35.3% idle
Mem: 1378M Active, 304G Inact, 1423M Laundry, 59G Wired, 229M Buf, 8304M Free
ARC: 33G Total, 13G MFU, 6385M MRU, 26M Anon, 396M Header, 13G Other
     5425M Compressed, 23G Uncompressed, 4.26:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse

  PID USERNAME      VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
 7390 storagenod   11479   1648     76    364      0    440  42.07% storagenode
63355 88         3790138    142      4     98      0    102   9.75% mysqld
73892 root           198    333    103      0      0    103   9.85% find
92235 storagenod     393    175      5      0      0      5   0.48% storagenode
25940 hostd         1255     73      2    243      0    245  23.42% hostd
12952 zabbix          62     12      0      0      0      0   0.00% zabbix_server
12965 zabbix          21      0      0      0      0      0   0.00% zabbix_server
 7389 root           145      0      0    150      0    150  14.34% daemon
88576 bruno            2      1      0      0      0      0   0.00% top
11729 zabbix           6      0      0      0      0      0   0.00% zabbix_agentd
12992 zabbix          16      0      0      0      0      0   0.00% zabbix_server
12993 zabbix          12      0      0      0      0      0   0.00% zabbix_server
12962 zabbix          12      0      0      0      0      0   0.00% zabbix_server
12954 zabbix          31      0      0      0      0      0   0.00% zabbix_server
12944 zabbix          19      0      0      0      0      0   0.00% zabbix_server
12963 zabbix          22      0      0      0      0      0   0.00% zabbix_server
12983 zabbix          12      0      0      0      0      0   0.00% zabbix_server

The ioztat output does not show anything abnormal to me:

                         operations    throughput      opsize
dataset                   read  write   read  write   read  write
-----------------------  -----  -----  -----  -----  -----  -----
frontpool                     0      0      0      0      0      0
  data                        0      0      0      0      0      0
tank                          0      0      0      0      0      0
  hostd                       0      0      0      0      0      0
    data                      0      0      0  2.67M      0     4M
    lib                     268     28  1.05M  57.7K     4K  2.01K
  storj                       1    105  19.8K  3.89M  11.9K  38.0K
    databases                 0      0      0      0      0      0
zroot                         0      0      0      0      0      0
  ROOT
    default                   0      0      0      0      0      0
  appjail                     0      0      0      0      0      0
    components                0      0      0      0      0      0
      amd64                   0      0      0      0      0      0
        14.0-RELEASE          0      0      0      0      0      0
          default             0      0      0      0      0      0
    jails                     0      0      0      0      0      0
      hostd                   0      0      0      0      0      0
        jail                  0      0      0      0      0      0
      storj                   0      0      0      0      0      0
        jail                  0     49      0  14.5K      0    302
      zabbix                  0      0      0      0      0      0
        jail                107     90  1.12M  2.02M  10.6K  22.9K
    logs                      0      0      0      0      0      0
      jails                   0      0      0      0      0      0
        hostd                 0      0      0      0      0      0
          console             0      0      0      0      0      0
        storj                 0      0      0      0      0      0
          console             0      0      0      0      0      0
          startup-start       0      0      0      0      0      0
          startup-stop        0      0      0      0      0      0
        testjail              0      0      0      0      0      0
          console             0      0      0      0      0      0
        zabbix                0      0      0      0      0      0
          console             0      0      0      0      0      0
    networks                  0      0      0      0      0      0
      p5jails                 0      0      0      0      0      0
    releases                  0      0      0      0      0      0
      amd64                   0      0      0      0      0      0
        14.0-RELEASE          0      0      0      0      0      0
          default             0      0      0      0      0      0
            release           0      0      0      0      0      0
  home                        0      0      0      0      0      0
  tmp                         0      0      0      0      0      0
  usr
    ports                     0      0      0      0      0      0
    src                       0      0      0      0      0      0
  var
    audit                     0      0      0      0      0      0
    crash                     0      0      0      0      0      0
    log                       0      0      0      0      0      0
    mail                      0      0      0      0      0      0
    tmp                       0      0      0      0      0      0

So, before I pull out the profiler and instrument the storagenode process to find out what it is doing internally, I'd like to make sure it's not something I could change on the system side, which to me sounds probable.

Any hints or advice on how to track this down? What should I look for, any ideas to test?

Thanks for any help
 
You can also trace the syscalls that `storagenode` is making with `truss -p <pid>`.
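For example, something along these lines (a sketch; 7390 is the storagenode PID from the top output above, and note that attaching a tracer briefly stops the target process):

```sh
# Attach to the running storagenode, follow any child processes (-f),
# and print a per-syscall count summary on exit (-c); stop with Ctrl-C.
truss -f -c -p 7390

# Or capture the raw syscall stream to a file for later inspection.
truss -f -o /tmp/storagenode.truss -p 7390
```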

Overall I am looking at those `find` jobs. Are they present and active when the load is normal?
 
Overall I am looking at those `find` jobs. Are they present and active when the load is normal?
Exactly. find can be great at bogging down I/O (and thereby impacting many other things) if it is trawling a large directory tree. There are some fairly standard periodic scripts that use it, as I recall.

ps -Ad can be useful to see process trees, if you need to look at “who launched that”.
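For instance (a sketch; the column selection is just one readable way to do it):

```sh
# List all processes in descendancy (tree) order with parent PIDs,
# so you can see which process launched the long-running find.
ps -Adx -o pid,ppid,user,%cpu,command
```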
 
Thanks for the suggestions; I am still trying to track this down.
I noticed that in top, the storagenode process (when the CPU is pegged) always shows a state of "uwait".
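(One way to see what those threads are actually blocked on, not something from the thread but a standard tool: procstat can dump the kernel stack of every thread in the process.)

```sh
# Kernel stacks for all threads of the storagenode process (PID from top).
# Threads parked in umtx/_umtx_op frames are userland mutex/condvar waits,
# which is what top reports as the "uwait" state.
procstat -kk 7390
```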

I tried truss, but it locked up the process and I had to restart it. The process then ran for 12 hours at low CPU as it used to, then slowly climbed back for 3 hours, until it just exploded to a high CPU load again.

I found this and was going to use dtruss, but it seems to use a probe that no longer exists in FreeBSD 14 (probe description syscall::lstat*:return does not match any probes). I'm thinking of changing this, but dtrace -l | grep lstat shows only compat*.lstat probes... what would be the equivalent for FreeBSD 14?
BTW, it is too bad the DTrace toolkit is not maintained anymore; there are lots of good tools in there.
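(As a side note, on recent FreeBSD the lstat() libc call is, as far as I know, serviced via the fstatat syscall, so syscall::fstatat:* is probably the probe dtruss would need; that is an assumption, not something confirmed in this thread. If the goal is just to see which syscalls dominate, one can skip the lstat-specific probe entirely and aggregate everything the process does, for example:)

```sh
# Count syscalls made by storagenode, aggregated by syscall name;
# the summary is printed when you hit Ctrl-C.
dtrace -n 'syscall:::entry /execname == "storagenode"/ { @counts[probefunc] = count(); }'
```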

Overall I am looking at those `find` jobs. Are they present and active when the load is normal?
not sure, I did not pay attention to it before.
That find command is rather strange; this is what `ps -l -p 73892 -ww` outputs:

UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TT       TIME COMMAND
  0 73892 73887 15  68  0 52692 32772 zio->io_ D    -  4241:22.69 / /usr/local/appjail /frontpool /var/mail /usr/src /home /mnt/data1 /zroot /usr/local/appjail/cache/components /usr/local/appjail/releases /usr/local/appjail/networks /usr/local/appjail/jails /var/log/appjail /usr/local/appjail/networks/p5jails /usr/local/appjail/releases/amd64 /usr/local/appjail/cache/components/amd64 /var/log/appjail/jails /usr/local/appjail/cache/components/amd64/14.0-RELEASE /usr/local/appjail/jails/storj /var/log/appjail/jails/storj /usr/local/appjail/cache/components/amd64/14.0-RELEASE/default /var/log/appjail/jails/testjail /usr/local/appjail/releases/amd64/14.0-RELEASE /var/log/appjail/jails/storj/console /var/log/appjail/jails/storj/startup-stop /usr/local/appjail/jails/storj/jail /usr/local/appjail/releases/amd64/14.0-RELEASE/default /var/log/appjail/jails/testjail/console /usr/local/appjail/releases/amd64/14.0-RELEASE/default/release /var/log/appjail/jails/storj/startup-start /tank /tank/storj /tank/hostd /usr/local/appjail/jails/hostd /usr/local/appjail/jails/hostd/jail /var/log/appjail/jails/hostd /var/log/appjail/jails/hostd/console /tank/hostd/data /tank/hostd/lib /var/log/appjail/jails/zabbix /var/log/appjail/jails/zabbix/console /usr/local/appjail/jails/zabbix /usr/local/appjail/jails/zabbix/jail /dev/null (find)

I see in the process tree that it is coming from cron, as part of the periodic daily security scripts. It looks like those scripts traverse all pools and all files??
(/bin/sh - /etc/periodic/security/100.chksetuid)
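(For reference, a sketch of how these scans can be toggled off in /etc/periodic.conf rather than by editing the scripts or killing the process; the knob names below are as found in /etc/defaults/periodic.conf, verify them against your release:)

```sh
# /etc/periodic.conf -- overrides for the daily security run
daily_status_security_enable="YES"       # keep the daily security run itself
security_status_chksetuid_enable="NO"    # skip the setuid 'find' over all filesystems (100.chksetuid)
security_status_neggrpperm_enable="NO"   # skip the negative-group-permission 'find' (110.neggrpperm)
```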
 
OK, so that find was what triggered the high CPU load.
The daily periodic security check traverses all filesystems and pegs the disks. That specific instance had been running for 2 days (for a daily job, hahaha).
Killing that process reverts CPU usage to normal.

With that said, I do not view this as the root cause. I think the storagenode process has a bug, because it seems to busy-wait when starved of I/O.
I'll dig a bit more into that and report it.

Thanks everyone for the hints and tips !
 
Check your zpool iostat and look at how many I/O operations per second you get. For a business-critical 3.5" 7200 rpm SAS disk, the upper limit is around 140-150 IOPS at 8K per command with a queue depth of 4; anything above that and the disk just can't keep up. For mission-critical 2.5" SAS disks, the IOPS limit is around 200-250 for 10k rpm and about 300 for 15k rpm.
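A quick way to check that against the actual load (a sketch, assuming the pool is named tank as in this thread):

```sh
# Per-vdev and per-disk operations (IOPS) and bandwidth, refreshed every
# 5 seconds; compare the per-disk read+write ops against the limits above.
zpool iostat -v tank 5
```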
 
OK, well, no victory lap yet.
Although disabling those security scripts had an immediate positive effect, it turns out the storagenode process also spawns a process that does exactly what those scripts did: traverse the entire filesystem to do some cleanup. That can't be disabled. Clearly the root cause should be tackled. At least I am pretty confident I have identified the real problem, and it can probably be fixed.
 
For the periodic scans, at least some of them skip mounts with, for example, noexec or nosuid flags. Consider setting these on large “data” filesystems when appropriate to cut down the load.
 
Check your zpool iostat and look at how many I/O operations per second you get. For a business-critical 3.5" 7200 rpm SAS disk, the upper limit is around 140-150 IOPS at 8K per command with a queue depth of 4; anything above that and the disk just can't keep up. For mission-critical 2.5" SAS disks, the IOPS limit is around 200-250 for 10k rpm and about 300 for 15k rpm.

They are 3.5" SAS 7200 rpm drives.

bruno@proliant5:~ $ zpool iostat -lvy
                         capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                    alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
----------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
tank                    20.7T  8.46T    527      0  8.25M      0   18ms      -   18ms      -    3us      -      -      -      -      -      -
  raidz1-0              20.7T  8.46T    792      0  9.29M      0   16ms      -   16ms      -    2us      -      -      -      -      -      -
    gpt/slot0-7SGV590G      -      -    264      0  1.03M      0   12ms      -   12ms      -  768ns      -      -      -      -      -      -
    gpt/slot2-7SG1A8RR      -      -    265      0  1.04M      0   25ms      -   25ms      -    6us      -      -      -      -      -      -
    gpt/slot4-7SG1APWR      -      -    265      0  7.26M      0   12ms      -   12ms      -  384ns      -      -      -      -      -      -
    gpt/slot6-7SG19LRR      -      -      0      0      0      0      -      -      -      -      -      -      -      -      -      -      -
----------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

I checked before that storagenode does a lot of 16KB operations.
 
For the periodic scans, at least some of them skip mounts with, for example, noexec or nosuid flags. Consider setting these on large “data” filesystems when appropriate to cut down the load.
Sounds good; how are they set on ZFS datasets?
Ah, I see, they are ZFS properties.
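For example (a sketch; tank/storj is the dataset name used earlier in the thread, so adjust to your layout, and note that exec=off prevents running binaries from that dataset, which should be fine for a pure data filesystem):

```sh
# Mount the Storj data dataset nosuid/noexec so the find-based periodic
# security scans skip it.
zfs set setuid=off exec=off tank/storj

# Confirm the properties and the resulting mount flags.
zfs get setuid,exec tank/storj
mount | grep tank/storj
```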
 
Try gstat -pdo. This will show disk-level stats, including busy %.
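For example (a sketch; -I just slows the refresh so the numbers are easier to read):

```sh
# Physical disks only (-p), including delete (-d) and other (-o) operations,
# refreshed every 5 seconds; watch the %busy and ms/r columns during a spike.
gstat -pdo -I 5s
```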

Traversing filesystems for cleanup sounds like an occasional task. There may be a tunable for how aggressive it is during cleanup; otherwise, if it just tries to go "as fast as possible", it should be no surprise that it impacts other tasks.
 