First a bit of background:
This is a relatively older machine running FreeBSD 14 (CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz (2593.86-MHz K8-class CPU)), with a 4-port Gbit Ethernet interface (bge1: <HP Ethernet 1Gb 4-port 331FLR Adapter>) and 384GB of RAM.
A few ZFS pools sit on 4x 8TB SAS drives in a raidz vdev, hooked to an LSI controller (mps0: <Avago Technologies (LSI) SAS2008>) not running any hardware RAID, just direct JBOD mode.
I run the client for the Storj network within a jail. The jail has one of the interfaces dedicated to it. Gateway mode is not enabled, so there is no routing between interfaces.
On the same machine, a Zabbix server runs in its own jail, on its own dedicated interface, with a MySQL instance.
A Zabbix agent also runs on the host. That agent runs on a pool made of a mirror of two 300GB drives, hanging off an HP P420i HBA controller set to JBOD mode (no hardware RAID; I describe the process here).
Yes, I know some things are not optimal and I should run the Zabbix server on a separate machine, but at least right now it uses the zroot pool, so it should not interfere with the storagenode process.
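(To be explicit about what I mean by "gateway mode is not enabled" above: the host is not forwarding packets between the interfaces, i.e. the forwarding knobs in rc.conf are left at their defaults. Spelled out, only for clarity, that is the stock setting:)

gateway_enable="NO"        # no IPv4 forwarding between interfaces
ipv6_gateway_enable="NO"   # same for IPv6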
How it was running until a week ago:
Very low CPU use: top would consistently show 2-6% WCPU on the Storj process (called storagenode), and the overall CPU line showed ~1-2% system, ~1-2% user, mostly idle, even when the interface showed 150Mb/s of incoming bandwidth and ZFS showed about 15MB/s of writes. Perfectly normal. This went on for weeks without a problem.
I could copy files locally with rsync, hit 130MB/s of writes on the tank ZFS pool, and see no performance hit at all.
How it runs now:
Spikes of CPU load lasting 3 hours or more, during which the storagenode process goes to 600%-1200% WCPU (according to top), the overall load goes up a lot as reported in Zabbix, and the whole machine feels sluggish (terminal interaction over SSH is slow; man pages take a couple of seconds to show any text).
During these high-load periods, network bandwidth actually goes down (I attribute that to the CPU load making storagenode too slow to upload/download data in time, so it aborts many transfers, as seen in its logs), and disk activity goes down accordingly.
Of course, it could just be badly designed or unoptimized software; however, what puzzles me and makes me think it might be system related is this output from top while hitting those high CPU loads:
CPU: 1.2% user, 0.0% nice, 76.9% system, 0.0% interrupt, 21.9% idle
Mem: 1324M Active, 301G Inact, 1395M Laundry, 62G Wired, 225M Buf, 8219M Free
ARC: 36G Total, 13G MFU, 9566M MRU, 37M Anon, 526M Header, 14G Other
8871M Compressed, 46G Uncompressed, 5.27:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
7390 storagenod 615 68 0 2975M 1138M uwait 9 131.4H 1331.80% storagenode
63355 88 79 20 0 3673M 956M select 11 63.8H 104.52% mysqld
73892 root 1 124 0 51M 35M CPU9 9 42.4H 99.43% find
12195 storagenod 26 99 9 1270M 80M uwait 22 71:46 94.05% storagenode
92235 storagenod 27 98 9 1272M 31M uwait 7 720:18 72.07% storagenode
25940 hostd 35 68 0 54G 3007M uwait 8 34.1H 47.85% hostd
12947 zabbix 1 20 0 132M 17M kqread 7 29:11 1.28% zabbix_server
66512 bruno 1 28 0 14M 3692K CPU0 0 0:01 0.27% top
7389 root 1 20 0 13M 1656K kqread 22 4:42 0.17% daemon
12993 zabbix 2 20 0 192M 13M kqread 22 3:44 0.04% zabbix_server
12952 zabbix 4 20 0 131M 18M kqread 22 9:04 0.03% zabbix_server
12965 zabbix 1 20 0 128M 20M sbwait 14 2:01 0.02% zabbix_server
11722 zabbix 1 4 0 30M 5460K RUN 22 1:07 0.02% zabbix_agentd
26509 zabbix 1 20 0 27M 6232K nanslp 18 2:18 0.01% zabbix_agentd
11720 zabbix 1 20 0 27M 5472K nanslp 10 1:54 0.01% zabbix_agentd
76.9% system? Does that mean whatever is using up all that CPU is actually spending its time in the kernel (system calls) rather than doing its own work?
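(In case it helps with suggestions: my rough plan for seeing where that system time goes was a quick kernel stack sample with DTrace, plus a look at the kernel stacks of the storagenode threads; I have not run these yet, so treat them as a sketch rather than results:)

dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }'   # sample on-CPU kernel stacks for 30s
procstat -kk 7390                                                                # kernel stacks of the storagenode threads (PID from the top output above)

If the hot kernel stacks all end up in ZFS or in the network stack, that alone should narrow things down.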
Here is top in I/O mode (top -m io):
last pid: 90993; load averages: 25.82, 20.82, 18.69 up 12+07:54:16 17:56:26
195 processes: 2 running, 193 sleeping
CPU: 2.8% user, 0.0% nice, 61.9% system, 0.0% interrupt, 35.3% idle
Mem: 1378M Active, 304G Inact, 1423M Laundry, 59G Wired, 229M Buf, 8304M Free
ARC: 33G Total, 13G MFU, 6385M MRU, 26M Anon, 396M Header, 13G Other
5425M Compressed, 23G Uncompressed, 4.26:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse
PID USERNAME VCSW IVCSW READ WRITE FAULT TOTAL PERCENT COMMAND
7390 storagenod 11479 1648 76 364 0 440 42.07% storagenode
63355 88 3790138 142 4 98 0 102 9.75% mysqld
73892 root 198 333 103 0 0 103 9.85% find
92235 storagenod 393 175 5 0 0 5 0.48% storagenode
25940 hostd 1255 73 2 243 0 245 23.42% hostd
12952 zabbix 62 12 0 0 0 0 0.00% zabbix_server
12965 zabbix 21 0 0 0 0 0 0.00% zabbix_server
7389 root 145 0 0 150 0 150 14.34% daemon
88576 bruno 2 1 0 0 0 0 0.00% top
11729 zabbix 6 0 0 0 0 0 0.00% zabbix_agentd
12992 zabbix 16 0 0 0 0 0 0.00% zabbix_server
12993 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12962 zabbix 12 0 0 0 0 0 0.00% zabbix_server
12954 zabbix 31 0 0 0 0 0 0.00% zabbix_server
12944 zabbix 19 0 0 0 0 0 0.00% zabbix_server
12963 zabbix 22 0 0 0 0 0 0.00% zabbix_server
12983 zabbix 12 0 0 0 0 0 0.00% zabbix_server
The ioztat output does not show anything abnormal to me:
operations throughput opsize
dataset read write read write read write
----------------------- ----- ----- ----- ----- ----- -----
frontpool 0 0 0 0 0 0
data 0 0 0 0 0 0
tank 0 0 0 0 0 0
hostd 0 0 0 0 0 0
data 0 0 0 2.67M 0 4M
lib 268 28 1.05M 57.7K 4K 2.01K
storj 1 105 19.8K 3.89M 11.9K 38.0K
databases 0 0 0 0 0 0
zroot 0 0 0 0 0 0
ROOT
default 0 0 0 0 0 0
appjail 0 0 0 0 0 0
components 0 0 0 0 0 0
amd64 0 0 0 0 0 0
14.0-RELEASE 0 0 0 0 0 0
default 0 0 0 0 0 0
jails 0 0 0 0 0 0
hostd 0 0 0 0 0 0
jail 0 0 0 0 0 0
storj 0 0 0 0 0 0
jail 0 49 0 14.5K 0 302
zabbix 0 0 0 0 0 0
jail 107 90 1.12M 2.02M 10.6K 22.9K
logs 0 0 0 0 0 0
jails 0 0 0 0 0 0
hostd 0 0 0 0 0 0
console 0 0 0 0 0 0
storj 0 0 0 0 0 0
console 0 0 0 0 0 0
startup-start 0 0 0 0 0 0
startup-stop 0 0 0 0 0 0
testjail 0 0 0 0 0 0
console 0 0 0 0 0 0
zabbix 0 0 0 0 0 0
console 0 0 0 0 0 0
networks 0 0 0 0 0 0
p5jails 0 0 0 0 0 0
releases 0 0 0 0 0 0
amd64 0 0 0 0 0 0
14.0-RELEASE 0 0 0 0 0 0
default 0 0 0 0 0 0
release 0 0 0 0 0 0
home 0 0 0 0 0 0
tmp 0 0 0 0 0 0
usr
ports 0 0 0 0 0 0
src 0 0 0 0 0 0
var
audit 0 0 0 0 0 0
crash 0 0 0 0 0 0
log 0 0 0 0 0 0
mail 0 0 0 0 0 0
tmp 0 0 0 0 0 0
So, before I pull out the profiler and instrument the storagenode process to find out what it is doing internally, I'd like to make sure it's not something I could change on the system side, which to me sounds probable.
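For reference, these are the system-level things I was planning to look at during the next spike (rough list; the exact invocations may need adjusting):

vmstat -w 5                                               # faults, context switches, syscall rate
gstat -p                                                  # per-disk busy% and latency
zpool iostat -v tank 5                                    # per-vdev latency on the pool the storj datasets live on
sysctl kstat.zfs.misc.arcstats | grep -E 'hits|misses'    # ARC hit/miss behaviour
truss -c -p 7390                                          # syscall summary for the storagenode PID (probably heavy on a process with 600+ threads)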
Any hints or advice on how to track this down? What should I look for, what could I test?
Thanks for any help!