General system slowness

I am experiencing general system slowness on my arm64 (rk3399) box - opening a new tmux pane is slow, starting a new zsh instance is slow, starting vim is slow, opening files in vim is slow. All config files are the same as on my 10-year-old amd64 laptop, which behaves as expected. The same arm64 hardware also feels slow with OpenBSD, but feels much more responsive with Linux.

Tests:
Code:
for i in $(seq 1 10) ; do {time zsh -i -c "sleep 0.1 ; exit"} 2>&1 ; done
for i in $(seq 1 10) ; do dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4K conv=sync 2>&1 | tail -1 ; done
for i in $(seq 1 10) ; do fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1 | tail -1 ; done
for i in $(seq 1 10) ; do fio --randrepeat=1 --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4M --iodepth=256 --size=10G --readwrite=read --ramp_time=4 | tail -1 ; done

arm64 FreeBSD:
Code:
running test - zsh
--------------------------------------------------------------------------------
zsh -i -c "sleep 0.1 ; exit"  0.74s user 0.31s system 91% cpu 1.144 total
zsh -i -c "sleep 0.1 ; exit"  0.75s user 0.25s system 90% cpu 1.102 total
zsh -i -c "sleep 0.1 ; exit"  0.67s user 0.32s system 90% cpu 1.098 total
zsh -i -c "sleep 0.1 ; exit"  0.48s user 0.28s system 87% cpu 0.861 total
zsh -i -c "sleep 0.1 ; exit"  0.69s user 0.31s system 90% cpu 1.099 total
zsh -i -c "sleep 0.1 ; exit"  0.65s user 0.36s system 91% cpu 1.107 total
zsh -i -c "sleep 0.1 ; exit"  0.64s user 0.24s system 89% cpu 0.973 total
zsh -i -c "sleep 0.1 ; exit"  0.64s user 0.32s system 90% cpu 1.064 total
zsh -i -c "sleep 0.1 ; exit"  0.75s user 0.25s system 90% cpu 1.107 total
zsh -i -c "sleep 0.1 ; exit"  0.55s user 0.27s system 88% cpu 0.914 total

running test - dd
--------------------------------------------------------------------------------
4294967296 bytes transferred in 10.447751 secs (411090110 bytes/sec)
4294967296 bytes transferred in 10.872883 secs (395016405 bytes/sec)
4294967296 bytes transferred in 10.884630 secs (394590117 bytes/sec)
4294967296 bytes transferred in 10.874099 secs (394972228 bytes/sec)
4294967296 bytes transferred in 10.893587 secs (394265665 bytes/sec)
4294967296 bytes transferred in 10.703562 secs (401265232 bytes/sec)
4294967296 bytes transferred in 10.881222 secs (394713684 bytes/sec)
4294967296 bytes transferred in 10.884584 secs (394591765 bytes/sec)
4294967296 bytes transferred in 10.693615 secs (401638464 bytes/sec)
4294967296 bytes transferred in 10.702620 secs (401300535 bytes/sec)

running test - fio1
--------------------------------------------------------------------------------
  WRITE: bw=2931KiB/s (3001kB/s), 2931KiB/s-2931KiB/s (3001kB/s-3001kB/s), io=172MiB (180MB), run=60022-60022msec
  WRITE: bw=2829KiB/s (2897kB/s), 2829KiB/s-2829KiB/s (2897kB/s-2897kB/s), io=166MiB (174MB), run=60084-60084msec
  WRITE: bw=2902KiB/s (2971kB/s), 2902KiB/s-2902KiB/s (2971kB/s-2971kB/s), io=170MiB (178MB), run=60024-60024msec
  WRITE: bw=2875KiB/s (2944kB/s), 2875KiB/s-2875KiB/s (2944kB/s-2944kB/s), io=168MiB (177MB), run=60009-60009msec
  WRITE: bw=2915KiB/s (2985kB/s), 2915KiB/s-2915KiB/s (2985kB/s-2985kB/s), io=171MiB (179MB), run=60029-60029msec
  WRITE: bw=2957KiB/s (3028kB/s), 2957KiB/s-2957KiB/s (3028kB/s-3028kB/s), io=173MiB (182MB), run=60010-60010msec
  WRITE: bw=2966KiB/s (3037kB/s), 2966KiB/s-2966KiB/s (3037kB/s-3037kB/s), io=174MiB (182MB), run=60024-60024msec
  WRITE: bw=2850KiB/s (2919kB/s), 2850KiB/s-2850KiB/s (2919kB/s-2919kB/s), io=167MiB (175MB), run=60020-60020msec
  WRITE: bw=2857KiB/s (2925kB/s), 2857KiB/s-2857KiB/s (2925kB/s-2925kB/s), io=167MiB (176MB), run=60009-60009msec
  WRITE: bw=2882KiB/s (2951kB/s), 2882KiB/s-2882KiB/s (2951kB/s-2951kB/s), io=169MiB (177MB), run=60017-60017msec

running test - fio2
--------------------------------------------------------------------------------
   READ: bw=237MiB/s (249MB/s), 237MiB/s-237MiB/s (249MB/s-249MB/s), io=9564MiB (10.0GB), run=40339-40339msec
   READ: bw=248MiB/s (260MB/s), 248MiB/s-248MiB/s (260MB/s-260MB/s), io=9124MiB (9567MB), run=36805-36805msec
   READ: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=9480MiB (9941MB), run=38839-38839msec
   READ: bw=248MiB/s (260MB/s), 248MiB/s-248MiB/s (260MB/s-260MB/s), io=9452MiB (9911MB), run=38069-38069msec
   READ: bw=249MiB/s (261MB/s), 249MiB/s-249MiB/s (261MB/s-261MB/s), io=9100MiB (9542MB), run=36514-36514msec
   READ: bw=245MiB/s (257MB/s), 245MiB/s-245MiB/s (257MB/s-257MB/s), io=9444MiB (9903MB), run=38536-38536msec
   READ: bw=251MiB/s (263MB/s), 251MiB/s-251MiB/s (263MB/s-263MB/s), io=8900MiB (9332MB), run=35473-35473msec
   READ: bw=243MiB/s (255MB/s), 243MiB/s-243MiB/s (255MB/s-255MB/s), io=9564MiB (10.0GB), run=39399-39399msec
   READ: bw=246MiB/s (258MB/s), 246MiB/s-246MiB/s (258MB/s-258MB/s), io=9520MiB (9982MB), run=38684-38684msec
   READ: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=9536MiB (9999MB), run=39022-39022msec

arm64 Linux:
Code:
running test - zsh
--------------------------------------------------------------------------------
zsh -i -c "sleep 0.1 ; exit"  0.26s user 0.28s system 92% cpu 0.584 total
zsh -i -c "sleep 0.1 ; exit"  0.24s user 0.25s system 92% cpu 0.532 total
zsh -i -c "sleep 0.1 ; exit"  0.24s user 0.22s system 92% cpu 0.502 total
zsh -i -c "sleep 0.1 ; exit"  0.28s user 0.24s system 92% cpu 0.558 total
zsh -i -c "sleep 0.1 ; exit"  0.26s user 0.30s system 93% cpu 0.598 total
zsh -i -c "sleep 0.1 ; exit"  0.24s user 0.24s system 91% cpu 0.529 total
zsh -i -c "sleep 0.1 ; exit"  0.29s user 0.27s system 92% cpu 0.604 total
zsh -i -c "sleep 0.1 ; exit"  0.26s user 0.21s system 87% cpu 0.536 total
zsh -i -c "sleep 0.1 ; exit"  0.24s user 0.27s system 92% cpu 0.544 total
zsh -i -c "sleep 0.1 ; exit"  0.27s user 0.26s system 91% cpu 0.577 total

running test - dd
--------------------------------------------------------------------------------
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 7.24124 s, 593 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 12.8347 s, 335 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 12.9368 s, 332 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 12.9022 s, 333 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 12.892 s, 333 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.0925 s, 328 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.0692 s, 329 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.0524 s, 329 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.0485 s, 329 MB/s
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 13.0796 s, 328 MB/s

running test - fio1
--------------------------------------------------------------------------------
  WRITE: bw=17.0MiB/s (17.8MB/s), 17.0MiB/s-17.0MiB/s (17.8MB/s-17.8MB/s), io=1022MiB (1071MB), run=60040-60040msec
  WRITE: bw=15.7MiB/s (16.5MB/s), 15.7MiB/s-15.7MiB/s (16.5MB/s-16.5MB/s), io=946MiB (992MB), run=60081-60081msec
  WRITE: bw=15.7MiB/s (16.5MB/s), 15.7MiB/s-15.7MiB/s (16.5MB/s-16.5MB/s), io=942MiB (988MB), run=60040-60040msec
  WRITE: bw=15.8MiB/s (16.6MB/s), 15.8MiB/s-15.8MiB/s (16.6MB/s-16.6MB/s), io=950MiB (996MB), run=60014-60014msec
  WRITE: bw=15.8MiB/s (16.5MB/s), 15.8MiB/s-15.8MiB/s (16.5MB/s-16.5MB/s), io=947MiB (993MB), run=60037-60037msec
  WRITE: bw=15.7MiB/s (16.4MB/s), 15.7MiB/s-15.7MiB/s (16.4MB/s-16.4MB/s), io=942MiB (988MB), run=60077-60077msec
  WRITE: bw=15.8MiB/s (16.5MB/s), 15.8MiB/s-15.8MiB/s (16.5MB/s-16.5MB/s), io=946MiB (992MB), run=60048-60048msec
  WRITE: bw=15.9MiB/s (16.6MB/s), 15.9MiB/s-15.9MiB/s (16.6MB/s-16.6MB/s), io=953MiB (999MB), run=60054-60054msec
  WRITE: bw=15.8MiB/s (16.6MB/s), 15.8MiB/s-15.8MiB/s (16.6MB/s-16.6MB/s), io=950MiB (996MB), run=60011-60011msec
  WRITE: bw=15.7MiB/s (16.5MB/s), 15.7MiB/s-15.7MiB/s (16.5MB/s-16.5MB/s), io=945MiB (990MB), run=60018-60018msec

running test - fio2
--------------------------------------------------------------------------------
   READ: bw=365MiB/s (383MB/s), 365MiB/s-365MiB/s (383MB/s-383MB/s), io=8556MiB (8972MB), run=23430-23430msec
   READ: bw=372MiB/s (390MB/s), 372MiB/s-372MiB/s (390MB/s-390MB/s), io=8544MiB (8959MB), run=22950-22950msec
   READ: bw=373MiB/s (391MB/s), 373MiB/s-373MiB/s (391MB/s-391MB/s), io=8548MiB (8963MB), run=22897-22897msec
   READ: bw=376MiB/s (395MB/s), 376MiB/s-376MiB/s (395MB/s-395MB/s), io=8540MiB (8955MB), run=22692-22692msec
   READ: bw=385MiB/s (404MB/s), 385MiB/s-385MiB/s (404MB/s-404MB/s), io=8536MiB (8951MB), run=22166-22166msec
   READ: bw=378MiB/s (397MB/s), 378MiB/s-378MiB/s (397MB/s-397MB/s), io=8508MiB (8921MB), run=22480-22480msec
   READ: bw=388MiB/s (407MB/s), 388MiB/s-388MiB/s (407MB/s-407MB/s), io=8508MiB (8921MB), run=21937-21937msec
   READ: bw=385MiB/s (404MB/s), 385MiB/s-385MiB/s (404MB/s-404MB/s), io=8484MiB (8896MB), run=22026-22026msec
   READ: bw=381MiB/s (400MB/s), 381MiB/s-381MiB/s (400MB/s-400MB/s), io=8496MiB (8909MB), run=22287-22287msec
   READ: bw=384MiB/s (403MB/s), 384MiB/s-384MiB/s (403MB/s-403MB/s), io=8492MiB (8905MB), run=22115-22115msec

Out of curiosity, here is my 10-year-old laptop with a 5-year-old (used) 2.5" SSD:
Code:
running test - zsh
--------------------------------------------------------------------------------
zsh -i -c "sleep 0.1 ; exit"  0.05s user 0.05s system 53% cpu 0.202 total
zsh -i -c "sleep 0.1 ; exit"  0.07s user 0.04s system 51% cpu 0.205 total
zsh -i -c "sleep 0.1 ; exit"  0.10s user 0.04s system 61% cpu 0.229 total
zsh -i -c "sleep 0.1 ; exit"  0.08s user 0.09s system 64% cpu 0.262 total
zsh -i -c "sleep 0.1 ; exit"  0.08s user 0.05s system 34% cpu 0.382 total
zsh -i -c "sleep 0.1 ; exit"  0.04s user 0.05s system 50% cpu 0.187 total
zsh -i -c "sleep 0.1 ; exit"  0.06s user 0.06s system 52% cpu 0.208 total
zsh -i -c "sleep 0.1 ; exit"  0.06s user 0.05s system 51% cpu 0.226 total
zsh -i -c "sleep 0.1 ; exit"  0.06s user 0.06s system 41% cpu 0.287 total
zsh -i -c "sleep 0.1 ; exit"  0.09s user 0.04s system 51% cpu 0.240 total

running test - dd
--------------------------------------------------------------------------------
4294967296 bytes transferred in 11.294465 secs (380271852 bytes/sec)
4294967296 bytes transferred in 10.897883 secs (394110251 bytes/sec)
4294967296 bytes transferred in 10.766730 secs (398911013 bytes/sec)
4294967296 bytes transferred in 11.137108 secs (385644747 bytes/sec)
4294967296 bytes transferred in 10.855323 secs (395655428 bytes/sec)
4294967296 bytes transferred in 11.843629 secs (362639471 bytes/sec)
4294967296 bytes transferred in 12.342351 secs (347986164 bytes/sec)
4294967296 bytes transferred in 13.853498 secs (310027650 bytes/sec)
4294967296 bytes transferred in 11.067325 secs (388076375 bytes/sec)
4294967296 bytes transferred in 12.458829 secs (344732823 bytes/sec)

running test - fio1
--------------------------------------------------------------------------------
  WRITE: bw=13.3MiB/s (14.0MB/s), 13.3MiB/s-13.3MiB/s (14.0MB/s-14.0MB/s), io=802MiB (841MB), run=60164-60164msec
  WRITE: bw=12.7MiB/s (13.3MB/s), 12.7MiB/s-12.7MiB/s (13.3MB/s-13.3MB/s), io=762MiB (799MB), run=60021-60021msec
  WRITE: bw=8948KiB/s (9162kB/s), 8948KiB/s-8948KiB/s (9162kB/s-9162kB/s), io=527MiB (553MB), run=60347-60347msec
  WRITE: bw=11.7MiB/s (12.2MB/s), 11.7MiB/s-11.7MiB/s (12.2MB/s-12.2MB/s), io=700MiB (734MB), run=60020-60020msec
  WRITE: bw=13.2MiB/s (13.9MB/s), 13.2MiB/s-13.2MiB/s (13.9MB/s-13.9MB/s), io=793MiB (832MB), run=60021-60021msec
  WRITE: bw=14.4MiB/s (15.1MB/s), 14.4MiB/s-14.4MiB/s (15.1MB/s-15.1MB/s), io=867MiB (909MB), run=60062-60062msec
  WRITE: bw=11.2MiB/s (11.8MB/s), 11.2MiB/s-11.2MiB/s (11.8MB/s-11.8MB/s), io=687MiB (720MB), run=61089-61089msec
  WRITE: bw=8843KiB/s (9055kB/s), 8843KiB/s-8843KiB/s (9055kB/s-9055kB/s), io=518MiB (543MB), run=60007-60007msec
  WRITE: bw=10.1MiB/s (10.6MB/s), 10.1MiB/s-10.1MiB/s (10.6MB/s-10.6MB/s), io=608MiB (638MB), run=60006-60006msec
  WRITE: bw=9.95MiB/s (10.4MB/s), 9.95MiB/s-9.95MiB/s (10.4MB/s-10.4MB/s), io=597MiB (626MB), run=60010-60010msec

running test - fio2
--------------------------------------------------------------------------------
   READ: bw=295MiB/s (310MB/s), 295MiB/s-295MiB/s (310MB/s-310MB/s), io=9436MiB (9894MB), run=31967-31967msec
   READ: bw=267MiB/s (280MB/s), 267MiB/s-267MiB/s (280MB/s-280MB/s), io=9160MiB (9605MB), run=34305-34305msec
   READ: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), io=9528MiB (9991MB), run=34447-34447msec
   READ: bw=317MiB/s (333MB/s), 317MiB/s-317MiB/s (333MB/s-333MB/s), io=9156MiB (9601MB), run=28851-28851msec
   READ: bw=359MiB/s (377MB/s), 359MiB/s-359MiB/s (377MB/s-377MB/s), io=8992MiB (9429MB), run=25030-25030msec
   READ: bw=337MiB/s (354MB/s), 337MiB/s-337MiB/s (354MB/s-354MB/s), io=9096MiB (9538MB), run=26979-26979msec
   READ: bw=363MiB/s (381MB/s), 363MiB/s-363MiB/s (381MB/s-381MB/s), io=9084MiB (9525MB), run=25028-25028msec
   READ: bw=356MiB/s (373MB/s), 356MiB/s-356MiB/s (373MB/s-373MB/s), io=9020MiB (9458MB), run=25372-25372msec
   READ: bw=330MiB/s (346MB/s), 330MiB/s-330MiB/s (346MB/s-346MB/s), io=9056MiB (9496MB), run=27464-27464msec
   READ: bw=170MiB/s (178MB/s), 170MiB/s-170MiB/s (178MB/s-178MB/s), io=9660MiB (10.1GB), run=56943-56943msec

More details about the disk subsystem on the slow configuration:
Code:
pciconf -lcv
pcib1@pci0:0:0:0:       class=0x060400 rev=0x00 hdr=0x01 vendor=0x1d87 device=0x0100 subvendor=0x0000 subdevice=0x0000
    vendor     = 'Rockchip Electronics Co., Ltd'
    device     = 'RK3399 PCI Express Root Port'
    class      = bridge
    subclass   = PCI-PCI
    cap 01[80] = powerspec 3  supports D0 D1 D3  current D0
    cap 05[90] = MSI supports 1 message, 64 bit, vector masks
    cap 11[b0] = MSI-X supports 1 message
                 Table in map 0x10[0x0], PBA in map 0x10[0x8]
    cap 10[c0] = PCI-Express 2 root port max data 128(256) RO NS ARI disabled
                 max read 128
                 link x4(x4) speed 5.0(5.0) ASPM disabled(L0s/L1)
                 slot 0 power limit 0 mW
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0017[274] = TPH Requester 1
nvme0@pci0:1:0:0:       class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa804 subvendor=0x144d subdevice=0xa801
    vendor     = 'Samsung Electronics Co Ltd'
    device     = 'NVMe SSD Controller SM961/PM961/SM963'
    class      = mass storage
    subclass   = NVM
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 32 messages, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 128(256) FLR RO NS
                 max read 512
                 link x4(x4) speed 5.0(8.0) ASPM disabled(L1) ClockPM disabled
    cap 11[b0] = MSI-X supports 8 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 0 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = L1 PM Substates 1

smartctl -a /dev/nvme0 | grep -i temperature
Temperature:                        45 Celsius
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               68 Celsius

nvmecontrol identify nvme0ns1
Size:                        500118192 blocks
Capacity:                    500118192 blocks
Utilization:                 500118192 blocks
Thin Provisioning:           Not Supported
Number of LBA Formats:       1
Current LBA Format:          LBA Format #00
Metadata Capabilities
  Extended:                  Not Supported
  Separate:                  Not Supported
Data Protection Caps:        Not Supported
Data Protection Settings:    Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities:    Not Supported
Format Progress Indicator:   0% remains
Deallocate Logical Block:    Read Not Reported
Optimal I/O Boundary:        0 blocks
NVM Capacity:                256060514304 bytes
Globally Unique Identifier:  00000000000000000000000000000000
IEEE EUI64:                  002538ca61006f20
LBA Format #00: Data Size:   512  Metadata Size:     0  Performance: Best

TL;DR:
zsh startup:
arm64 FreeBSD: 1.0469 s
arm64 Linux: 0.5564 s
amd64 FreeBSD old laptop: 0.2428 s

How to debug it? How to make it more responsive?
 
The dd test reading from your SSD is reasonable, within a factor of 2. Getting a few hundred MB/s is good enough for now. Let's not worry about it in round 1. We can tune it later.

Same is true for the second fio test: fast enough for now.

The problem with the first fio test is that I can't figure out what device you're testing this against. Does this run through the file system, or to a raw device? I'm too lazy to look up the man page for fio, and it's not installed on my system. Please explain.

That leaves the zsh problem. Something is very wrong when running a fundamentally empty command under zsh takes 3/4 of a second of CPU time. As covecat already said, check whether it is zsh itself, by running with another shell. I don't have zsh installed on my system, but with either the stock FreeBSD sh or with bash, the answer is less than 10ms of CPU time. I actually suspect that your problem is not zsh itself, but a variety of setup files. I've seen that on some Linux machines: If you allow packages and corporate edicts to install hundreds of setup files that the shell has to execute on startup, it can feel very sluggish. Try removing all shell startup files and recheck.
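A non-destructive way to make that check (just a sketch; zsh's -f option skips the user startup files, so the difference against the normal run is the cost of the startup files alone):
Code:
for i in $(seq 1 10) ; do /usr/bin/time zsh -f -i -c exit ; done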
 
try the same sleep test with /bin/sh and /rescue/sh (statically linked binary)
Code:
for i in $(seq 1 10) ; do time sh -i -c "sleep 0.1 ; exit" ; done
sh -i -c "sleep 0.1 ; exit"  0.03s user 0.04s system 37% cpu 0.169 total
sh -i -c "sleep 0.1 ; exit"  0.02s user 0.04s system 37% cpu 0.172 total
sh -i -c "sleep 0.1 ; exit"  0.02s user 0.05s system 38% cpu 0.176 total
sh -i -c "sleep 0.1 ; exit"  0.02s user 0.06s system 40% cpu 0.182 total
sh -i -c "sleep 0.1 ; exit"  0.01s user 0.06s system 38% cpu 0.176 total
sh -i -c "sleep 0.1 ; exit"  0.00s user 0.06s system 36% cpu 0.171 total
sh -i -c "sleep 0.1 ; exit"  0.02s user 0.05s system 39% cpu 0.173 total
sh -i -c "sleep 0.1 ; exit"  0.01s user 0.06s system 38% cpu 0.176 total
sh -i -c "sleep 0.1 ; exit"  0.01s user 0.05s system 36% cpu 0.171 total
sh -i -c "sleep 0.1 ; exit"  0.01s user 0.06s system 38% cpu 0.176 total

for i in $(seq 1 10) ; do time /rescue/sh -i -c "sleep 0.1 ; exit" ; done
/rescue/sh -i -c "sleep 0.1 ; exit"  0.02s user 0.14s system 39% cpu 0.401 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.01s user 0.05s system 37% cpu 0.175 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.02s user 0.04s system 36% cpu 0.173 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.01s user 0.05s system 36% cpu 0.163 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.01s user 0.03s system 30% cpu 0.159 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.01s user 0.05s system 35% cpu 0.166 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.01s user 0.04s system 30% cpu 0.159 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.02s user 0.03s system 30% cpu 0.159 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.01s user 0.05s system 35% cpu 0.170 total
/rescue/sh -i -c "sleep 0.1 ; exit"  0.01s user 0.05s system 35% cpu 0.167 total

Without sleep:
Code:
for i in $(seq 1 10) ; do time sh -i -c "exit" ; done
sh -i -c "exit"  0.02s user 0.02s system 94% cpu 0.048 total
sh -i -c "exit"  0.00s user 0.05s system 94% cpu 0.048 total
sh -i -c "exit"  0.00s user 0.05s system 94% cpu 0.048 total
sh -i -c "exit"  0.01s user 0.04s system 94% cpu 0.048 total
sh -i -c "exit"  0.02s user 0.03s system 94% cpu 0.048 total
sh -i -c "exit"  0.02s user 0.03s system 94% cpu 0.048 total
sh -i -c "exit"  0.02s user 0.03s system 94% cpu 0.048 total
sh -i -c "exit"  0.00s user 0.05s system 94% cpu 0.048 total
sh -i -c "exit"  0.01s user 0.01s system 91% cpu 0.032 total
sh -i -c "exit"  0.02s user 0.03s system 94% cpu 0.049 total

for i in $(seq 1 10) ; do time /rescue/sh -i -c "exit" ; done
/rescue/sh -i -c "exit"  0.01s user 0.02s system 83% cpu 0.031 total
/rescue/sh -i -c "exit"  0.02s user 0.02s system 88% cpu 0.045 total
/rescue/sh -i -c "exit"  0.02s user 0.02s system 88% cpu 0.044 total
/rescue/sh -i -c "exit"  0.00s user 0.03s system 84% cpu 0.032 total
/rescue/sh -i -c "exit"  0.01s user 0.01s system 82% cpu 0.034 total
/rescue/sh -i -c "exit"  0.01s user 0.01s system 89% cpu 0.029 total
/rescue/sh -i -c "exit"  0.00s user 0.04s system 93% cpu 0.042 total
/rescue/sh -i -c "exit"  0.02s user 0.02s system 93% cpu 0.043 total
/rescue/sh -i -c "exit"  0.01s user 0.03s system 93% cpu 0.043 total
/rescue/sh -i -c "exit"  0.02s user 0.02s system 93% cpu 0.043 total
 
Redid the zsh test with the full config file (without the 0.1 s sleep):

Code:
for i in $(seq 1 10) ; do /usr/bin/time zsh -i -c exit ; done
        1.66 real         0.92 user         0.87 sys
        1.28 real         0.68 user         0.67 sys
        1.57 real         0.81 user         0.93 sys
        1.33 real         0.70 user         0.78 sys
        1.12 real         0.66 user         0.52 sys
        1.07 real         0.55 user         0.59 sys
        1.65 real         0.99 user         0.80 sys
        1.22 real         0.65 user         0.72 sys
        1.52 real         0.84 user         0.82 sys
        1.29 real         0.74 user         0.65 sys

zsh without .zshrc:
Code:
 for i in $(seq 1 10) ; do /usr/bin/time zsh -i -c exit ; done
        0.10 real         0.03 user         0.07 sys
        0.10 real         0.03 user         0.06 sys
        0.10 real         0.00 user         0.09 sys
        0.10 real         0.02 user         0.07 sys
        0.10 real         0.03 user         0.06 sys
        0.10 real         0.03 user         0.05 sys
        0.06 real         0.02 user         0.04 sys
        0.06 real         0.02 user         0.04 sys
        0.10 real         0.03 user         0.07 sys
        0.06 real         0.01 user         0.04 sys

For comparison - the 10-year-old amd64 laptop with the full config:
Code:
for i in $(seq 1 10) ; do /usr/bin/time zsh -i -c exit ; done
        0.19 real         0.08 user         0.08 sys
        0.13 real         0.11 user         0.03 sys
        0.10 real         0.06 user         0.04 sys
        0.10 real         0.06 user         0.05 sys
        0.09 real         0.07 user         0.03 sys
        0.14 real         0.05 user         0.06 sys
        0.15 real         0.07 user         0.06 sys
        0.08 real         0.04 user         0.05 sys
        0.08 real         0.06 user         0.03 sys
        0.12 real         0.06 user         0.05 sys

I do not use oh-my-zsh, npn, git in the prompt, or any of those modern monstrosities.
The only plugins I use:
Code:
# pkg info -x zsh
zsh-5.8
zsh-completions-0.33.0
zsh-syntax-highlighting-0.7.1,1
plus I source /usr/local/share/examples/fzf/shell/key-bindings.zsh.

Yes, my config file will slow down zsh startup, but it trades that time for doing something useful (to me at least). Startup under 200 ms is acceptable (for now :D).
My question is why that same zsh config is much slower on FreeBSD arm64 than on Linux arm64 (on the same machine) or FreeBSD amd64 (not on the same machine).
Even things like man ls feel slower.
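If it is the startup files, zsh's bundled zsh/zprof module can show which part is costing the time; a minimal sketch (assuming a stock zsh and no other profiling hooks):
Code:
# add as the first line of ~/.zshrc
zmodload zsh/zprof

# add as the last line of ~/.zshrc: prints a per-function timing report
# for everything the startup files ran, each time an interactive shell starts
zprof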

Freshly cloned FreeBSD src git repo (on a ZFS filesystem):
Code:
git clone https://git.FreeBSD.org/src.git
...
for i in $(seq 1 10) ; do time git status > /dev/null ; done
git status > /dev/null  1.10s user 21.67s system 177% cpu 12.835 total
git status > /dev/null  0.80s user 17.02s system 220% cpu 8.070 total
git status > /dev/null  0.78s user 10.19s system 164% cpu 6.661 total
git status > /dev/null  1.08s user 14.49s system 149% cpu 10.438 total
git status > /dev/null  0.98s user 16.57s system 191% cpu 9.186 total
git status > /dev/null  0.94s user 13.99s system 129% cpu 11.557 total
git status > /dev/null  1.34s user 18.53s system 155% cpu 12.774 total
git status > /dev/null  0.87s user 16.36s system 214% cpu 8.017 total
git status > /dev/null  1.08s user 15.05s system 168% cpu 9.545 total
git status > /dev/null  0.93s user 16.78s system 169% cpu 10.473 total

To me it seems that something in file operations is slow, so let's try to prove/measure it.
I made a simple test program that simulates file operations (it opens and reads a file, and lists the files in a directory) - something I am guessing zsh/vim/git also do.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

#define FILE_TO_READ "/mnt/data/test/file"
#define DIR_TO_READ "/mnt/data/test/dir"
#define N 10000

/* Open the test file, read its first line, and print it. */
void read_file(void)
{
    FILE *fp;
    char buffer[100];

    fp = fopen(FILE_TO_READ, "r");
    if (fp == NULL)
    {
        printf("Error opening file\n");
        exit(1);
    }

    fgets(buffer, 100, fp);
    printf("buffer: %s\n", buffer);
    fclose(fp);
}

/* Open the test directory and print every entry's name. */
void read_dir(void)
{
    DIR *dir;
    struct dirent *dp;
    char * file_name;

    dir = opendir(DIR_TO_READ);
    if (dir == NULL)
    {
        printf("Error opening dir\n");
        exit(1);
    }

    while ((dp = readdir(dir)) != NULL)
    {
        file_name = dp->d_name;
        printf("\"%s\"\n",file_name);
    }
    closedir(dir);
}

int main(void)
{

    /* Repeat both operations N times so the per-call cost adds up to something measurable. */
    for (int i=0; i<N; i++)
    {
        read_file();
        read_dir();
    }

    return 0;
}

FreeBSD arm64:
Code:
for i in $(seq 1 10) ; do time ./test-file-open > /dev/null ; done
./test-file-open > /dev/null  2.13s user 5.56s system 99% cpu 7.692 total
./test-file-open > /dev/null  2.05s user 5.01s system 99% cpu 7.063 total
./test-file-open > /dev/null  2.59s user 6.51s system 99% cpu 9.111 total
./test-file-open > /dev/null  2.53s user 6.41s system 99% cpu 8.952 total
./test-file-open > /dev/null  2.83s user 6.44s system 99% cpu 9.286 total
./test-file-open > /dev/null  1.84s user 5.24s system 99% cpu 7.084 total
./test-file-open > /dev/null  2.50s user 6.58s system 99% cpu 9.085 total
./test-file-open > /dev/null  2.60s user 6.46s system 99% cpu 9.071 total
./test-file-open > /dev/null  2.24s user 5.43s system 99% cpu 7.686 total
./test-file-open > /dev/null  2.17s user 5.70s system 99% cpu 7.879 total

Linux arm64 (same hardware, same FS, same test program, same repo):
Code:
for i in $(seq 1 10) ; do time ./test-file-open > /dev/null ; done
./test-file-open > /dev/null  0.34s user 0.83s system 98% cpu 1.176 total
./test-file-open > /dev/null  0.36s user 0.80s system 99% cpu 1.170 total
./test-file-open > /dev/null  0.31s user 0.85s system 99% cpu 1.169 total
./test-file-open > /dev/null  0.37s user 0.79s system 99% cpu 1.170 total
./test-file-open > /dev/null  0.33s user 0.81s system 99% cpu 1.155 total
./test-file-open > /dev/null  0.28s user 0.86s system 99% cpu 1.149 total
./test-file-open > /dev/null  0.31s user 0.86s system 99% cpu 1.178 total
./test-file-open > /dev/null  0.33s user 0.82s system 99% cpu 1.151 total
./test-file-open > /dev/null  0.30s user 0.86s system 99% cpu 1.166 total
./test-file-open > /dev/null  0.35s user 0.82s system 99% cpu 1.176 total

for i in $(seq 1 10) ; do time git status > /dev/null ; done
git status > /dev/null  0.36s user 1.84s system 193% cpu 1.143 total
git status > /dev/null  0.30s user 1.64s system 186% cpu 1.042 total
git status > /dev/null  0.33s user 1.57s system 186% cpu 1.020 total
git status > /dev/null  0.33s user 1.63s system 187% cpu 1.044 total
git status > /dev/null  0.27s user 1.53s system 156% cpu 1.151 total
git status > /dev/null  0.30s user 1.39s system 173% cpu 0.977 total
git status > /dev/null  0.32s user 1.53s system 185% cpu 1.000 total
git status > /dev/null  0.28s user 1.63s system 186% cpu 1.022 total
git status > /dev/null  0.32s user 1.62s system 184% cpu 1.047 total
git status > /dev/null  0.26s user 1.54s system 159% cpu 1.125 total

FreeBSD git: 10 s
Linux git: 1.06 s
FreeBSD test.c: 8.3 s
Linux test.c: 1.17 s

So on the same hardware, the same FS, and the same repo, FreeBSD's git status is about 10x slower than Linux's, and the simple C test program is about 7x slower.
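One way to see where the extra time goes at the syscall level (a sketch using the base ktrace/kdump tools; ktrace writes ktrace.out in the current directory):
Code:
# record all syscalls of one run, discarding the program's own output
ktrace ./test-file-open > /dev/null
# dump the trace with relative timestamps between entries to spot the slow calls
kdump -R | less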
 
Tests redone without powerd/powerdxx and with the CPU frequency set to max:
Code:
for i in $(seq 1 10) ; do time git status > /dev/null ; done
git status > /dev/null  0.68s user 10.54s system 174% cpu 6.426 total
git status > /dev/null  0.70s user 9.98s system 171% cpu 6.223 total
git status > /dev/null  0.55s user 9.33s system 153% cpu 6.420 total
git status > /dev/null  0.60s user 10.06s system 175% cpu 6.055 total
git status > /dev/null  0.61s user 10.36s system 156% cpu 7.018 total
git status > /dev/null  0.66s user 10.62s system 153% cpu 7.367 total
git status > /dev/null  0.66s user 9.80s system 166% cpu 6.288 total
git status > /dev/null  0.65s user 9.63s system 148% cpu 6.923 total
git status > /dev/null  0.75s user 9.91s system 156% cpu 6.804 total
git status > /dev/null  0.66s user 10.59s system 167% cpu 6.715 total

for i in $(seq 1 10) ; do time ./test-file-open > /dev/null ; done
./test-file-open > /dev/null  0.57s user 1.13s system 99% cpu 1.698 total
./test-file-open > /dev/null  0.97s user 2.79s system 99% cpu 3.763 total
./test-file-open > /dev/null  0.54s user 1.14s system 99% cpu 1.682 total
./test-file-open > /dev/null  1.08s user 2.64s system 99% cpu 3.724 total
./test-file-open > /dev/null  1.04s user 2.70s system 99% cpu 3.735 total
./test-file-open > /dev/null  1.02s user 2.69s system 99% cpu 3.719 total
./test-file-open > /dev/null  0.77s user 2.14s system 99% cpu 2.913 total
./test-file-open > /dev/null  0.57s user 1.16s system 99% cpu 1.731 total
./test-file-open > /dev/null  0.63s user 1.11s system 99% cpu 1.746 total
./test-file-open > /dev/null  1.10s user 2.62s system 99% cpu 3.712 total


FreeBSD git: 6.6 s
Linux git: 1.1 s
FreeBSD test.c: 2.84 s
Linux test.c: 1.17 s

Better, but git status is still 5x slower on FreeBSD, and the sample test program is 2.4x slower.
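For reference, disabling the scaling daemons and pinning the frequency looks roughly like this (a sketch; 2016 MHz is the value the later test script pins the big cores to):
Code:
service powerd stop
service powerdxx stop
# check the available frequency levels, then pin the big-core cluster to the top one
sysctl dev.cpu.5.freq_levels
sysctl dev.cpu.5.freq=2016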
 
what does "vmstat -i" output? do you have rctl enabled? what are your sysctls kern.randompid and kern.sched.preempt_thresh?
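i.e., on the slow box:
Code:
vmstat -i
rctl
sysctl kern.randompid kern.sched.preempt_thresh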
 
Roughly how many files and directories are in the directory tree that your test-file-open program runs over?

Still, whatever the scale is, the fact that FreeBSD on the same hardware with the same size directory tree is 5-10 times slower than Linux is really worrisome. I presume they all use "reasonable" (non-exotic) file systems? There is no FUSE anywhere nearby? No network file systems, iSCSI, loop mounts?
 
Which version of FreeBSD, exactly?

… arm64 (rk3399) …

… same zsh config is much slower on FreeBSD arm64 than on Linux arm64 (on the same machine) or FreeBSD amd64 (not on the same machine). … Freshly cloned FreeBSD src git repo (on ZFS filesystem): …

Which versions of ZFS?

FreeBSD git: 10 s
Linux git: 1.06 s
FreeBSD test.c: 8.3 s
Linux test.c: 1.17 s

Are all file system properties identical?

L2ARC in either case?
 
OK, I did some testing on a Pi Zero (armv6).
After playing a bit with the C source above I reached the conclusion that printf performance sucks.
Then I built libc with gcc10 (kind of: some files failed to compile (as failed with "bad instruction"??), so I built those objects with clang).
I just went to /usr/src/lib/libc and ran make CC=gcc.
The gcc-compiled libc is significantly faster:
Code:
[user@rpi-b ~]$ cc test.c -o test
[user@rpi-b ~]$ unset LD_LIBRARY_PATH
[user@rpi-b ~]$ ldd ./test
./test:
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x200af000)
        libc.so.7 => /lib/libc.so.7 (0x200ec000)
[user@rpi-b ~]$ time ./test >/dev/null

real    0m15.187s
user    0m10.797s
sys     0m4.168s
[user@rpi-b ~]$ export LD_LIBRARY_PATH=/home/user
[user@rpi-b ~]$ ldd ./test
./test:
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x200af000)
        libc.so.7 => /home/user/libc.so.7 (0x20100000)
[user@rpi-b ~]$ time ./test >/dev/null

real    0m11.977s
user    0m7.502s
sys     0m4.289s
[user@rpi-b ~]$ ls -l dir/|wc -l
     168
Test 2:
Code:
[user@rpi-b ~]$ unset LD_LIBRARY_PATH
[user@rpi-b ~]$ time LC_ALL=C col -b </boot/kernel/kernel |md5
col: warning: can't back up -- line already flushed
f981722f1943c540c0d6ebb02d74228f

real    0m15.969s
user    0m15.270s
sys     0m0.476s
[user@rpi-b ~]$ export LD_LIBRARY_PATH=/home/user
[user@rpi-b ~]$ time LC_ALL=C col -b </boot/kernel/kernel |md5
col: warning: can't back up -- line already flushed
f981722f1943c540c0d6ebb02d74228f

real    0m8.958s
user    0m8.280s
sys     0m0.545s
Here the gcc-compiled libc is 2x faster than the clang one (md5 is only for verification; it makes no speed difference).
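For reference, the steps described above as a sketch (assuming the freshly built libc.so.7 is copied into /home/user, as in the ldd output):
Code:
cd /usr/src/lib/libc
make CC=gcc
# copy the freshly built libc.so.7 from the obj tree into /home/user, then
# point the run-time linker at it without touching the system /lib:
export LD_LIBRARY_PATH=/home/user
ldd ./test    # should now show libc.so.7 => /home/user/libc.so.7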
 
what does "vmstat -i" output? do you have rctl enabled? what are your sysctls kern.randompid and kern.sched.preempt_thresh?

Code:
# sysctl kern.randompid kern.sched.preempt_thresh
kern.randompid: 0
kern.sched.preempt_thresh: 80

# rctl
rctl: RACCT/RCTL present, but disabled; enable using kern.racct.enable=1 tunable
zsh: exit 1     rctl

# vmstat -i
interrupt                                             total       rate
gic0,p11:-ic_timer0                                  247971        270
gic0,s11:-dhci_fdt0                                     831          1
gic0,s26: ehci0                                           2          0
gic0,s28: ohci0                                          77          0
gic0,s34: rk_i2c2                                        18          0
gic0,s53: spi0                                         7431          8
gic0,s57: rk_i2c4                                      3879          4
gic0,s65:-ip_dwmmc1                                     147          0
gic0,s100: uart1                                        222          0
gic0,s110: xhci0                                      55354         60
its0,0: nvme0                                            20          0
its0,1: nvme0                                          3655          4
its0,2: nvme0                                          3199          3
its0,3: nvme0                                          3364          4
its0,4: nvme0                                          3395          4
its0,5: nvme0                                          4210          5
its0,6: nvme0                                          4181          5
cpu0:ast                                                 89          0
cpu1:ast                                                119          0
cpu2:ast                                                108          0
cpu3:ast                                                125          0
cpu4:ast                                                130          0
cpu5:ast                                                 82          0
cpu0:preempt                                          64752         71
cpu1:preempt                                          77734         85
cpu2:preempt                                          82602         90
cpu3:preempt                                          76268         83
cpu4:preempt                                          43678         48
cpu5:preempt                                          82451         90
cpu0:rendezvous                                          56          0
cpu1:rendezvous                                          81          0
cpu2:rendezvous                                          36          0
cpu3:rendezvous                                          79          0
cpu4:rendezvous                                          79          0
cpu5:rendezvous                                          80          0
cpu0:hardclock                                         6162          7
Total                                                772667        843

# uptime
 3:39PM  up 16 mins, 11 users, load averages: 0.42, 0.34, 0.32

Roughly how many files and directories are in the directory tree that your test-file-open program runs over?

Still, whatever the scale is, the fact that FreeBSD on the same hardware with the same size directory tree is 5-10 times slower than Linux is really worrisome. I presume they all use "reasonable" (non-exotic) file systems? There is no FUSE anywhere nearby? No network file systems, iSCSI, loop mounts?

The directory tree is /etc copied to /mnt/data/test/dir.
The file is /usr/share/misc/termcap copied to /mnt/data/test/termcap and symlinked as /mnt/data/test/file.
Code:
# ls -1R /mnt/data/test/dir | wc -l
     788

# find /mnt/data/test/dir -type f | wc -l
     371
# find /mnt/data/test/dir -type d | wc -l
      69
# find /mnt/data/test/dir -type l | wc -l
     213

# ls -lh /mnt/data/test/file /mnt/data/test/termcap ; wc -l /mnt/data/test/file
lrwxr-xr-x  1 root  wheel     7B Jan 12 12:52 /mnt/data/test/file -> termcap
-r--r--r--  1 root  wheel   208K Jan 12 09:54 /mnt/data/test/termcap
    4778 /mnt/data/test/file

The filesystem is the same ZFS dataset in both cases, on the same hardware.
Only the NVMe drive is used (partition 1 - EFI, partition 4 - ZFS, which holds the data and the FreeBSD OS).
Code:
# gpart show -l
=>      40  30535600  mmcsd1  GPT  (15G)
        40  30535600       1  emmc  (15G)

=>     2048  500116111  diskid/DISK-S34WNY0HA03325  GPT  (238G)
       2048     262144                           1  efi  (128M)
     264192  104857600                           2  nvme-armbian  (50G)
  105121792  104857600                           3  nvme-openbsd  (50G)
  209979392  290138760                           4  nvme-zfs  (138G)
  500118152          7                              - free -  (3.5K)

=>      63  30535537  mmcsd1p1  MBR  (15G)
        63  30535537            - free -  (15G)

=>      40  30535600  diskid/DISK-1C15BAC2  GPT  (15G)
        40  30535600                     1  emmc  (15G)

=>      63  30535537  gpt/emmc  MBR  (15G)
        63  30535537            - free -  (15G)

=>      63  30535537  gptid/27043f2f-55e5-11ec-ada9-682719ace28b  MBR  (15G)
        63  30535537                                              - free -  (15G)

=>      63  30535537  diskid/DISK-1C15BAC2p1  MBR  (15G)
        63  30535537                          - free -  (15G)

# mount
miki-zfs/ROOT/generic2 on / (zfs, local, nfsv4acls)
devfs on /dev (devfs)
miki-zfs/ROOT/generic2/usr-local on /usr/local (zfs, local, nfsv4acls)
miki-zfs on /mnt/data (zfs, local, nfsv4acls)
miki-zfs/home-bsd on /usr/home (zfs, local, nfsv4acls)
miki-zfs/ports on /usr/ports (zfs, local, noatime, nosuid, nfsv4acls)
miki-zfs/obj on /usr/obj (zfs, local, nfsv4acls)
miki-zfs/src on /mnt/data/src (zfs, local, nfsv4acls)
miki-zfs/ROOT on /mnt/data/ROOT (zfs, local, nfsv4acls)
miki-zfs/vm on /mnt/data/vm (zfs, local, nfsv4acls)
miki-zfs/home-bsd/johnny on /usr/home/johnny (zfs, local, nfsv4acls)
miki-zfs/root on /mnt/data/root (zfs, local, nfsv4acls)
miki-zfs/home-bsd/root on /usr/home/root (zfs, local, nfsv4acls)
miki-zfs/home on /mnt/data/home (zfs, local, nfsv4acls)
miki-zfs/src-bsd on /usr/src (zfs, local, noatime, nfsv4acls)


# zfs get all miki-zfs
NAME      PROPERTY              VALUE                  SOURCE
miki-zfs  type                  filesystem             -
miki-zfs  creation              Fri Sep 25 12:44 2020  -
miki-zfs  used                  116G                   -
miki-zfs  available             17.3G                  -
miki-zfs  referenced            3.14G                  -
miki-zfs  compressratio         1.51x                  -
miki-zfs  mounted               yes                    -
miki-zfs  quota                 none                   default
miki-zfs  reservation           none                   default
miki-zfs  recordsize            128K                   default
miki-zfs  mountpoint            /mnt/data              local
miki-zfs  sharenfs              off                    default
miki-zfs  checksum              on                     default
miki-zfs  compression           lz4                    local
miki-zfs  atime                 on                     default
miki-zfs  devices               on                     default
miki-zfs  exec                  on                     default
miki-zfs  setuid                on                     default
miki-zfs  readonly              off                    local
miki-zfs  jailed                off                    default
miki-zfs  snapdir               hidden                 default
miki-zfs  aclmode               discard                default
miki-zfs  aclinherit            restricted             default
miki-zfs  createtxg             1                      -
miki-zfs  canmount              on                     default
miki-zfs  xattr                 on                     default
miki-zfs  copies                1                      default
miki-zfs  version               5                      -
miki-zfs  utf8only              off                    -
miki-zfs  normalization         none                   -
miki-zfs  casesensitivity       sensitive              -
miki-zfs  vscan                 off                    default
miki-zfs  nbmand                off                    default
miki-zfs  sharesmb              off                    default
miki-zfs  refquota              none                   default
miki-zfs  refreservation        none                   default
miki-zfs  guid                  12119281857190425953   -
miki-zfs  primarycache          all                    default
miki-zfs  secondarycache        all                    default
miki-zfs  usedbysnapshots       432K                   -
miki-zfs  usedbydataset         3.14G                  -
miki-zfs  usedbychildren        113G                   -
miki-zfs  usedbyrefreservation  0B                     -
miki-zfs  logbias               latency                default
miki-zfs  objsetid              54                     -
miki-zfs  dedup                 off                    default
miki-zfs  mlslabel              none                   default
miki-zfs  sync                  standard               default
miki-zfs  dnodesize             legacy                 default
miki-zfs  refcompressratio      1.23x                  -
miki-zfs  written               3.14G                  -
miki-zfs  logicalused           168G                   -
miki-zfs  logicalreferenced     3.72G                  -
miki-zfs  volmode               default                default
miki-zfs  filesystem_limit      none                   default
miki-zfs  snapshot_limit        none                   default
miki-zfs  filesystem_count      none                   default
miki-zfs  snapshot_count        none                   default
miki-zfs  snapdev               hidden                 default
miki-zfs  acltype               nfsv4                  default
miki-zfs  context               none                   default
miki-zfs  fscontext             none                   default
miki-zfs  defcontext            none                   default
miki-zfs  rootcontext           none                   default
miki-zfs  relatime              off                    default
miki-zfs  redundant_metadata    all                    default
miki-zfs  overlay               on                     default
miki-zfs  encryption            off                    default
miki-zfs  keylocation           none                   default
miki-zfs  keyformat             none                   default
miki-zfs  pbkdf2iters           0                      default
miki-zfs  special_small_blocks  0                      default


Which version of FreeBSD, exactly?
Which versions of ZFS?
Are all file system properties identical?
L2ARC in either case?

It was -CURRENT from a few days ago with the GENERIC kernel; I lost the exact git hash.
Code:
# zfs-stats -IL

------------------------------------------------------------------------
ZFS Subsystem Report                            Sat Jan 15 22:54:27 2022
------------------------------------------------------------------------

System Information:

        Kernel Version:                         1400047 (osreldate)
        Hardware Platform:                      arm64
        Processor Architecture:                 aarch64

        ZFS Storage pool Version:               5000
        ZFS Filesystem Version:                 5

FreeBSD 14.0-CURRENT #3 master-n252421-8c0c5bdf9d5: Thu Jan 13 18:10:09 CET 2022 root
10:54PM  up  2:15, 9 users, load averages: 1.40, 2.86, 4.58

------------------------------------------------------------------------

L2ARC is disabled

------------------------------------------------------------------------


(Me too … then there's the mix of platforms at post 5 …)

I forgot to ask: how much memory in each case?
It's the same hardware - arm64 with 4 GB of RAM.
amd64 ("10 years old laptop") was added as an example.
 
-CURRENT from a few days ago with GENERIC kernel,

Thanks, I do know that CURRENT can be slower for some things; I don't know whether any of the slowness above might be attributable.

Please see <https://github.com/freebsd/freebsd-...445caf92b36df2d8bfc6b76d456d/UPDATING#L14-L27>

A gentle hint, for you to know that FreeBSD 14.0-CURRENT is not supported here:


For performance, more than debugging:

/etc/src.conf

There, the first of the two uncommented lines is to improve performance. I don't know how, technically, but I learnt the line from somewhere.

The second of the two uncommented lines is probably something that I learnt wrong, it seems to cause the build failure that's mentioned at <https://forums.freebsd.org/posts/551280> (I don't expect an answer there). Instead, I normally use KERNCONF=GENERIC-NODEBUG at the command line when building the kernel.

Unusually, for the past few days I have been using a built GENERIC kernel, which I suppose should be slower, although in everyday use I can't sense any difference.
 
Thanks for the useful reply, especially since -CURRENT is mentioned.
I know it is not supported here; I will test with -RELEASE later.

The slow performance was noticed a few months ago on -CURRENT compiled by myself (with a minimal system and a kernel without debug options).
The tests in the previous posts were run on generic -CURRENT (without customisation, but I forgot to include the malloc line in make.conf and to use a non-debug kernel).
But let's try it again in a more controlled environment.

Test:
test.c
- run it 10x, compiled with cc, clang13 from ports, and gcc
- repeat for two iterations
- pin it to the big CPU cores and repeat
- pin it to the little CPU cores and repeat

git
- run 10x git status
- pin it to big CPU cores and repeat
- pin it to small CPU cores and repeat
- try other combinations, pinning the kernel and/or git to the big/little CPU cores

Test data:
- test.c from a few posts back
- file open: /usr/share/misc/termcap copied to /mnt/data/test/termcap and symlinked as /mnt/data/test/file
- dir open: /etc copied to /mnt/data/test/dir
- git repo: FreeBSD src, freshly cloned a few days earlier, clean, placed under /mnt/data/test/src
- all of that on the same ZFS dataset that was used on Linux arm64

Test script:
Code:
#!/bin/sh

ITER=2

cd /mnt/data/test
rm test*.out test-file-open-*

service powerd stop   2>&1 > /dev/null
service powerdxx stop 2>&1 > /dev/null
sysctl dev.cpu.5.freq=2016

system_info()
{
    echo "System info:"                 >> test.out
    echo "# uname -srv "                >> test.out
    uname -srv                          >> test.out
    echo ""                             >> test.out

    echo "src-env.conf"                 >> test.out
    grep -Ev '^#|^$' /etc/src-env.conf  >> test.out
    echo ""                             >> test.out

    echo "src.conf"                     >> test.out
    grep -Ev '^#|^$' /etc/src.conf      >> test.out
    echo ""                             >> test.out

    echo "make.conf"                    >> test.out
    grep -Ev '^#|^$' /etc/make.conf     >> test.out
    echo ""                             >> test.out

    echo "bootfs"                       >> test.out
    zpool get -H -o value bootfs        >> test.out
    echo ""                             >> test.out
}

run_test_c()
{
    CPU=$1

    # one empty run, one real run
    for compiler in /usr/bin/cc /usr/local/bin/clang13 /usr/local/bin/gcc10 ; do
        name=$(echo "$compiler" | tr '/' '_')
        $compiler -Wall test-file-open.c -o test-file-open-$name

        echo "Doing empty run with $compiler"
        for i in $(seq 1 10) ; do time $(echo $CPU) ./test-file-open-$name ; done > /dev/null

        for i in $(seq 1 10) ; do /usr/bin/time $(echo $CPU) ./test-file-open-$name 2>&1 > /dev/null ; done >> test_tmp.out

        time=$(tail -n 10 test_tmp.out | awk '{print $1}' | paste -sd+ - | bc)

        printf "Testing with %24s %6s seconds\n" $compiler $time >> test.out
    done
}

run_test_git()
{
    CPU=$1
    echo "Doing empty run with git"
    # for i in $(seq 1 10) ; do /usr/bin/time $(echo $CPU) git -C /mnt/data/test/src status > /dev/null ; done

    echo "Doing real run with git"
    for i in $(seq 1 10) ; do /usr/bin/time $(echo $CPU) git -C /mnt/data/test/src status 2>&1 > /dev/null ; done >> test_tmp.out

    time=$(tail -n 10 test_tmp.out | awk '{print $1}' | paste -sd+ - | bc)
    printf "Testing git %6s seconds\n" $time >> test.out
}

run_c_tests()
{
    echo "Testing on default CPU cores" >> test.out
    for i in $(seq 1 $ITER) ; do
        run_test_c
    done

    echo "Testing on big CPU cores (x2)" >> test.out
    for i in $(seq 1 $ITER) ; do
        run_test_c "cpuset -c -l 4,5"
    done

    echo "Testing on little CPU cores (x4)" >> test.out
    for i in $(seq 1 $ITER2) ; do
        run_test_c "cpuset -c -l 0-3"
    done

    echo "Testing git on default CPU cores" >> test.out
    for i in $(seq 1 $ITER) ; do
        run_test_git
    done

    echo "Testing git on big CPU cores (x2)" >> test.out
    for i in $(seq 1 $ITER) ; do
        run_test_git "cpuset -c -l 4,5"
    done

    echo "Testing git on little CPU cores (x4)" >> test.out
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 0-3"
    done
}

run_git_tests()
{
    echo "Testing git on big CPU cores (x2), kernel on big CPU cores (x2)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 4,5 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 4,5"
    done

    echo "Testing git on big CPU cores (x2), kernel on small CPU cores (x4)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 0-3 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 4,5"
    done

    echo "Testing git on small CPU cores (x4), kernel on big CPU cores (x2)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 4,5 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 0-3"
    done

    echo "Testing git on small CPU cores (x4), kernel on small CPU cores (x4)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 0-3 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 0-3"
    done

    echo "Testing git on small CPU cores (x4), kernel on all CPU cores (x6)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 0-5 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 0-3"
    done

    echo "Testing git on big CPU cores (x2), kernel on all CPU cores (x6)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 0-5 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 4,5"
    done

    echo "Testing git on all CPU cores (x6), kernel on all CPU cores (x6)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 0-5 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 0-5"
    done

    echo "Testing git on all CPU cores (x6), kernel on small CPU cores (x6)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 0-3 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 0-5"
    done

    echo "Testing git on all CPU cores (x6), kernel on big CPU cores (x2)" >> test.out
    for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 4,5 -p $i ; done
    for i in $(seq 1 $ITER2) ; do
        run_test_git "cpuset -c -l 0-5"
    done
}

# do not pin kernel process to any specific CPUs:
for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 0-5 -p $i ; done
system_info
run_c_tests
run_git_tests

Results on default FreeBSD -CURRENT (powerd/powerdxx disabled, max CPU freq)
Code:
System info:
# uname -srv
FreeBSD 14.0-CURRENT FreeBSD 14.0-CURRENT #0 master-n252421-8c0c5bdf9d5: Tue Jan 18 18:41:04 CET 2022     root@free-miki:/usr/obj/sys-generic/usr/src/arm64.aarch64/sys/GENERIC-NODEBUG

src-env.conf
WITH_META_MODE=
MAKEOBJDIRPREFIX?=/usr/obj/sys-generic
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/opt/bin

src.conf

make.conf
WITH_MALLOC_PRODUCTION=yes
KERNCONF=GENERIC-NODEBUG

bootfs
miki-zfs/ROOT/generic2

Testing on default CPU cores
Testing with              /usr/bin/cc  15.04 seconds
Testing with   /usr/local/bin/clang13  14.33 seconds
Testing with     /usr/local/bin/gcc10  15.63 seconds
Testing with              /usr/bin/cc  13.51 seconds
Testing with   /usr/local/bin/clang13  17.97 seconds
Testing with     /usr/local/bin/gcc10  16.28 seconds
Testing on big CPU cores (x2)
Testing with              /usr/bin/cc  10.02 seconds
Testing with   /usr/local/bin/clang13  10.00 seconds
Testing with     /usr/local/bin/gcc10  10.08 seconds
Testing with              /usr/bin/cc  10.01 seconds
Testing with   /usr/local/bin/clang13  10.00 seconds
Testing with     /usr/local/bin/gcc10  10.04 seconds
Testing on little CPU cores (x4)
Testing with              /usr/bin/cc  20.44 seconds
Testing with   /usr/local/bin/clang13  20.38 seconds
Testing with     /usr/local/bin/gcc10  20.37 seconds
Testing git on default CPU cores
Testing git  28.60 seconds
Testing git  27.71 seconds
Testing git on big CPU cores (x2)
Testing git  28.34 seconds
Testing git  28.92 seconds
Testing git on little CPU cores (x4)
Testing git  33.81 seconds
Testing git on big CPU cores (x2), kernel on big CPU cores (x2)
Testing git  22.98 seconds
Testing git on big CPU cores (x2), kernel on small CPU cores (x4)
Testing git  30.01 seconds
Testing git on small CPU cores (x4), kernel on big CPU cores (x2)
Testing git  29.13 seconds
Testing git on small CPU cores (x4), kernel on small CPU cores (x4)
Testing git  35.63 seconds
Testing git on small CPU cores (x4), kernel on all CPU cores (x6)
Testing git  30.35 seconds
Testing git on big CPU cores (x2), kernel on all CPU cores (x6)
Testing git  30.29 seconds
Testing git on all CPU cores (x6), kernel on all CPU cores (x6)
Testing git  28.54 seconds
Testing git on all CPU cores (x6), kernel on small CPU cores (x6)
Testing git  26.62 seconds
Testing git on all CPU cores (x6), kernel on big CPU cores (x2)
Testing git  26.15 seconds

The tests run much faster with powerd/powerdxx disabled and the CPU frequency set to max.
Also, pinning the (single-threaded) test.c to the big ARM cores gives a significant improvement.
There is not much difference between compilers (when runs are pinned to a specific CPU cluster).
The fastest run is 1.0 second per iteration when pinned to the two big CPU cores (faster than Linux's 1.1 s, although Linux was running with dynamic CPU scaling).

Still, the fastest FreeBSD git status is significantly slower than Linux (2.3 seconds per iteration vs 1.1 s with dynamic CPU scaling).
 
i just went to /usr/src/lib/libc and make CC=gcc
gcc compiled libc is significantly faster

here gcc compiled libc is 2x faster than the clang one (md5 is only for verification, makes no speed diff)

Hmm, interesting find!
I rebuilt libc with gcc, installed it into a separate boot environment, and rebooted.
Code:
make CC=gcc WITHOUT_TESTS= install

Results: (default FreeBSD -CURRENT but libc built with GCC, powerdxx disabled, max CPU freq)
Code:
System info:
# uname -srv 
FreeBSD 14.0-CURRENT FreeBSD 14.0-CURRENT #0 master-n252421-8c0c5bdf9d5: Tue Jan 18 18:41:04 CET 2022     root@free-miki:/usr/obj/sys-generic/usr/src/arm64.aarch64/sys/GENERIC-NODEBUG 

src-env.conf
WITH_META_MODE=
MAKEOBJDIRPREFIX?=/usr/obj/sys-generic
PATH=/bin:/sbin:/usr/bin:/usr/sbin:/opt/bin

src.conf

make.conf
WITH_MALLOC_PRODUCTION=yes
KERNCONF=GENERIC-NODEBUG

bootfs
miki-zfs/ROOT/generic2-gcc-libc

Testing on default CPU cores
Testing with              /usr/bin/cc  17.52 seconds
Testing with   /usr/local/bin/clang13  17.79 seconds
Testing with     /usr/local/bin/gcc10  14.40 seconds
Testing with              /usr/bin/cc  17.35 seconds
Testing with   /usr/local/bin/clang13  16.89 seconds
Testing with     /usr/local/bin/gcc10  17.96 seconds
Testing on big CPU cores (x2)
Testing with              /usr/bin/cc   9.68 seconds
Testing with   /usr/local/bin/clang13   9.79 seconds
Testing with     /usr/local/bin/gcc10   9.74 seconds
Testing with              /usr/bin/cc   9.70 seconds
Testing with   /usr/local/bin/clang13   9.67 seconds
Testing with     /usr/local/bin/gcc10   9.71 seconds
Testing on little CPU cores (x4)
Testing with              /usr/bin/cc  19.71 seconds
Testing with   /usr/local/bin/clang13  19.94 seconds
Testing with     /usr/local/bin/gcc10  19.86 seconds
Testing git on default CPU cores
Testing git  29.33 seconds
Testing git  28.79 seconds
Testing git on big CPU cores (x2)
Testing git  29.00 seconds
Testing git  28.00 seconds
Testing git on little CPU cores (x4)
Testing git  31.53 seconds
Testing git on big CPU cores (x2), kernel on big CPU cores (x2)
Testing git  24.08 seconds
Testing git on big CPU cores (x2), kernel on small CPU cores (x4)
Testing git  29.36 seconds
Testing git on small CPU cores (x4), kernel on big CPU cores (x2)
Testing git  30.39 seconds
Testing git on small CPU cores (x4), kernel on small CPU cores (x4)
Testing git  38.76 seconds
Testing git on small CPU cores (x4), kernel on all CPU cores (x6)
Testing git  31.95 seconds
Testing git on big CPU cores (x2), kernel on all CPU cores (x6)
Testing git  28.95 seconds
Testing git on all CPU cores (x6), kernel on all CPU cores (x6)
Testing git  27.95 seconds
Testing git on all CPU cores (x6), kernel on small CPU cores (x6)
Testing git  27.97 seconds
Testing git on all CPU cores (x6), kernel on big CPU cores (x2)
Testing git  25.12 seconds

test.c seems a little faster with the GCC libc (0.96 seconds per iteration on the big CPU cores vs 1.0).
git status, however, is a little slower with the GCC libc (2.4 seconds for the best result vs 2.3 seconds).
Is there some other lib to recompile so I can try again?
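One way to see which shared libraries a binary actually pulls in, and therefore what else might be worth rebuilding (a sketch; the /usr/local paths assume the pkg-installed binaries):
Code:
ldd /usr/local/bin/git
ldd /usr/local/bin/zsh
ldd ./test-file-open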
 
PS, learnt while doing the above tests (for future googlers/duckduckgoers):

How to find kernel processes with PID, parent PID (which is 0 for kernel threads), and process name:
Code:
# ps ax -e -o pid,ppid,args | awk '$2 == 0'
   0    0 [kernel]
   1    0 /sbin/init
   2    0 [clock]
   3    0 [crypto]
   4    0 [cam]
   5    0 [zfskern]
   6    0 [rand_harvestq]
   7    0 [mmcsd0: mmc/sd card]
   8    0 [mmcsd1: mmc/sd card]
   9    0 [mmcsd1boot0: mmc/sd]
  10    0 [audit]
  11    0 [idle]
  12    0 [intr]
  13    0 [geom]
  14    0 [sequencer 00]
  15    0 [usb]
  16    0 [mmcsd1boot1: mmc/sd]
  17    0 [pagedaemon]
  18    0 [vmdaemon]
  19    0 [bufdaemon]
  20    0 [vnlru]
  21    0 [syncer]
 101    0 [task: mx25l flash]
Note: it will also show /sbin/init, which isn't a kernel process, but the above command is good enough™

Pin all kernel processes only to CPUs 4 and 5:
Code:
for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do cpuset -l 4,5 -p $i ; done

Show to which CPUs kernel processes are pinned:
Code:
# for i in $(ps ax -e -o pid,ppid | awk '$2 == 0' | awk '{print $1}') ; do procstat cpuset $i ; done
  PID    TID COMM                TDNAME              CPU CSID CPU MASK
    0 100000 kernel              swapper              -1    1 4-5
    0 100009 kernel              softirq_0            -1    2 4-5
    0 100010 kernel              softirq_1            -1    2 4-5
    0 100011 kernel              softirq_2            -1    2 4-5
    0 100012 kernel              softirq_3            -1    2 4-5
    0 100013 kernel              softirq_4            -1    2 4-5
    0 100014 kernel              softirq_5            -1    2 4-5
    0 100015 kernel              if_io_tqg_0          -1    2 4-5
    0 100016 kernel              if_io_tqg_1          -1    2 4-5
    0 100017 kernel              if_io_tqg_2          -1    2 4-5
    0 100018 kernel              if_io_tqg_3          -1    2 4-5
    0 100019 kernel              if_io_tqg_4          -1    2 4-5
    0 100020 kernel              if_io_tqg_5          -1    2 4-5
    0 100021 kernel              if_config_tqg_0      -1    2 4-5
    0 100022 kernel              aiod_kick taskq      -1    2 4-5
    0 100023 kernel              deferred_unmount ta  -1    2 4-5
    0 100024 kernel              in6m_free taskq      -1    2 4-5
    0 100025 kernel              thread taskq         -1    2 4-5
    0 100027 kernel              kqueue_ctx taskq     -1    2 4-5
    ...
Column "CPU MASK" with value of "4-5" means that specific kernel thread can run on CPU cores 4 and 5.
 
Depends; ideally you would build the whole world/kernel with gcc.
I also tested libmd and liblzma, but those made no difference.
I test on armv6 though.
 