Server freezes when using Git to update ports tree

no, current is not running on that machine, which is also amd64 btw

or:
Code:
> freebsd-version -kru ; uname -aKU                             
13.2-RELEASE-p10
13.2-RELEASE-p10
13.2-RELEASE-p10
FreeBSD green.sau.si.pri.ee 13.2-RELEASE-p10 FreeBSD 13.2-RELEASE-p10 #0 releng/13.2-n254661-a839681443b6-dirty: Thu Feb 15 00:01:47 EET 2024     root@green.sau.si.pri.ee:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 1302001 1302001


current is only running in my qemu (which reminds me of some funny compiler issues there) and on an allwinner h3 board. i just meant that git, without any config tweaks, is able to update smaller repos without anything running out of memory. i'm surprised that the git userland process is not what allocates a lot of ram. other remote repos are barely 1m or so, though i also use git for my own code and configs. but i admit ports is the biggest

if it's absolutely required, i could let git bring my machine down again while capturing resource usage stats remotely, but i would prefer it to stay up and not flap around all the time
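
if it does come to that, one low-impact way to capture the stats remotely could be to stream them over ssh while the pull runs; output buffering may swallow the last second or two before the hang, and the hostname here is just a placeholder:
Code:
# on the 4g box: paging/memory stats once per second, logged on another host
vmstat 1 | ssh otherhost 'cat > vmstat-git.log'
# in a second session: periodic snapshots of the biggest resident processes
while :; do top -b -o res 15; sleep 5; done | ssh otherhost 'cat > top-git.log'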
 
… i tried setting arc max …

If you undo that setting, then does there remain any non-default value for a vfs.zfs.⋯ sysctl?
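
For reference, the limit in question is normally set as a loader tunable or a runtime sysctl; something like this (the 1 GB figure below is purely illustrative, not a recommendation):
Code:
# /boot/loader.conf
vfs.zfs.arc_max="1073741824"

# or at runtime; on 13.x the same knob also appears as vfs.zfs.arc.max
sysctl vfs.zfs.arc_max=1073741824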

13.2-RELEASE-p10 …

I wonder whether the issue will be reproducible with 13.3 beta and 14.0-RELEASE.



cracauer@ maybe <https://freshbsd.org/freebsd/src?q=vnlru&committer[]=Mateusz+Guzik+(mjg)> 2023-10-10 onwards. At least some of those are not in 13.2-RELEASE, for example:


I associate those commits with fixes for excessive use of CPU (not of memory) by Git for ports.

<https://mail-archive.freebsd.org/cgi/mid.cgi?95e9f65b-58c4-ef3a-a5d1-b794179e4252> | <https://lists.freebsd.org/archives/freebsd-current/2023-September/004528.html> kernel 100% CPU, and ports-mgmt/poudriere-devel 'Inspecting ports tree for modifications to git checkout...' for an extraordinarily long time
 
no, i don't have any other tunables, see:
Code:
10:30,ketas@green:~> cat /boot/loader.conf
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
cryptodev_load="YES"
zfs_load="YES"

kern.vty="sc"

coretemp_load="YES"
acpi_hp_load="YES"
mac_portacl_load="YES"
10:30,ketas@green:~> cat /etc/sysctl.conf                                         
# $FreeBSD$
#
#  This file is read when going to multi-user and its contents piped thru
#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
#

# Uncomment this to prevent users from seeing information about processes that
# are being run under another UID.
#security.bsd.see_other_uids=0
vfs.zfs.min_auto_ashift=12

net.inet.ip.portrange.reservedhigh=0
security.mac.portacl.rules=uid:53:tcp:53,uid:53:udp:53,gid:7000:tcp:325,gid:7001:tcp:326,gid:7002:tcp:327,gid:7003:tcp:328,gid:7004:tcp:19,uid:65534:tcp:80,uid:65534:tcp:325,uid:65534:tcp:327
10:31,ketas@green:~>
 
no dedup. that would be insane with 4g ram anyway? and the benefits were questionable too, from what i read

i could take the limits off and let the machine fail again, and let vmstat run until sshd is killed off. it goes really fast. but from the top run, i saw that swap was not used. all the memory was wired, so there was nothing to swap anyway; there was only some minimal swapping. as others said, they are more familiar with things swapping "normally", i mean the kind of thing you can replicate with sh -c 'while true; do a="$a$a$a.aaa"; done'

here, i don't know what happens. is this a bug? just allocating memory is not a bug, even if it's the kernel doing it. even allocating and running out is not a bug. but here i don't know. limits? automatic limits not working for just 4g of ram? people with 64g of ram could fit the entire ports tree in there without any issues

but still, is this something that needs a fix? i don't even know what it is, as i'm not familiar with what zfs actually does

It is very weird that you have no pageouts.

What git tree are you testing, specifically? What operation?
 
For a full new clone of ports, resident memory goes to just above 2 GB. Then you have the buffer cache and potentially the ZFS cache on top.

But absolutely, the memory that you see in "RES" is swappable. So I continue to be puzzled that you have no swapout in vmstat.

Both kernel caches should adapt to RAM size and availability, with the exception that the ZFS cache is fixed according to a calculation at startup time.
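
A quick, read-only way to see what that startup calculation arrived at, and how big the ARC currently is (sysctl names as on 13.x with OpenZFS; vfs.zfs.arc.max reads 0 when it is left at the automatic default):
Code:
sysctl vfs.zfs.arc.max kstat.zfs.misc.arcstats.c_max kstat.zfs.misc.arcstats.size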

Can you try your git operation on UFS? Even on a USB device of some kind. I wonder whether you get some pathological behavior in ZFS. Also what @grahamperrin points out above.

Code:
  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
34230 cracauer     21  20    0  2652M  2037M uwait   44  23:32 681.93% git
 8087 cracauer      2  21    0   102M    87M STOP    28   0:09   0.00% emacs-29.2
 8066 cracauer      2  20    0    95M    81M STOP    20   0:02   0.00% emacs-29.2
 8377 cracauer      2  20    0    95M    80M STOP    38   0:02   0.00% emacs-29.2
 1830 cracauer      1  20    0    23M    10M select  23   0:04   0.01% sshd
 1828 root          1  40    0    23M    10M select  28   0:00   0.00% sshd
 1811 root          1  20    0    22M  9584K select  42   0:00   0.00% sshd
34224 cracauer      1  20    0    22M  7604K piperd  21   0:05   0.00% git
 1623 ntpd          1  20    0    23M  7472K select  43   1:47   0.01% ntpd
34143 cracauer      1  20    0    17M  6612K select  13   0:01   0.17% tmux
 6534 cracauer      2  20    0    24M  5620K select  36   1:10   0.00% gpg-agent
 6535 cracauer      2  68    0    20M  5544K select  19   0:02   0.00% scdaemon
 1831 cracauer      1  20    0    14M  5488K wait    20   0:01   0.00% bash
 
could do usb maybe, but can i do ufs on zfs, i wonder? zfs would still store the data, but it would not directly store the files, assuming the problem is in the latter
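
for what it's worth, a ufs-on-a-zvol test is possible without extra hardware; something along these lines (pool/dataset names and the size are placeholders, and the data still goes through zfs underneath, so a real usb disk would be the cleaner test):
Code:
zfs create -V 40G gold/ufstest           # carve out a zvol
newfs -U /dev/zvol/gold/ufstest          # put UFS with soft updates on it
mount /dev/zvol/gold/ufstest /mnt
git clone https://git.freebsd.org/ports.git /mnt/ports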

also, what about those suggestions to disable compression (it's lz4 there)?

also, how could i test this without running out of memory? if i could somehow limit it to just a bit below the point where it fails
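
one way to keep a test run from taking the whole box with it might be to cap the userland side with limits(1), though if the growth is really in wired kernel memory this won't touch that part (the 2g value is arbitrary):
Code:
# cap the address space of just this git invocation
limits -v 2g git -C /usr/ports pull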

here's top again too:
Code:
last pid: 16826;  load averages:  0.24,  0.22,  0.17        up 0+14:15:07  09:01:09
621 threads:   7 running, 576 sleeping, 38 waiting
CPU 0:  0.0% user,  0.0% nice, 15.5% system,  1.8% interrupt, 82.7% idle
CPU 1:  1.2% user,  0.0% nice, 19.0% system,  0.0% interrupt, 79.8% idle
Mem: 124M Active, 388K Inact, 182M Laundry, 3540M Wired, 10M Free
ARC: 410M Total, 137M MFU, 100M MRU, 2218K Anon, 2482K Header, 168M Other
     35M Compressed, 204M Uncompressed, 5.77:1 Ratio
Swap: 16G Total, 370M Used, 16G Free, 2% Inuse, 220K In, 68M Out

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        155 ki31     0B    32K RUN      1 785:25  79.41% [idle{idle: cpu1
   11 root        155 ki31     0B    32K RUN      0 781:51  79.19% [idle{idle: cpu0
    6 root         30    -     0B  1376K CPU1     1   2:18  14.02% [zfskern{arc_evi
    8 root        -16    -     0B    48K pwait    1   0:12  12.81% [pagedaemon{dom0
    9 root        -16    -     0B    16K psleep   1   0:00   3.87% [vmdaemon]
    8 root        -16    -     0B    48K RUN      1   0:00   2.03% [pagedaemon{laun
 2898 openvpn      20    0    20M  5160K select   0   2:10   0.89% /usr/local/sbin/
   12 root        -88    -     0B   608K WAIT     0   0:09   0.81% [intr{irq28: ahc
   12 root        -88    -     0B   608K WAIT     0   0:09   0.77% [intr{irq27: ahc
    0 root        -16    -     0B  2192K -        0   1:28   0.55% [kernel{z_rd_int
   12 root        -100    -     0B   608K WAIT     0   2:35   0.50% [intr{irq20: hp
16687 ketas        20    0    17M  5512K CPU0     0   0:00   0.39% top -aHSPs1
    6 root         -8    -     0B  1376K zio->i   0   0:17   0.38% [zfskern{txg_thr
 3044 asterisk     20    0   200M    17M select   1   2:06   0.33% /usr/local/sbin/
    8 root        -16    -     0B    48K umarcl   1   0:00   0.30% [pagedaemon{uma}
 3044 asterisk     20    0   200M    17M uwait    0   1:39   0.28% /usr/local/sbin/
    0 root        -76    -     0B  2192K -        1   0:47   0.27% [kernel{if_io_tq
 1443 root        -16    -     0B    16K pftm     1   1:22   0.24% [pf purge]
 2625 bind         20    0   235M    49M kqread   1   0:11   0.17% /usr/local/sbin/
    0 root        -12    -     0B  2192K RUN      1   1:27   0.16% [kernel{z_wr_iss
 5437 ketas        20    0    76M    20M select   1   0:59   0.15% irssi{irssi}
16753 root         20    0   300M    34M RUN      0   0:00   0.13% /usr/local/libex
 4829 nobody       20    0    46M  9924K select   1   0:24   0.13% poe-daemon main
13592 ketas        20    0    15M  4008K select   1   0:00   0.12% tmux: server (/t
16826 root        -52   r0  2748K  2228K zio->i   0   0:00   0.11% sh -c date
16753 root         20    0   300M    34M zio->i   0   0:00   0.11% /usr/local/libex
16753 root         20    0   300M    34M zfsvfs   0   0:00   0.10% /usr/local/libex
 3044 asterisk     20    0   200M    17M nanslp   1   0:32   0.10% /usr/local/sbin/
16753 root         20    0   300M    34M zio->i   0   0:00   0.09% /usr/local/libex
 3044 asterisk     20    0   200M    17M select   1   0:31   0.09% /usr/local/sbin/
16753 root         20    0   300M    34M zfsvfs   0   0:00   0.09% /usr/local/libex
16753 root         20    0   300M    34M zio->i   1   0:00   0.09% /usr/local/libex
 5532  10300       20    0   123M  8312K select   0   0:23   0.08% /usr/local/sbin/
16753 root         20    0   300M    34M zfsvfs   0   0:00   0.07% /usr/local/libex
16753 root         20    0   300M    34M zfsvfs   1   0:00   0.07% /usr/local/libex
 2625 bind         20    0   235M    49M kqread   1   0:13   0.07% /usr/local/sbin/
    0 root         -8    -     0B  2192K zfsvfs   0   3:43   0.06% [kernel{arc_prun
16753 root         20    0   300M    34M zio->i   1   0:00   0.06% /usr/local/libex
16753 root         20    0   300M    34M zio->i   0   0:00   0.06% /usr/local/libex
   12 root        -60    -     0B   608K WAIT     0   0:18   0.06% [intr{swi4: cloc

the names are cut off but those are likely the git ones. ram is 4g. wired is huge. this is the last top screen before the kill. it lagged a bit, just like a system that suddenly starts swapping, but i couldn't switch to another tmux window to ^c the git; it was already too late

it very likely wanted to go beyond 4g wired, which it couldn't. funnily, when this happens and i don't let the watchdog panic or reset, the kernel keeps running there

so i do indeed seem to run out of kernel memory, i assume. and something is doing it
 
out of curiosity i used
sysutils/zfs-stats

that gave a lot of zfs-related values; i looked at them:

Code:
vm.kmem_size: 4045062144
vm.kmem_size_max: 1319413950874

both are over my ram, i think, or not?:

Code:
hw.physmem: 4167294976
Code:
real memory  = 4299161600 (4100 MB)
avail memory = 4028526592 (3841 MB)

i could run top with that sorting, but whenever i run git without limits the machine goes down fast. maybe i could run git with some limits that are just about right

also, git worked before; it suddenly stopped working. i don't think a major change happened in fbsd, definitely not in -p*; it was git that got upgraded

but should git be able to do this? and should the kernel be able to allocate all the ram? i saw the kernel allocating ram up to the point where less than 400m was free. i doubt many machines need that

i also don't know if i can limit kmem. if it becomes full, will it panic? or just return an error? and an error to whom? to zfs? to git?

right now, whatever git does, if i configure git with

Code:
> cat /usr/local/etc/gitconfig
[core]
        packedGitWindowSize = 128m
        packedGitLimit = 1g
        preloadIndex = false
[diff]
        renameLimit = 16384

that helps me
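
as a side note, the same settings can be tried per invocation, without touching the system-wide gitconfig, which makes it easy to compare runs with and without them:
Code:
git -c core.packedGitWindowSize=128m \
    -c core.packedGitLimit=1g \
    -c core.preloadIndex=false \
    -C /usr/ports pull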

also, i wonder if i can do those tests in a vm, but it would be even worse since the ram there would obviously be <4g. it's annoying that every test brings the machine down. after i traced the issue to git, i don't really want to deliberately bring it down. i guess i could take another box and let it fail, if that's needed to trace the problem down

the whole problem here is the size of the ram, i guess. or half of the problem. many machines have loads of ram; even if git allocates a ton, and zfs too, they won't fail

but should it fail at all? even with little memory?

by googling, i found a lot of issues with git, and not only on fbsd. people say git eats resources; i knew that before too. but i assumed it would just use 100% cpu, swap whatever it needs, and therefore simply be slow on this machine

actually, git is not the only thing that takes it down. i can't recall if i ran git at the same time, but even before git itself started taking the machine down, i also have this tool here which scans images. there are over 1t of images, about 5mb each. if i read exif from all of them as fast as cpu and disk allow, it also seems like i'm on the edge. i never thought to look at how much wired was used back then

so it's not just git. i know zfs should not be used with low memory, but should it just fail? it might just as well allocate 128g or 1t, whatever ram your machine happens to have

i find it really hard to understand what happens internally, but i have a feeling that this should not happen

with a vm, unless virtualization screws it up, it would be easier to capture, but it looks like i ran out of memory. not surprising, as wired was 3.6g out of 4g ram

you tell me if that should happen

it's also clearly not just my problem, as i didn't start this topic

funnily, this escapes the arc limits. maybe there should be more zfs limits. or kmem limits. how many machines are there that don't need sshd or a shell running? or other things, like init

in the end, it looks like zfs brings the machine down

i have a vague idea that this is some cache. but would limiting the cache just make the fs slow?

it seems like zfs just leaks all the kernel memory away. or should i say zfs just consumes every bit of ram, and userland gets nothing

or is it outside of zfs? i find that area really complex too. the generic (v)fs cache?
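
for the "where does the wired memory go" question, a few read-only commands might narrow it down next time without bringing anything down (nothing here is specific to this problem, just the standard tools):
Code:
vmstat -m     # kernel malloc-type usage; the kmem script later in the thread sums this
vmstat -z     # UMA zones; ARC buffers, vnodes, znodes and friends show up here
sysctl kern.maxvnodes vfs.numvnodes vfs.freevnodes   # vnode pressure from all the small files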

unless someone else gets there first, i could take another physical machine and use it solely to help make this problem disappear in fbsd

maybe some test could be written here? iirc zfs has tests? i'm unsure what git does, but it seems to perform this job very well. lately git has managed to take the machine down on every pull of ports main

now if we could turn this into some deliberate test. maybe take the ram size of the machine and just go over it somehow, just to see what happens

i kind of refuse to believe that i'm the only person in the world hitting this. actually, there were others earlier. and zfs should be in wide use, on fbsd too. are all those people tuning their systems? or being really careful? why does git trigger it? funnily, i think there are others too

i'm wondering how to test it on this machine so that kmem won't run out, just to see. or i don't know. i find kernel internals hard to grasp. fs is black magic, still is. i'm just pulling ideas out of my ass here

also, i kind of don't want to run commands that lead to a known crash here. maybe someone else can help; this is just a generic low-ram amd64 box. or i could eventually put up an actual test machine, which i maybe should already have, considering the first fbsd i installed was 4.6 and i've been running it ever since. meanwhile i had other things to do and that caused me to lose some of my home lab; the hw just failed, eh. right now i have just one good machine here. it also runs the network and has to work, hence the reluctance to do tests right now, unless they are non-destructive

btw, i had an old 10.x machine here on which i managed to permanently corrupt zfs. i hope that bug is fixed now? after many unclean shutdowns, it just panics 100% of the time. i haven't tried taking its pool and importing it into a newer machine yet to see if it's ok. i didn't know this could happen at all. wasn't zfs the thing where fsck is not needed? but there it felt like a zfs-fsck was needed. or was it just a kernel bug. strange, eh. i hope those issues are gone now? catastrophic loss of a pool?

so in the end, if you wait, i could get another machine and test things there. but isn't it faster if you replicate it on your own, if you can? i don't have exotic hw

btw, even though this is just a c2d with 4g ram, zfs here seems reasonably fast. i recall zfs being bad on low-end hw before, so i can't really complain. just this unexpected issue bothers me
 
Your situation is not normal. While I don't have a 4 GB machine right now, it should work.

I still think you need to do these things:
- try on UFS (USB device or whatever)
- show git in top. I don't care what you sort by, but your top output didn't show it
- show top output when idle. If your 4 GB are all wired up in idle, the machine is effectively not capable of running any application on top
 
this is idle:
Code:
> top -HSPb -o res
last pid: 55529;  load averages:  0.23,  0.32,  0.31  up 4+08:17:58    10:24:37
622 threads:   4 running, 579 sleeping, 39 waiting
CPU 0: 21.1% user,  0.0% nice,  4.3% system,  0.1% interrupt, 74.5% idle
CPU 1: 27.6% user,  0.0% nice,  4.3% system,  0.2% interrupt, 67.9% idle
Mem: 1033M Active, 947M Inact, 43M Laundry, 1692M Wired, 141M Free
ARC: 479M Total, 190M MFU, 98M MRU, 1610K Anon, 3735K Header, 186M Other
     65M Compressed, 280M Uncompressed, 4.32:1 Ratio
Swap: 16G Total, 467M Used, 16G Free, 2% Inuse

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
55124 root         20    0  1944M   626M uwait    1   1:40   0.39% qemu-system-arm{qemu-system-arm}
55124 root         20    0  1944M   626M select   1   0:04   0.00% qemu-system-arm{qemu-system-arm}
55124 root         20    0  1944M   626M uwait    1   0:00   0.00% qemu-system-arm{qemu-system-arm}
55124 root         29    0  1944M   626M sigwai   1   0:00   0.00% qemu-system-arm{qemu-system-arm}
54932 root         20    0    81M    67M ttyin    0   0:01   0.00% csh
 2647 bind         20    0   280M    57M kqread   1   0:54   0.00% named{isc-net-0000}
 2647 bind         20    0   280M    57M kqread   0   0:45   0.00% named{isc-net-0001}
 2647 bind         20    0   280M    57M kqread   0   0:34   0.00% named{isc-net-0002}
 2647 bind         20    0   280M    57M kqread   0   0:30   0.00% named{isc-net-0003}
 2647 bind         20    0   280M    57M uwait    1   0:02   0.00% named{named}
 2647 bind         20    0   280M    57M uwait    1   0:01   0.00% named{isc-timer}
 2647 bind         20    0   280M    57M uwait    0   0:00   0.00% named{named}
 2647 bind         52    0   280M    57M sigwai   1   0:00   0.00% named{named}
43825 root         30    0    75M    53M ttyin    0   0:01   0.00% csh
 5575 ketas        20    0   112M    43M select   1   7:21   0.00% irssi{irssi}
 5575 ketas        20    0   112M    43M select   1   0:00   0.00% irssi{gmain}
55374 ketas        38    0    53M    37M pause    0   0:00   0.00% csh
55240 ketas        20    0    53M    37M pause    0   0:01   0.00% csh
Code:
> top -b -o res                                       
last pid: 55627;  load averages:  0.39,  0.32,  0.30  up 4+08:21:14    10:27:53
201 processes: 1 running, 200 sleeping
CPU: 24.4% user,  0.0% nice,  4.3% system,  0.1% interrupt, 71.2% idle
Mem: 1033M Active, 949M Inact, 43M Laundry, 1693M Wired, 138M Free
ARC: 480M Total, 188M MFU, 102M MRU, 896K Anon, 3735K Header, 186M Other
     65M Compressed, 280M Uncompressed, 4.31:1 Ratio
Swap: 16G Total, 467M Used, 16G Free, 2% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
55124 root          4  20    0  1944M   627M select   0   1:47   0.49% qemu-system-arm
54932 root          1  20    0    81M    67M ttyin    0   0:01   0.00% csh
 2647 bind          8  52    0   280M    59M sigwai   1   2:47   0.00% named
43825 root          1  30    0    75M    53M ttyin    0   0:01   0.00% csh
 5575 ketas         2  20    0   112M    44M select   0   7:21   0.00% irssi
55374 ketas         1  38    0    53M    37M pause    0   0:00   0.00% csh
55240 ketas         1  20    0    53M    37M pause    1   0:01   0.00% csh
54922 ketas         1  45    0    53M    37M pause    1   0:01   0.00% csh
43813 ketas         1  47    0    53M    36M pause    0   0:00   0.00% csh
53075 ketas         1  32    0    53M    33M pause    1   0:01   0.00% csh
99339 ketas         1  20    0    28M    17M select   1   0:07   0.00% tmux
31762 asterisk     66  20    0   196M    16M select   0  21:25   0.00% asterisk
 2845 root          1 -52   r0    13M    13M nanslp   0   0:02   0.00% watchdogd
 5837  10300        8  20    0   127M    12M select   0   4:05   0.00% bitlbee
55017 ketas         1  20    0    21M  9624K select   0   0:00   0.00% sshd
 8972 mrtg          1  52    0    48M  9404K nanslp   0   5:19   0.00% perl
55013 root          1  23    0    21M  9328K select   0   0:00   0.00% sshd
51978   9000        1  20    0    54M  9028K kqread   1   0:00   0.00% pickup
git -C /usr/ports/ pull
Code:
> top -SPb -o res 30
last pid: 55962;  load averages:  0.39,  0.30,  0.28  up 4+08:29:28    10:36:07
228 processes: 2 running, 225 sleeping, 1 waiting
CPU 0: 21.1% user,  0.0% nice,  4.3% system,  0.1% interrupt, 74.5% idle
CPU 1: 27.6% user,  0.0% nice,  4.3% system,  0.2% interrupt, 67.9% idle
Mem: 1101M Active, 953M Inact, 43M Laundry, 1613M Wired, 148M Free
ARC: 365M Total, 173M MFU, 48M MRU, 1857K Anon, 5722K Header, 137M Other
     30M Compressed, 192M Uncompressed, 6.29:1 Ratio
Swap: 16G Total, 467M Used, 16G Free, 2% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
55124 root          4  20    0  1948M   628M select   0   1:54   0.29% qemu-system-arm
54932 root          1  20    0    81M    67M ttyin    0   0:01   0.00% csh
 2647 bind          8  52    0   280M    60M sigwai   1   2:48   0.00% named
43825 root          1  30    0    75M    53M ttyin    0   0:01   0.00% csh
 5575 ketas         2  20    0   112M    44M select   1   7:22   0.00% irssi
55725 root          1  20    0   258M    38M zio->i   0   0:06   0.88% git
55717 root          1  43    0   258M    38M wait     0   0:00   0.00% git
55374 ketas         1  38    0    53M    37M pause    0   0:00   0.00% csh
55240 ketas         1  20    0    53M    37M pause    0   0:01   0.00% csh
54922 ketas         1  45    0    53M    37M pause    1   0:01   0.00% csh
43813 ketas         1  47    0    53M    36M pause    0   0:00   0.00% csh
53075 ketas         1  20    0    53M    33M ttyin    0   0:01   0.00% csh
99339 ketas         1  20    0    33M    19M select   1   0:07   0.00% tmux
31762 asterisk     66  20    0   196M    17M select   0  21:30   0.29% asterisk
 2845 root          1 -52   r0    13M    13M nanslp   1   0:02   0.00% watchdogd
 5837  10300        8  20    0   127M    12M select   1   4:06   0.00% bitlbee
55954 note8         1  20    0    30M    10M select   1   0:00   0.10% sshd
55017 ketas         1  20    0    21M  9624K select   1   0:00   0.00% sshd
 8972 mrtg          1  52    0    48M  9404K nanslp   0   5:19   0.00% perl
55013 root          1  23    0    21M  9328K select   0   0:00   0.00% sshd
55950 root          1  22    0    21M  9328K select   0   0:00   0.00% sshd
 5030 nobody        1  20    0    46M  9060K select   1   2:51   0.00% perl
51978   9000        1  20    0    54M  9028K kqread   1   0:00   0.00% pickup
 5027 nobody        1  20    0    46M  8844K select   1   2:42   0.00% perl
55007 ketas         1  20    0    21M  8040K select   1   0:00   0.00% sshd
55005 root          1  24    0    21M  7748K select   1   0:00   0.00% sshd
55961 note8         1  20    0    94M  7568K select   1   0:00   0.10% rsync
54978 openvpn       1  20    0    18M  7472K select   0   0:02   0.20% openvpn
55960 note8         1  22    0    21M  6332K zio->i   1   0:00   0.88% rsync
54407   9002        1  20    0    19M  6328K kqread   0   0:00   0.00% imap-login

yes, i see the problem here: the problem i have is not showing up. that's because i run git with the config i got from this thread. but it at least shows that git itself is not allocating much memory, at least right now, and nothing much came in from ports. the thing that really crapped everything up was a big update from ports; i only found that out later. i was able to clone ports onto that very same machine, without any limits, a year ago with whatever git version i had then.

does something in pull do something to zfs? with large changes? something is really wrong and i don't know why or what. if i want to simulate a large update, i wonder how to do it? check out an older branch? unsure

i could try ufs too, but it looks like that specific big git update of ports has now gone through and git maybe works again, so i can't test it right now. but as you see, i'm not alone. it's all weird

yeah, the top i captured right before it ran out of memory sadly didn't have git actually showing up; git was running, but since i was using -aH it indeed didn't show. so that was my mistake. but the problem still exists
 
But you have virtual machines taking up all your RAM. How do you expect to operate multi-GB git trees under those conditions?
 
just one vm, with 1g ram, and i can kill it and it's the same. but swap is below 500m. so it's low on memory but clearly works well enough? i bet that swap is largely qemu too, as it drops to under 200m if i kill it

still, what does git do? what do the git limits actually do?

and what's the relation to zfs here?

ports is 30g indeed, yet i'm able to limit git. i don't know why git itself won't do this, or why fbsd won't
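
for a rough sense of what those core.packedGit* limits are working against, the size of the pack files in the checkout can be listed with:
Code:
git -C /usr/ports count-objects -vH
du -sh /usr/ports/.git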

again, what's the point of running totally dry? maybe there should be a cap somewhere. 80-90% of ram?

like, if 90% of memory is wired, maybe don't allocate more? let the fs be slow instead? because i suspect it's some sort of caching, i don't know. i found that with git you can easily run out of vmem. but wtf is that memory anyway. and it's unswappable, apparently

hell knows. you tell me why git wouldn't simply just swap. everyone knows what swapping looks like, and usually you can save the machine if you catch it. or something just dies, hopefully the offender

but here it's different. funnily, zfs otherwise kind of works, even in low-ram conditions. i watched what arc does: sometimes it goes over 1g, then goes back down. nothing gets killed

but git? fuck knows really. maybe it should suck at low memory. tho i wonder how much it needs?

so yeah, works now

still, i don't know. it feels like both git and the kernel could use new limits. you can't use what you don't have. low ram just makes it worse, i know

and zfs is a beast. some people squeeze zfs onto platforms with 512m ram. probably not with git, but still. old sun engineers spit their coffee out over this, i guess

in the end, does this need a fix? do 14 or 15 already have one? would it be reasonable to reserve some memory solely for userland? is there something i can set now? could it be a default? where does the memory even go? right now i have arc at about 900m and wired at ~2g; where does the other 1g go? similarly, when wired was 3.5g and arc was 500m, where the hell did 3g go, and why?

i have no idea. where does the kernel stuff it all? can't it give it back?

can't it just say i'm sorry dave i'm afraid i can't do this

this is not the first funny story here either. i also had an issue where i lost a swap device from the machine. it came back, but then i had 3 swap devices attached. one of them got its label from another partition: the swap was 8g, the partition was 150g. that one was never used

i guess i could use a vm to fuzz things and then maybe report it. btw, i was told that a mixup between geom devices is impossible, so maybe there's a geom bug

also, it's non-ecc ram, but i checked it. hopefully it's not hw. it would show up somehow, maybe? like, i have no other issues

so, i don't know what to think of it all
 
Well, you have 4 qemu processes all with precisely 626M resident. This might or might not indicate that it is shared memory (possibly forked processes). If it's not shared that is your problem right there.

As you correctly say, one big question is where all the wired memory comes from. The other is why you have no pageout going on although you have swap space, a memory shortage, and resident memory that looks swappable to me.

You will have to do some experimentation to find out:
- gather `top -o res` in single user mode, but after mounting all filesystems read-write (zpool import -a; zfs mount -a)
- gather `top -o res` after exiting your existing RAM users (VM, named etc) until you can make git appear in top -o res

I would do a clean startup commenting out all daemons and VMs, then manually adding them back while observing top.
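
A rough sketch of that single-user run, assuming a ZFS root (adjust to your own pool layout):
Code:
# boot -s, then:
zfs mount -a            # mount the remaining datasets
top -b -o res 20        # twenty largest resident processes
vmstat -s | grep page   # paging counters accumulated so far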

What FreeBSD version did you say did better on this workload?
 
just one vm, but my favorite top commandline is
Code:
top -qaHSPs1
i also ran
Code:
jot 0 | xargs time -h sh -c 'while true; do s="$s${s}s"; done'
as a test, and i got
Code:
55001: Out of space
        5.06s real              2.57s user              2.42s sys
60001: Out of space
        5.04s real              2.47s user              2.52s sys
65001: Out of space
        5.04s real              2.60s user              2.39s sys
so userspace memory alloc works; after that, top looked like this
Code:
last pid: 90412;  load averages:  1.42,  0.90,  0.68  up 4+22:20:59    00:27:38
613 threads:   6 running, 568 sleeping, 39 waiting
CPU 0: 18.8% user,  0.0% nice,  4.2% system,  0.1% interrupt, 76.9% idle
CPU 1: 24.5% user,  0.0% nice,  4.2% system,  0.2% interrupt, 71.1% idle
Mem: 1368M Active, 1325M Inact, 41M Laundry, 1030M Wired, 90M Free
ARC: 316M Total, 152M MFU, 40M MRU, 896K Anon, 2754K Header, 121M Other
     25M Compressed, 167M Uncompressed, 6.61:1 Ratio
Swap: 16G Total, 873M Used, 15G Free, 5% Inuse
i agree that a lot of things run there, but i'm not swapping, the cpu is idle and the load is low too

as for wired, i have no idea. i do have a ton of zfs datasets, and all of them are mounted too
Code:
> zpool status -vv ; zpool list
  pool: copper
 state: ONLINE
  scan: scrub repaired 0B in 07:56:29 with 0 errors on Fri Feb 16 13:18:44 2024
config:

        NAME           STATE     READ WRITE CKSUM
        copper         ONLINE       0     0     0
          gpt/copper0  ONLINE       0     0     0

errors: No known data errors

  pool: gold
 state: ONLINE
  scan: scrub repaired 0B in 08:36:45 with 0 errors on Mon Feb 19 14:11:18 2024
config:

        NAME           STATE     READ WRITE CKSUM
        gold           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            gpt/gold0  ONLINE       0     0     0
            gpt/gold1  ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:32:55 with 0 errors on Mon Feb 19 06:07:24 2024
config:

        NAME          STATE     READ WRITE CKSUM
        zroot         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            gpt/zfs0  ONLINE       0     0     0
            gpt/zfs1  ONLINE       0     0     0

errors: No known data errors
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
copper  3.62T  3.51T   116G        -         -    12%    96%  1.00x    ONLINE  -
gold    10.9T  4.52T  6.39T        -         -     1%    41%  1.00x    ONLINE  -
zroot    141G  58.5G  82.5G        -         -    42%    41%  1.00x    ONLINE  -
Code:
> zfs list | wc -l
     124
maybe that's too many for this box, maybe not

still, it does work. but this is zfs related, i think, and low ram just brings it out? apart from that, the box is weak, but it doesn't do any real work either. yeah, i could do single-user tests later sometime. or maybe i'll figure out what git does. it seems like a good stress test anyway

so i don't know
 
other data
Code:
> ifconfig -l
em0 lo0 bridge0 bridge1 tap4 tap5 tap6 tap10000 tap10001 tap10002 tun20 lagg0 epair3a epair3b epair10000a epair10001a epair10002a epair10003a epair10004a vlan3 vlan4 vlan7 vlan9 vlan40 pflog0 epair10005a
> jls name | wc -l
       6
> ps wwaux | wc -l
     220
there's pf and openvpns there and stuff. it could use more ram. but still, only git does something like this. so, some special fs use? find / from periodic doesn't do anything bad, etc. so maybe my wired question is solved. but zfs is fishy. at least it's better now!
 
Code:
> cat kmem.sh
#!/bin/sh -Cefu


set -Cefu


text="`kldstat | awk 'BEGIN{print\"16i 0\";}NR>1{print toupper($4)\"+\"}END{print\"p
\"}' | dc`"
data="`vmstat -m | sed -Ee '1s/.*/0/;s/.* ([0-9]+)K.*/\1+/;$s/$/1024*p/' | dc`"
total="$(($data + $text))"

echo "text=$text, `echo $text | awk '{print$1/1048576\" MB\"}'`"
echo "data=$data, `echo $data | awk '{print$1/1048576\" MB\"}'`"
echo "total=$total, `echo $total | awk '{print$1/1048576\" MB\"}'`"
> ./kmem.sh                                             
text=39273049, 37.4537 MB
data=242628608, 231.389 MB
total=281901657, 268.842 MB
 
For awareness:
  • page 5, there's a link to progress on reduced memory footprints.
NB if you use gitup for a copy of a repo, you can not then use Git on the same copy.
 
unsure how this helps. i'm not going to fix it myself; even if i find out wtf is inside the vm subsystem, i'm not able to. i'm not even able to read the specific internals without my brain blowing up. i suspected it's some kind of cache, but why does the cache go over the edge? the ufs test is pending because i can't really put it anywhere except a zvol. gitup is too limited, i think. yet.

i wonder if i could trace it. and should it be better? i fixed it for now. also, i'm not alone here. i could get another machine to test on, i guess, if i'm alone with this and no other ways exist. i'm still puzzled by what git does. it can only read and write files, and ask for them to be cached for fast access? should it? should git do that? should the kernel let it? i showed how git looks in top: it's not allocating anything that top would show. it does something else that blows up. i don't see it, i can't find it, and i can't fix it either. i'll let others do that; they have fixed many zfs issues over the years. assuming it's zfs, but what else could it be?

i wonder if a small vm would help here. unsure, maybe it can't be fixed. but then, maybe i should look at what my git config actually does. it's all really difficult. you tell me if it's possible to clamp it down so there is no crash. well, one could say it's a crash, unsure. i actually found a workaround, as you see: git, with those options, no longer does any harm. i don't know what they do on zfs. also, even if i do the test on ufs, i need some sure way to make the issue appear. i still hope there's some fix for this. can't it just deny or refuse to use more memory without running totally out of it? zfs is a famous memory eater, i know. and maybe fixes would hurt performance, but i hope they don't

as i said, i fixed the issue here. i'm looking for a more permanent solution

could it be the many small files? if that could be limited, would it fail or just be slow? yeah, i don't know. there's a reason i leave fs work to others
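
on the many-small-files suspicion: the vnode cache does have a knob, kern.maxvnodes, which can be lowered at runtime. whether that helps with this particular wired-memory growth is untested here, so treat it as an experiment rather than a fix:
Code:
sysctl kern.maxvnodes vfs.numvnodes      # current limit and usage
sysctl kern.maxvnodes=100000             # experimentally lower the limit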
 
So I configured a machine to have 4 GB RAM and did a full git clone of ports. Nothing special happened.

I then configured it down to 512 MB and did the same clone. It didn't go through, but it ended as expected, on a network timeout. The machine stayed more or less responsive throughout; vmstat and top showed significant pageout (and even more pagein).

I think the top item that still puzzles me is that your vmstat showed no pageout. My experiment clearly shows that a single git load pages as expected.
 
i have this feeling that if i remove my git system config, it works again, just as it did before. it was a certain update that pushed it over; it's not the clone. an extra 2g of wired appeared out of nowhere. i have no idea what the worst thing you can do to an lz4-compressed dataset on zfs is, or whether git pull can do it. i would suggest you keep that machine around and try again, unless you figure out what the actual bad thing is. didn't zfs contain tests for that, also? i have no idea either.

there was a certain pull from ports that took the machine down 100% of the time. when i found the limiting git config, there were iirc loads of small changes across tons of files. i sadly didn't keep the hashes, so i don't know how to replicate it. all git can do is read, write, create and unlink files. zfs does cow, so i'm unsure if that "helps" as well. or the compression. i have no idea. maybe i'll figure out how to replicate it on my own.

i have no idea what git does. there must be some file operation that git does, and the system happily performs, until it's: whoops, i don't have any memory left. that doesn't cause anything visibly "bad"; the machine will reply to pings and everything, just everything else is gone. so yeah, sad that you couldn't replicate it. funnily, i'm able to use zfs on this 4g box otherwise; it's slower but generally there are no issues. i wish i could find the culprit, and then we could see if it's fixable or not.

but what about the earlier issues in this thread, and the solutions that fixed them? i can't comment on those; i don't know what they do. it fixed things for me. people up there reported no ufs issues. and it specifically said update, like an intermittent issue; unsure how it "froze" or how much ram there was. so yeah, it's sad if we can't even find the problem here. if this is a sure thing, something that just can't be done with so little ram, i at least want to know what the thing i can't do is, without going really deep into fs dev. oh well. i'm out of ideas here
 