Server freezes when using Git to update ports tree

Recently I have started to encounter a phenomenon I don't understand.

I have one virtual server running 13.1-RELEASE-p7 with ZFS, 2 GB RAM and 40 GB of SSD space. Since switching to Git for updating the ports tree, the server enters an unrecoverable freeze whenever I run git -C /usr/ports pull, and I have to power-cycle it to get it back into working order.

Other identically specced instances do not exhibit this behaviour. I can successfully delete the ports tree and clone it again without the server freezing. Smaller updates appear to go through but the combined changes of a couple of days lead to the freeze. From what I can tell it is exclusively this Git process that grinds the server to a halt; no other issues have occurred, and the system runs stable as long as I do not update the ports tree via Git. Removing the repository and using portsnap instead of Git works without issue as well. Updating other large-ish Git repositories poses no problem either; it is just the ports repo.

There is nothing in /var/log/messages, and the remote console shows an unresponsive, frozen OS that I cannot interact with.

Does anyone have an idea what may be going on, has encountered the same thing, or has suggestions for troubleshooting this further?
 
running 13.1-RELEASE-p7 with ZFS, 2 GB RAM
Smaller updates appear to go through but the combined changes of a couple of days lead to the freeze.
I suspect you might be running out of memory really quickly when using git, in a way that causes the ARC and your git process to battle for memory, locking everything up until that is resolved.
 
I suspect you might be running out of memory really quickly when using git.
Sounds plausible, and I'd usually suspect the same if not for the fact that other identical instances do not behave like this. Updating is rather slow (which is to be expected with a large repo) but never freezes the machines. Also, wouldn't FreeBSD simply terminate processes taking up too many resources instead of locking up without throwing an error?

Forgot to mention that I have vfs.zfs.arc_max="200M" set in /boot/loader.conf.
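For reference, the setting and a way to confirm it actually took effect (the arcstats sysctl name is from memory, so double-check it):
Code:
# /boot/loader.conf
vfs.zfs.arc_max="200M"

# runtime checks: configured cap and current ARC size
sysctl vfs.zfs.arc_max
sysctl kstat.zfs.misc.arcstats.size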
 
if not for the fact that other identical instances do not behave like this.
Swap configured? Maybe ARC has been limited on the working system?

Also, wouldn't FreeBSD simply terminate processes taking up too many resources instead of locking up without throwing an error?
ARC can sometimes release memory a bit slowly, so from the point of view of the kernel there's no shortage yet. ARC and MySQL are notoriously bad at this and you really need to limit ARC to allow MySQL to grow. Otherwise it can be a bit of a slugfest between the two.

If you're that strapped for memory on that system, you might want to try net/gitup instead.
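Roughly like this (from memory, so double-check the sample configuration the package installs):
Code:
pkg install gitup
# adjust /usr/local/etc/gitup.conf if needed; it ships with a "ports" section
gitup ports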
 
Swap configured? Maybe ARC has been limited on the working system?

Yes and yes, ARC has been limited on all systems, including the non-working one.
Code:
Device          1K-blocks     Used    Avail Capacity
/dev/da0p2        2097152        0  2097152     0%
Code:
vfs.zfs.arc_max: 209715200

ARC and MySQL are notoriously bad at this and you really need to limit ARC to allow MySQL to grow.
I have been bitten by this working with PF and large tables on systems of all memory configurations.
 
Well, all I can say is that I wouldn't even attempt to build from ports on a system that low on memory. Why aren't you simply using packages? If you need specific options (or default versions), then I would do all the building on a big system at home.
 
Well, all I can say is that I wouldn't even attempt to build from ports on a system that low on memory. Why aren't you simply using packages? If you need specific options (or default versions), then I would do all the building on a big system at home.
I'm using packages for almost everything. I just need some Nginx modules that are not part of the default package. To the best of my knowledge you cannot mix-and-match package sources without major issues. Having to build every single required resource to get a consistent index, including multi-hour Rust builds for minor updates, is rather inconvenient and overkill. Plus, not everyone (e.g. me) has access to a beefy machine, especially when considering the current energy prices (in Germany). For the moment, I just pkg lock nginx and update it separately via ports. If you know of a better update process, I'm quite open to suggestions.
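Concretely, the current workaround amounts to something like this (a sketch, with www/nginx as the one port built locally):
Code:
# keep "pkg upgrade" from replacing the locally built nginx
pkg lock -y nginx

# when nginx itself needs updating: unlock, rebuild from ports, lock again
pkg unlock -y nginx
make -C /usr/ports/www/nginx reinstall clean
pkg lock -y nginx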

Then again, I'm not having issues with building the ports I need; the issue is with using Git to update the ports tree in the first place. Ideally, I'd like to understand what's going on, to be in a position to troubleshoot similar issues in the future and simply to learn.
 
Having to build every single required resource to get a consistent index, including multi-hour Rust builds for minor updates, is rather inconvenient and overkill.
That's not how poudriere works, and with the -devel version you can actually tell it to use the official packages when it's appropriate.
Code:
     -b branch
              Fetch binary packages from a binary package repository instead
              of building them.  The branch argument can be one of the
              following: latest, quarterly, release_X (where X is the minor
              version of a release, e.g., “0”), or url.
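In practice that boils down to something like the following; the jail and ports tree names are just placeholders, and -b needs ports-mgmt/poudriere-devel:
Code:
# build the custom nginx locally, let the rest come from the official repository
poudriere bulk -j 13_1amd64 -p default -b latest www/nginx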

Plus, not everyone (e.g. me) has access to a beefy machine, especially when considering the current energy prices (in Germany).
You don't need a "beefy" machine. My build server is an old Core i5-3470 (2 cores, 4 threads) with 16GB of memory and it has been building all my packages just fine. Saves me from doing it on my VPS directly.
 
That's not how poudriere works, and with the -devel version you can actually tell it to use the official packages when it's appropriate.

I was under the impression that there is a long-standing bug preventing this from working reliably in practice. The documentation says it should work but last time I checked, the setting was broken. Is this issue indeed finally fixed? Can you confirm that you are successfully using Poudriere in this manner? If so, that’d be perfect.

My build server is an old Core i5-3470 (2 cores, 4 threads) with 16GB of memory

The only machine I have access to that’s not my laptop is a Celeron 2-core with 8 GB of RAM that can do little else while compiling software via Poudriere. It takes upwards of 7 hours at full load to compile the dreaded Rust, if it even finishes, just to in turn compile ripgrep. If I just needed to compile Nginx while effectively doing a download and pass-through on the other packages, that would be no issue though.

The Poudriere approach would also save ~2 GB of space on the VMs by removing the ports tree.
 
I consider the original hang still unexplained. Tend to blame hardware.

If it was just git memory, then the OOM killer should have kicked in. ZFS buffering? That's a bit more complicated, but even if you trigger a kernel panic, the machine should reboot, not hang. Unless it is (by accident?) configured to sit in ddb forever on a panic.
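For what it's worth, whether a panic ends in a reboot or in ddb should be visible from a couple of sysctls (names from memory):
Code:
sysctl debug.debugger_on_panic       # 1 means: drop into ddb on panic
sysctl kern.panic_reboot_wait_time   # delay before the automatic reboot after a panic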
 
I consider the original hang still unexplained. Tend to blame hardware.
Same here. Since we're speaking of a virtual machine sitting in a big datacenter, I cannot quite imagine a hardware issue, especially when it only occurs in this specific scenario. Still, wouldn't be the first time unexplained issues occur due to flaky hardware somewhere in a setup.

All VMs use the stock GENERIC kernel, so accidental misconfiguration would have to be a global issue. A kernel panic, especially when not leading to a reboot, would dump a panic message on the console and drop into the kernel debugger, correct? Just verified that it is in fact enabled via debug.debugger_on_panic: 1.

However, there's nothing. The machine does not react to (simulated) input at all. External monitoring graphs for the VM go flat the moment the freeze happens: no CPU, disk or network activity.

I can quite reliably recreate the issue every couple of days, whenever the ports tree changes are large enough. Is there anything I can do to provide additional, potentially useful information?
 
A kernel panic, especially when not leading to a reboot, would dump a panic message on the console and drop into the kernel debugger, correct?
Could be a deadlock (or a livelock). That won't produce a panic.
 
Same here. Since we're speaking of a virtual machine sitting in a big datacenter, I cannot quite imagine a hardware issue, especially when it only occurs in this specific scenario. Still, wouldn't be the first time unexplained issues occur due to flaky hardware somewhere in a setup.

All VMs use the stock GENERIC kernel, so accidental misconfiguration would have to be a global issue. A kernel panic, especially when not leading to a reboot, would dump a panic message on the console and drop into the kernel debugger, correct? Just verified that it is in fact enabled via debug.debugger_on_panic: 1.

However, there's nothing. The machine does not react to (simulated) input at all. External monitoring graphs for the VM go flat the moment the freeze happens: no CPU, disk or network activity.

I can quite reliably recreate the issue every couple of days, whenever the ports tree changes are large enough. Is there anything I can do to provide additional, potentially useful information?

Do you have access to the console and hence to the debugger?

I would run `vmstat 1` into a logfile just to rule out memory overload. You do have swap space, right?
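Something along these lines would do; the log path is just an example, and keep an eye on the terminal too, since the last lines may never make it to disk if the box dies:
Code:
# in one session: capture memory/paging stats
vmstat 1 | tee /var/log/vmstat-git.log

# in another session: trigger the update
git -C /usr/ports pull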

Can't you just ask the hosting provider to move the VM to a different host? Maybe it is a hanging SSD. Those things happen and don't show up in SMART.
 
ZFS buffering? That's a bit more complicated, but even if you trigger a kernel panic, the machine should reboot, not hang.
I've had (very rare) hangs with ZFS in the past under memory pressure. My assumption is that it wasn't actually related to ZFS but to encrypted swap: the weird case where GELI needs to allocate memory while the pager is trying to page out in order to make that memory available, which of course requires GELI to make progress ... but I can't prove that. All I can say is that it hasn't happened again since I removed swap encryption on that machine.
 
I would run `vmstat 1` into a logfile just to rule out memory overload. You do have swap space, right?

Just did that. Even a minor ports tree update affects the server.

Code:
procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr da0 cd0   in   sy   cs us sy id
 0  0  0  972284  885616   526   0   0   0   534   21   0   0   14   958   402  1  0 99
 0  0  0  972284  885616   162   0   0   0     0   10   0   0    5   495   263  0  0 100
 1  0  0  961396  885616    64   0   0   0   121   10   0   0    2   427   240  1  0 98
 2  0  0  961396  885616    12   0   0   0     0   11   0   0    1   425   255  0  0 100
 1  0  0  961396  884896   287   0   0   0   288   10   0   0   22  2708   513  2  1 98
 1  0  0  972284  884896   638   0   0   0   559   11  99   0   73  3467   840  1  1 97
 1  0  0  994668  881296  7213   0   0   0  7032   18  12   0   21 13730  1487  4  2 94
 0  0  0  984324  881296   213   0   0   0   121   20   0   0   21   648   368  1  0 98
 0  0  0  995468  880580   905   0   0   0   770   20   0   0   22  4810  1064  1  1 98
 2  0  0  995468  880580    28   0   0   0     0   20   0   0   11   688   381  0  0 100
 1  0  0  997612  880340   586   0   0   0   465   22  90   0   82  2912   900  2  1 97
 0  0  0  984580  880340    86   0   0   0   232   20   0   0   23  3282   645  1  1 98
 3  0  0  984580  880340    16   0   0   0     0   20   0   0    1   417   257  1  0 99
 1  0  0  984580  880340    13   0   0   0     0   22   0   0    2   452   263  1  0 99
 1  0  0  995468  880340   351   0   0   0   273   20   0   0    1   693   263  0  0 100
 1  0  0  995468  880340    29   0   0   0     0   20  87   0   49   469   618  0  0 100
 0  0  0  984580  880580   147   0   0   0   121   20   0   0    0   440   252  1  0 99
 1  0  0  984580  880580    13   0   0   0     0   22   0   0    1   410   240  0  0 99
 1  0  0  984580  880580    13   0   0   0     0   20   0   0    2   471   275  0  1 99
 0  0  0  995468  880580   351   0   0   0   273   20   0   0    0   688   227  0  0 100
procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr da0 cd0   in   sy   cs us sy id
 1  0  0  995468  880580    66   0   0   0     0   22  77   0   52   747   676  2  0 97
 4  0  0  984580  880580   171   0   0   0   121   20   2   0   27  1551   636  2  1 98
 1  0  0 2397684  798340 20276   0 138   0  1053   36   3   0   25  4608   675  3  3 94
15  0  0 1292728  779540 28089   0  34   0 18822   40 4874   0 2831 46640 42806 16 36 48
 1  0  0 1281840  692824    77   0   0   0   121   44 5758   0 3541 53788 657592  2 69 29
 4  0  0 1281840  553720    16   0   0   0     0   40 899   0  544  4907 4735515  0 55 45
 0  0  0 1281840  315928    16   0   0   0     0   40 1218   0  882  8867 4741313  0 54 46
 1  0  0 1281840  119704    14   0   0   0     0   40 938   0  675  6668 4579946  2 52 47
 0  0  0 1281840   68480    16   0   0   0 36517   40 855   0  622  6844 5082931  1 53 46

To me, this indeed looks like running out of memory fast. After the last entry the server froze.

Still, it puzzles me why the system enters a permanently unresponsive state instead of killing the process, and why it works on identically specced instances with similar load and memory usage. No amount of waiting resolves this; I have tried that before. The system is dead the moment this state occurs.
 
If memory is needed by the ZFS subsystem, swap won't do much if you don't have any memory hogs that can be killed or paged out.
I have a system with ZFS and 1 GB RAM that does not run anything besides sshd, which I use as a remote backup target.
Large zfs receives would fail randomly with out-of-memory errors, swap or no swap.
In the end I exported the volume via iSCSI and attached the pool elsewhere with more memory.
The problem was worse on 12.x; however, it never locked up, panicked, or the like.
 
Yeah, "sr" is scan rate (or looking for pages) and it shoots up in the last line. That could indicate a system in trouble, although I do not know the threshold in absolute numbers that are normal or not.

Interestingly it survives the actual swapout phase ("po") that occurred a couple seconds before.

I don't think this is telling us much more. But you could run the same vmstat on the other machines when they do the git update so that we can compare.
 
But you could run the same vmstat on the other machines when they do the git update so that we can compare.

Nice. Now the Git update freezes at least one other machine as well. Here's the log until it froze; exact same behaviour as with the other one:

Code:
procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr da0 cd0   in   sy   cs us sy id
 1  0  1 3706140  404460   257   7   2   1   297  222   0   0   14   523  1080  1  0 98
 0  0  1 3706180  404208   867   0   0   0   620   88   1   0    8   862   311  1  0 99
 0  0  1 3695340  404208    87   0   2   0   121   80   6   0    5   576   284  0  0 100
 1  0  1 3684452  404208    68   0   0   0   121   80   1   0    8   580   273  1  0 99
 0  0  1 3684452  404208    59   0   0   0     0   88  96   0   61   503   932  1  1 98
 0  0  1 3684452  404208    25   0   0   0     0   80   3   0   29   948   484  2  0 98
 0  0  1 3931160  324520 17658   0 182   0  1319  100 359   0  368  4369 13516  3  6 91
 1  0  1 3989980  285500  4228   0 582   0  2245  110 782   0  761  6552 32760  8  9 84
 0  0  1 4003232  180156 20036   0 1348   0 13678  111 3711   0 2206 19193 212330 22 26 52
13  0  1 4003232  136456    24   0   0   0   308  120 4633   0 2427 27905 1739013  2 69 30
 0  0  1 4003232  132312    18   0   0   0     0  120 1733   0 1033 13194 1288023  1 57 42
 0  0  1 4003232   38480    13 6722   0   0 21248 12960 860   0  583  6416 3906963  0 54 46
 1  0  1 4014120   48944   366 22682   5 425 46292 175029 1247   0  675  6284 3519775  1 56 43
 1  0 22 4014496   22788   216 1441   2 1462 34097 64835 2200   0  935  5946 3501290  0 57 43
 0  0 22 4014120   33904   365 2612  11 3592 40292 92607 4258   0 1436  3912 3589029  0 59 41
 0  0 24 4014120    6948    83 469   0 11461 28796 83227 11643   0 3371  3972 4884678  0 59 40
 1  0 24 4003232   67620    72  12   0 4575 45183 491539 10317   0 4081 35153 2453979  2 71 27
 0  0 25 4014032   20048   233  69  14 295  7492 14183 684   0  331  2849 1778101  0 55 44
 1  0 25 4014416   23272   345  99  16 3086 34285 551092 3635   0 1317  4707 5012763  0 62 37
 
Similar pattern: it survives a pageout burst but afterwards goes into a high scan rate and dies, although this one survived higher scan rates.

Can you possibly test this on UFS2?
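If there's no spare volume at hand, a file-backed memory disk should be enough for the experiment (size and md unit number below are just placeholders):
Code:
truncate -s 4g /var/ports-ufs.img
mdconfig -a -t vnode -f /var/ports-ufs.img -u 9
newfs -U /dev/md9
mount /dev/md9 /usr/ports
# then clone the ports tree onto the fresh UFS filesystem and retry the update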
 
While you wait for a solution to the problem, here is a script to update just what you need in the ports tree.
Once your problem is solved, you will probably need to run git restore first.

Bash:
#!/bin/sh

fetch_file() {
    local file="$1"

    mkdir -p ${PORTSDIR}/${file%/*}
    fetch -o "${PORTSDIR}/${file}" "${URL}${file}?h=${BRANCH}"
}

fetch_dir() {
    local dir="$1" file

    for file in $(fetch -qo- "${URL}${dir}" | grep -Eo "${dir}/[[:alnum:]._-]+/?")
    do
        case ${file} in
        */) fetch_dir ${file%/} & ;;
        *)  fetch_file ${file} ;;
        esac
    done
    wait # fetch_file
}

# USAGE
# BRANCH=2023Q1 sh update_port.sh www/nginx
# PORTSDIR=/my/tree sh update_port.sh www/nginx
# sh update_port.sh www/nginx

: ${PORTSDIR:=/usr/ports}
: ${BRANCH:=main}
URL=https://cgit.freebsd.org/ports/plain/

# Upgrade build depends
# pkg upgrade -y

# Update $PORTSDIR/Mk - Add $PORTSDIR/Templates $PORTSDIR/Keywords if needed
# We don't need all files in there but to make sure what we need is up to date, we take the all directory
job=0
for f in $(find "${PORTSDIR}/Mk" -type f); do
    fetch_file "${f#${PORTSDIR}/}" &
    job=$((job + 1))
    if [ ${job} -eq 8 ]; then # 8 is fine
        wait
        job=0
    fi
done

for origin; do
    fetch_dir ${origin}
done
 
Yes, that is what I have in mind. It would be interesting to rule out ZFS, or not.

Just tested an update with the UFS volume. It went through without freezing the server for now. Will continue testing and logging.

Code:
procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr da0 da1   in   sy   cs us sy id
 1  0  0 1003116  127260   723  61   1   0   824  313   0   0   85  1692   951  4  1 96
 0  0  0 1026168  125112  1124   0   0   0   510  165  96   0   75  2073   943  1  1 98
 0  0  0 1003116  125604  3212   0   0   0  3754  152   0   0   67  5980   907  5  2 93
 0  0  0 2394772   57284 21020   0 113   0   854  169  17 104  146  3875  1322  4  3 93
 6  0  0 2394152   59336 26109 312  19   0 19441 1776   8 1109  724 21702  4755 11  8 81
 2  0  0 2414280   54228  1808  11   0   0 12634  397   2 10179 7100 49604 31971  2 90  9
12  0  0 2436492   77976 16700  28   0   0 36497 6088  65 7913 5995 51173 25110  3 97  0
 7  0  0 2476704   41100 15607   0   0   0 18981  180 120 8180 6085 51630 25286  5 95  0
 9  0  0 2470472   48788 10768   1   0   0 21362 2365  24 8013 5920 47188 24413 12 88  0
 2  0  0 2490760   59596 33094 195  21   0 41721 13561   6 3232 2575 39828 12341 36 34 30
 2  0  0 1002824  133152 21745   0   5   0 40107  152  28   3   35 14416  1577  8  3 88
 0  0  0  991936  128616    79   0   0   0   120  165  91   0   54   469   660  1  0 98
 0  0  0  991936  128616    11   0   0   0     0  150   0   0    3   416   258  0  0 99
 0  0  0  991936  128616    15   0   2   0     0  150   7   0   26  1085   645  0  0 99
 1  0  0 1002824  128492   364   0   0   0   272  150   0   0   15   736   351  0  0 100
 2  0  0 1002824  127604   572   0   0   0   589  165  51   0   94  4693  1169  1  0 99
 
Another update went smoothly on the UFS filesystem:

Code:
procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr da0 da1   in   sy   cs us sy id
 1  0  0 1028784  148364   720  36   1   0   804  215   0   0   58  1464   734  2  1 97
 0  0  0 1062724  146308  1105   0   0   0   439   44   0   0   15  1570   354  1  1 98
 1  0  0 1104532  140188  4395   0   0   0  4106   40   0   0   61  5821   754  7  1 92
 1  0  0 2431620   81592 20969   0   0   0  1017   60   0   1   50  7290   800  5  3 91
 0  0  0 2458216   79508   985   0   0   0   754   60 105 171  274  2123  2126  3  3 94
 0  0  0 2426012   81092   288   0   0   0   935   60   0 188  200  1248  1621  4  0 96
20  0  0 1407848   73232 60882   0   0   0 56737   62   0 158  166 31367  3660 24 10 66
20  0  0 1407848   72072    98   0   0   0    44   70   0 210  196 70289  1020  1 99  0
15  0  0 1390048  108752  5656   0   0   0 16459   68   0 1003  903 58775  6446  4 96  0
 6  0  0 1410784   85592 13536   0   0   0 11532   73  76 3897 3362 61177 20300 16 84  0
 1  0  0 2498880   69992 24722   0  11   0 24895   67   0 3388 2868 43845 16055 22 48 30
 1  0  0 1028784  164560 24332   0  31   0 46527   50   0 158  135 13481  2010 15  6 79
 1  0  0 1028784  164560    10   0   0   0     0   40   0   0   10   491   319  1  0 99
 0  0  0 1039672  164744   506   0   0   0   270   44   0   0   11  1231   457  2  0 97
 0  0  0 1039672  164744    21   0   0   0     0   40  99   0   70  1229   936  1  2 97

On a hunch, I ran zfs set compress=off zroot/usr/ports on the second machine that froze, removed everything and cloned the repository again, to compare. Before, compression was set to lz4, the installation default. Interestingly, when updating the repo later, it was slower than on UFS but still didn't freeze the machine:

Code:
procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr da0 cd0   in   sy   cs us sy id
 0  0  1 3308184  279356   279   8   8   5   377  910   0   0   70  1524 11603  3  1 96
 1  0  1 3308184  279356    78   0   0   0     0   70  75   0   42   420   516  1  0 99
 2  0  1 3297296  279356   111   0   0   0   121   70   0   0    5   541   237  3  0 97
 1  0  1 3297296  279356    36   0   0   0     2   70   0   0    5   475   235  0  0 100
 1  0  1 4710192  213116 21915   0   0   0   905   94  19   0   44  4183   681  4  2 93
 1  0  1 4724256  186888 17623   0  40   0  2420   92   1   0   29  4134   910 14  5 82
 1  0  1 4704828  125040  9815   0  25   0 16284   90 4074   0 2296 41292 973398  3 61 35
 0  0  1 4715716   91908   355   0   0   0 33867  100 728   0  539  7424 4620375  1 56 43
 0  0  1 4715716   55156    20 24307   0   0 28348 58737 742   0  489  5842 3485908  0 55 45
 0  0  1 4704828   47860   115 2516   0   0 33558 36015 688   0  454  5231 3044237  0 54 45
 5  0  1 4704828   50252    10 9405   0 183 30102 99483 844   0  439  4740 3872455  1 56 43
 0  0  1 4704828   55324    10 7522   0 824 38761 52709 1654   0  676  5921 3782050  0 55 45
 3  0 30 4715628   24292   253 1221  13 1803 28159 56368 2452   0  868  5038 3963197  0 56 44
 0  0 30 4715716   64036   282  56  77 1805 40266 460096 2408   0  955  4386 4344376  0 59 41
 2  0 31 4715716   41072    23  22   6 877 31367 314480 4711   0 2129 27947 3396464  2 63 35
 5  0 29 4704828  147904   278  53  55 326 30353 18428 12677   0 7030 88741 220799  1 98  1
 0  0 29 4704828   66080   464   0 394   0     8   66 2982   0 2885 14695 707077  8 87  5
 0  0 29 4738244   47704  5359   0 178   0  3498   70 4316   0 4350 17900 154334  6 15 79
 1  0 22 4698276  942104  5099   0 2254   0 263071   82 1904   0 1913   945 93499  8 17 75
 1  0 18 3200964 1030544  3423   0 113   0 25113   40 143   0  145  2360  6037  2  2 96
procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr da0 cd0   in   sy   cs us sy id
 0  0 17 3200964 1024984   102   0  29   0     3   40 251   0  204   420  4818  0  2 98
 1  0 17 3200964 1024984    11   0   1   0     0   44   1   0    4   349   260  0  0 100

I'll keep monitoring the situation but for now the evidence appears to point towards ZFS as the possible source of trouble, specifically the compression.
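For anyone wanting to compare on their own systems, the relevant dataset settings can be pulled with something like:
Code:
zfs get compression,recordsize,compressratio zroot/usr/ports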
 