How to make "impossible" memory allocations fail in a sane way?

zirias@ · Jan 3, 2022

Background: working on some service written in C that's using a BTREE database as provided by dbopen(3). This seems to use caching in memory extensively, so if you don't want to force a sync() after every change, you should make sure it's properly closed on exit if you don't want to lose data.

As there's no sane way to recover from OOM, I'm using the xmalloc() paradigm: wrap malloc(3) in a function that just exits on error. Normally, you use abort(3) for that, but then, there's no cleanup code executed. You could attempt to clean up from some SIGABRT signal handler, but that's fragile and cumbersome. So I came up with a different idea: My own "panic" function using longjmp(3) to throw away most of the calling stack, but still execute the final cleanup.
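
A minimal sketch of that idea, not the actual service code. The names panic(), run_guarded(), final_cleanup() and failing_work() are made up for illustration, and the allocation failure is only simulated:

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

static jmp_buf panic_env;   /* filled in by run_guarded() */
static int panic_armed;     /* nonzero while panic_env is valid */
static int cleanup_runs;    /* counts how often the cleanup ran */

/* Hypothetical panic(): drop the call stack, resume in run_guarded(). */
void panic(const char *msg)
{
    fprintf(stderr, "panic: %s\n", msg);
    if (panic_armed)
        longjmp(panic_env, 1);
    abort();                /* no cleanup point available */
}

/* malloc(3) wrapper that never returns NULL to its caller. */
void *xmalloc(size_t size)
{
    void *p;

    assert(size > 0);       /* malloc(0) may legitimately return NULL */
    p = malloc(size);
    if (p == NULL)
        panic("out of memory");
    return p;
}

/* Stand-in for the real cleanup, e.g. closing the database. */
void final_cleanup(void)
{
    cleanup_runs++;
}

/* Simulates an allocation failure deep inside the service. */
void failing_work(void)
{
    panic("simulated allocation failure");
}

/* Runs work(); final_cleanup() executes whether work() returns or panics.
 * Returns 0 on normal completion, 1 after a panic. */
int run_guarded(void (*work)(void))
{
    assert(work != NULL);
    if (setjmp(panic_env) == 0) {
        panic_armed = 1;
        work();
        panic_armed = 0;
        final_cleanup();
        return 0;
    }
    /* Reached via longjmp() from panic(). */
    panic_armed = 0;
    final_cleanup();
    return 1;
}
```

Calling run_guarded(failing_work) from main() would return 1 with the cleanup executed once. Anything allocated between the setjmp() and the panic is simply abandoned, which is acceptable only because the process is about to exit anyway.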

This works perfectly fine when just simulating an allocation error. But trying to get a real one, I had to learn malloc(3) just won't fail, even when trying to allocate more than your physical RAM + swap. Instead, the OOM killer will wreak havoc randomly killing large processes when you attempt to use that memory that doesn't really exist.

So, I came across the vm.overcommit sysctl. tuning(7) has the following to say about it:

Code:

     Setting bit 0 of the vm.overcommit sysctl causes the virtual memory
     system to return failure to the process when allocation of memory causes
     vm.swap_reserved to exceed vm.swap_total.  Bit 1 of the sysctl enforces
     RLIMIT_SWAP limit (see getrlimit(2)).  Root is exempt from this limit.
     Bit 2 allows to count most of the physical memory as allocatable, except
     wired and free reserved pages (accounted by vm.stats.vm.v_free_target and
     vm.stats.vm.v_wire_count sysctls, respectively).
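
For reference, a small sketch that only reads the sysctl and decodes the bits described in the excerpt above. It assumes vm.overcommit is exported as an int; the enum and function names are invented for this example, and on non-FreeBSD systems the read simply reports "unavailable":

```c
#include <assert.h>
#include <stddef.h>
#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Bit meanings as described by tuning(7); the names are made up here. */
enum {
    OVERCOMMIT_FAIL_WHEN_SWAP_EXCEEDED = 1 << 0, /* fail the allocation */
    OVERCOMMIT_ENFORCE_RLIMIT_SWAP     = 1 << 1, /* honour RLIMIT_SWAP */
    OVERCOMMIT_COUNT_PHYSICAL_MEMORY   = 1 << 2  /* RAM counts as allocatable */
};

/* Nonzero if bit 0 is set, i.e. allocations can fail up front. */
int strict_accounting(int overcommit)
{
    assert(overcommit >= 0);
    return (overcommit & OVERCOMMIT_FAIL_WHEN_SWAP_EXCEEDED) != 0;
}

/* Returns the current vm.overcommit value, or -1 if it cannot be read. */
int read_overcommit(void)
{
#ifdef __FreeBSD__
    int value = 0;
    size_t len = sizeof(value);

    if (sysctlbyname("vm.overcommit", &value, &len, NULL, 0) == 0 &&
        len == sizeof(value))
        return value;
#endif
    return -1;
}
```

The sketch deliberately does not write the sysctl.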

Therefore, I tried sysctl vm.overcommit=1. A second later, my kernel (13.0-RELEASE-p4) panicked:

Code:

kernel:
syslogd: last message repeated 1 times
kernel: Fatal trap 12: page fault while in kernel mode
kernel: cpuid = 2; apic id = 02
kernel: fault virtual address     = 0x18
kernel: fault code                = supervisor write data, page not present
kernel: instruction pointer       = 0x20:0xffffffff80ca2596
kernel: stack pointer             = 0x28:0xfffffe00deaccb20
kernel: frame pointer             = 0x28:0xfffffe00deaccb80
kernel: code segment              = base 0x0, limit 0xfffff, type 0x1b
kernel:                   = DPL 0, pres 1, long 1, def32 0, gran 1
kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
kernel: current process           = 2317 (chrome)      
kernel: trap number               = 12
kernel: panic: page fault
kernel: cpuid = 2
kernel: time = 1641225063
kernel: KDB: stack backtrace:
kernel: #0 0xffffffff80c58a85 at kdb_backtrace+0x65
kernel: #1 0xffffffff80c0b461 at vpanic+0x181
kernel: #2 0xffffffff80c0b2d3 at panic+0x43
kernel: #3 0xffffffff8108c1b7 at trap_fatal+0x387
kernel: #4 0xffffffff8108c20f at trap_pfault+0x4f
kernel: #5 0xffffffff8108b86d at trap+0x27d
kernel: #6 0xffffffff81062f18 at calltrap+0x8
kernel: #7 0xffffffff80ca138b at shm_truncate+0x5b
kernel: #8 0xffffffff80c77fa1 at kern_ftruncate+0xa1
kernel: #9 0xffffffff8108cabc at amd64_syscall+0x10c

(yes, there are two empty log lines from the kernel ...)

What's happening here? And is there a sane way to tell FreeBSD to fail on memory allocations that can obviously never be fulfilled?

zirias@ · Jan 3, 2022

Not sure this move makes sense. Although I came across this problem while doing "userland programming", it's actually a question about the base system (the kernel). Can we have a cross-post?

SirDice · Jan 3, 2022

I still consider it to be a userland programming question. It may have some overlap with FreeBSD Development (Kernel development, writing drivers, coding, and questions regarding FreeBSD internals). Certainly not a base OS "General" question.

zirias@ · Jan 3, 2022

So, what if I just want some other daemon, not written by me, to fail when allocating too much memory instead of triggering the OOM killer some time later?

My usecase is my own programming here, yes, but the question is about the vm.overcommit sysctl (and why setting it to 1 results in a kernel panic)...

covacat · Jan 3, 2022

setting it to 1 works for me but i tried on a mostly idle system
i speculate that it is a bug in shm stuff that caused it to panic when changing 0->1
i have a 1GB with zfs where i send snapshots / offsite backup
sometimes zfs receives bombs with out of memory and i tried to create a shit tool to pressure the memory allocator to shrink arc/other wired memory
to my surprise i could allocate 8GB without problems (even with calloc)
then i found out about vm.overcommit
setting it to 1 had some effect (i could reclaim some memory) but never panic-ed

zirias@ · Jan 3, 2022

Thanks covacat, this at least confirms this panic is "unexpected". Maybe worth a PR? It looks like it can hit a process while in a syscall as well (and did so quite promptly here), and I kind of doubt this is intended...

covacat · Jan 3, 2022

I think it's worth a PR
can you try to see if it bombs if you set it to 1 in sysctl.conf ?

shkhln · Jan 3, 2022

Zirias said:
Therefore, I tried sysctl vm.overcommit=1. A second later, my kernel (13.0-RELEASE-p4) panicked:

This exactly matches my experience, you don't want to touch that at all.

obsigna · Jan 3, 2022

Do atexit(3) handlers not work in case the daemon is stopped by your wrapped malloc?

In my daemons, I use the atexit mechanism exactly for this purpose. One of my daemons installs 6 atexit handlers, and all of them get called on normal return from main() as well as by any exit() on errors, and among these are many OOM error paths.

C:

...
if (initDAQ())
{
   atexit(freeDAQ);
   ...
   ...
   if (SSL_thread_setup())
   {
      atexit(SSL_thread_cleanup);
      ...
      ...
      if (initDatabases() && initPotentiostat())
         atexit(resetPotentiostat);
         ...
         ...
         /* starting the continuous measurement threads */
         atexit(stopMeasurements);
         ...
         ...
         /* open urandom */
         atexit(urandom_close);
         ...
         ...
         /* instantiate the inline calculator */
         atexit(calculator_release);
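
A condensed, self-contained sketch of the same pattern. The handler names and xmalloc_or_exit() are hypothetical and not taken from the daemon above. Handlers registered with atexit(3) run in reverse order of registration when the process leaves through exit(3) or a return from main(); they do not run after abort(3), _exit(2) or a fatal signal:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cleanup steps; a real service would close its database here. */
void close_database(void)
{
    fputs("closing database\n", stderr);
}

void stop_workers(void)
{
    fputs("stopping workers\n", stderr);
}

/* Registers the handlers; returns 0 on success, -1 on failure. */
int install_cleanup_handlers(void)
{
    if (atexit(close_database) != 0)    /* registered first, runs last */
        return -1;
    if (atexit(stop_workers) != 0)      /* registered last, runs first */
        return -1;
    return 0;
}

/* malloc(3) wrapper that leaves through exit(3), so the handlers still run. */
void *xmalloc_or_exit(size_t size)
{
    void *p;

    assert(size > 0);       /* malloc(0) may legitimately return NULL */
    p = malloc(size);
    if (p == NULL) {
        fputs("out of memory\n", stderr);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

This only helps if the allocation failure is actually reported to the process; it does nothing against being killed from outside.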

zirias@ · Jan 3, 2022

Ok, thanks, so I'm not the only one.

I think I'll test covacat's suggestion first as soon as I find the time to reboot into a potentially unstable system, cause this was my thought as well: maybe there's just a bug making it dangerous to change it in-flight

mark_j · Jan 4, 2022

The (hybrid) demand paging used by FreeBSD always allows overcommits to memory. This is, by its nature, what virtual memory is.

Overcommitment of memory presumes, though, that there is sufficient backing store to provide the virtual address with a physical address to get memory should it actually need be used (from RAM or indirectly by swapping out something to disk).

When you set vm.overcommit to 0, you're telling the vm_map(9) subsystem to not worry about reserving swap space for this and other processes. (This is something inherited from Mach, if I recall correctly).
When you set vm.overcommit to 1:
If there isn't enough backing store to cover the allocation of virtual memory, then no more processes can be created. Eventually the system is just going to panic.

Funnily (sadistically), mmap(2) used to have a MAP_NORESERVE option to achieve this. (In fact a quick check shows Linux still does: https://man7.org/linux/man-pages/man2/mmap.2.html . Go figure!)

Edit: typing on tablet means i miss lines-such a small window to work with.. Apologies.

zirias@ · Jan 4, 2022

mark_j said:
The (hybrid) demand paging used by FreeBSD always allows overcommits to memory. This is, by its nature, what virtual memory is.

If you call it overcommit to allow more pages than would fit in physical memory at the same time, then yes. I'm talking about allowing more than could ever be backed, including swap...

mark_j said:
When you set vm.overcommit to 1, you're telling the vm_map(9) subsystem to not worry about reserving swap space for this and other processes. (This is something inherited from Mach, if I recall correctly).

This doesn't sound quite right. I understood it the other way around from the manpage. But maybe the manpage isn't correct or I don't understand it correctly? (I quoted the relevant text in my first post...) – with this sysctl set to 0 (the default), you can successfully allocate an amount of memory larger than your physical RAM and swap together...

What I want is malloc() (which, IIRC, uses mmap() internally) to fail when requesting an amount that couldn't be backed. Is there a way to have that?

covacat · Jan 4, 2022

it should be 1, you understood correctly

Ambert · Jan 4, 2022

I am new to FreeBSD, but here is how I would do it.

To detect an attempt to allocate more memory with malloc() than the size of physical RAM + swap, I would compute that size myself by requesting the relevant information from the operating system, and then I would monitor the size occupied by my process (excluding shared libraries), to see if it grows beyond the sum of physical RAM and swap.
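
A rough sketch of the first half of that suggestion, computing the total backing store. It assumes sysconf(_SC_PHYS_PAGES) is available (it is a common extension, not required by POSIX) and that FreeBSD exports vm.swap_total as a 64-bit value; if that size assumption is wrong, the swap figure falls back to 0. All function names are invented for this example:

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/* Physical RAM in bytes, or 0 if it cannot be determined. */
uint64_t physical_ram_bytes(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return (uint64_t)pages * (uint64_t)page_size;
}

/* Configured swap in bytes; 0 if unknown or none. */
uint64_t swap_total_bytes(void)
{
#ifdef __FreeBSD__
    uint64_t value = 0;
    size_t len = sizeof(value);

    /* Assumption: vm.swap_total is a 64-bit sysctl. */
    if (sysctlbyname("vm.swap_total", &value, &len, NULL, 0) == 0 &&
        len == sizeof(value))
        return value;
#endif
    return 0;
}

/* RAM plus swap, the most that could ever be backed at once. */
uint64_t total_backing_store_bytes(void)
{
    uint64_t ram = physical_ram_bytes();
    uint64_t swap = swap_total_bytes();

    assert(ram <= UINT64_MAX - swap);   /* the sum must not wrap */
    return ram + swap;
}

/* Nonzero if a single request of this size could be backed in principle. */
int could_ever_be_backed(uint64_t request)
{
    return request <= total_backing_store_bytes();
}
```

This is only a plausibility check for one request in isolation; it ignores memory already used by this and other processes.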

However, if I was in your shoes, I would also monitor the amount of free memory still available, and adapt the frequency of my calls to sync() accordingly. For instance, if the amount of free memory is greater than 2 GiB, I would call sync() every 10 minutes. Otherwise, if the amount of free memory is between 0.5 and 2 GiB, I would call sync() every 3 minutes. And if the amount of free memory is less than 0.5 GiB, I would call sync() every minute and every time I call malloc().

shkhln · Jan 4, 2022

There is MADV_PROTECT, but it doesn't seem to be appropriate here, its purpose is mostly in keeping a few key daemons (like init) alive.

Anyway, you can never depend on your exit code being called. There are too many ways a process could go down: the operating system could crash altogether, PSU fail, etc. This should be taken into account, which likely means calling sync on a fixed (configurable) interval. It should not depend on the amount of free memory or anything like that. It's probably a good idea to put some kind of limit on the database size, though.

zirias@ · Jan 4, 2022

Well, I'm not interested in workarounds as of now. Of course, you can never 100% make sure your daemon (or the system it's running on) doesn't crash, therefore my plan is to add explicit sync() calls on any "sensitive" change (while just normal content won't trigger a sync).

But: OOM should be a condition that can be handled at least with a clean exit. And this would work if I'd ever get an error (null returned) from malloc() et al. So, for now, I'm back to this sysctl supposed to control "overcommit" behavior in FreeBSD

shkhln · Jan 4, 2022

Zirias said:
But: OOM should be a condition that can be handled at least with a clean exit. And this would work if I'd ever get an error (null returned) from malloc() et al. So, for now, I'm back to this sysctl supposed to control "overcommit" behavior in FreeBSD

The thing is, if some process consumes an amount of memory you failed to predict, this is already a problem. OOMs should never happen in normal operation.

covacat · Jan 4, 2022

you get null from malloc with overcommit=1
didn't try the other bits but 1 works

ralphbsz · Jan 5, 2022

There is another aspect to this that has not been discussed explicitly. When you call malloc() or any of its friends (like sbrk() and mmap()), you don't actually get any memory. All that really happens is that your address space is adjusted, so you can actually start using more memory in those new address ranges. What does "using" in the above sentence mean? When you first touch an address in that new address range, a page fault will occur, and the page fault handler will actually give you a physical memory page. To do that, it either has to find a free memory page, evict something else from memory and give you that page (typically that's an already-written file system buffer), or it has to take some other process's address space, write it out to swap, page-protect it (so that other process can't use it for a while), and give you the page.
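
The first-touch behaviour can be observed with a sketch like the following, which maps one anonymous page and asks mincore(2) whether it is resident before and after writing to it. The struct and function names are made up; the exact "before" result depends on the system, so treat it as something to look at rather than rely on:

```c
#define _DEFAULT_SOURCE     /* for MAP_ANON and mincore() on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct touch_result {
    int before;     /* residency before the first write */
    int after;      /* residency after the first write */
};

/* Returns 1 if the page is resident, 0 if not, -1 on error. */
int page_is_resident(void *page_aligned_addr, size_t page_size)
{
    unsigned char vec = 0;

    /* The cast covers both the Linux and the FreeBSD prototype. */
    if (mincore(page_aligned_addr, page_size, (void *)&vec) != 0)
        return -1;
    return vec & 1;
}

/* Maps one page, records residency around the first touch, unmaps it. */
int touch_demo(struct touch_result *out)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page;
    char *p;

    assert(out != NULL);
    if (ps <= 0)
        return -1;
    page = (size_t)ps;
    p = mmap(NULL, page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    out->before = page_is_resident(p, page);
    *(volatile char *)p = 1;    /* first touch: triggers the page fault */
    out->after = page_is_resident(p, page);
    munmap(p, page);
    return 0;
}
```

On a typical system the page is reported as not resident right after mmap() and as resident after the write.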

And that's one of the wonderful contradictions of malloc(): It doesn't actually have to fail. It can just pretend to give you memory, under the (very reasonable) assumptions that most programs that allocate memory will never actually use it. Linux is quite famous for malloc() hardly ever failing (except for ulimit-style settings). Instead, it gives you the illusion of memory, and when you try to use it, you'll get a segfault at a random place in your code, where you can't put error handling. Sure, you could set up a signal handler for SIGSEGV or SIGBUS, but what useful action can that signal handler take? In particular since most seg faults are not caused by running out of memory, but by coding bugs? And in particular since the signal handler can't actually do anything productive (like create more memory out of nothing, or check and repair all data structures the program has in memory).
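
As an illustration of the "ulimit-style settings" mentioned above: a per-process address-space limit does make malloc() return NULL for oversized requests. This is a different mechanism from vm.overcommit, it caps one process rather than accounting for what the system can back, and the function names below are invented for the sketch:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Lowers the soft address-space limit of this process.
 * Returns 0 on success, -1 on failure. */
int cap_address_space(rlim_t max_bytes)
{
    struct rlimit lim;

    assert(max_bytes > 0);
    if (getrlimit(RLIMIT_AS, &lim) != 0)
        return -1;
    /* The soft limit may not exceed the hard limit. */
    if (lim.rlim_max != RLIM_INFINITY && max_bytes > lim.rlim_max)
        max_bytes = lim.rlim_max;
    lim.rlim_cur = max_bytes;
    return setrlimit(RLIMIT_AS, &lim);
}

/* Returns 1 if malloc(size) succeeds, 0 if it returns NULL. */
int allocation_succeeds(size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        return 0;
    /* Touch one byte so the compiler cannot elide the allocation. */
    *(volatile char *)p = 0;
    free(p);
    return 1;
}
```

With the limit set to 1 GiB, a 2 GiB request should fail immediately while small requests keep working. The limit has to be chosen by hand and must leave room for the program's own code, stacks and libraries.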

I subscribe to the philosophy: Don't bother handling malloc errors. Instead think about the memory usage of your program, think about what type of computer it is installed on (how much physical + swap is available), and control memory usage yourself. One of the reasons for this attitude is this: there is another reason for segfaults, which is the stack. And while the stack today can get very big (in userspace, not in the kernel), there is no mechanism like malloc to manage stack space. So instead of trying to handle errors, write your code to have fewer errors in the first place. And then, when errors happen ... which they will ...

shkhln said:
Anyway, you can never depend on your exit code being called. There are too many ways a process could go down: the operating system could crash altogether, PSU fail, etc. This should be taken into account, ...

Your code will occasionally crash. You can minimize the number of crashes by good engineering, but not eliminate them. To quote an old colleague: In a sufficiently large system, the unlikely will happen all the time, and the impossible will happen occasionally. I once saw a process crash due to a CPU fault (which was correctly reported and logged, the system continued running on the three surviving CPUs). So prepare for your code to crash. As shkhln said, there are standard techniques for that: Write checkpoints, sync your state to permanent (persistent) storage, automatically restart, use deadman timers or liveness checks or deadlock preventers to crash the system if things are wedged. It can even be a good practice to automatically reboot your computer at random times (once a day for example), to make sure your crash-handling code is well exercised. Overall long-term reliability doesn't come from just one aspect (such as malloc), but from taking a whole-system view.

With good automated recovery and handling all other forms of crashes, malloc() problems are just one of the many things

zirias@ · Jan 5, 2022

ralphbsz I wasn't looking for lectures about "good programming" here (one of the reasons I don't think this question belongs in the programming section) but instead for some insight about the configurable behavior of FreeBSD's virtual memory management and why reconfiguring it leads to a kernel panic.

Still nice intro to the general workings of virtual memory, but one thing sticks out:

ralphbsz said:
under the (very reasonable) assumptions that most programs that allocate memory will never actually use it.

How is that ever "reasonable"? If you said "rarely", ok, that's why swapping out pages makes sense. But not use it at all? Why should you ever reserve memory if you'll never write to it? I'd call that ill program design...

All the countless reasons your program could crash aside (you can eliminate intrinsic reasons in theory, but not environmental reasons): Running out of memory is a condition that allows at least a "graceful" exit, if your program would learn about it the moment it tries to reserve memory. vm.overcommit should allow to configure that (as covacat confirmed). And at least in my definition, "overcommitting" here means to allow more reservations than there is total backing store (physical RAM + swap) for all the pages required.

The bad thing about that practice is: Once the system learns it can't map all the currently needed pages to physical RAM any more, the only resort is the OOM killer, randomly killing some large process (so, any process in the system can be affected). A broken program just reserving insane amounts of memory will be able to bring down other processes on the same machine. That's something virtual memory was originally designed to avoid.

ralphbsz said:
I subscribe to the philosophy: Don't bother handling malloc errors.

That sounds like a consequence of the behavior of today's systems. Could lead to a vicious cycle: If no application software ever bothers handling the problem, there's no use signalling it. And if there's indeed a lot of software reserving memory it will never use, hoping for that is a somewhat appropriate strategy. But I wouldn't call that "sane", at least it isn't robust.

edit: about stack space, I don't really see a problem with that. As long as you use neither VLAs, stuff like alloca() or recursion, you can guarantee an upper bound for stack usage of your program (and any algorithm can be implemented without these).

shkhln · Jan 5, 2022

Zirias said:
Why should you ever reserve memory if you'll never write to it? I'd call that ill program design...

Mmaping a large file is one such use case. Are you sure dbopen doesn't do this?

covacat · Jan 5, 2022

but files are their own backing store like private swap

shkhln · Jan 5, 2022

I'm not quite sure how it all works, especially with overcommit disabled, however it definitely fits "reserve first, decide what to read/write later" pattern. From the userspace application point of view, of course.

Ambert · Jan 5, 2022

Zirias said:
ralphbsz said:

under the (very reasonable) assumptions that most programs that allocate memory will never actually use it.

How is that ever "reasonable"? If you said "rarely", ok, that's why swapping out pages makes sense. But not use it at all? Why should you ever reserve memory if you'll never write to it? I'd call that ill program design...

[...]

A broken program just reserving insane amounts of memory will be able to bring down other processes on the same machine. That's something virtual memory was originally designed to avoid.

Handling dynamic memory allocations with malloc() makes your program vulnerable to memory fragmentation. If this is not acceptable, a solution is to use mmap() instead, to have better control over the layout of the virtual address space of your process. To avoid fragmentation of the virtual address space, I reserve a huge range of it with mmap(), and then I actually use only a small part of it by writing to some of the pages. If I didn't reserve a huge part of the virtual address space, some functions from another library could surround my contiguous data structure by memory mappings, and it would prevent me from increasing the size of that contiguous data structure when the program needs to allocate more memory. I know that the flag MAP_GUARD exists for that purpose in FreeBSD, but sadly this flag is not portable to other Unix systems. To reserve a range of memory addresses, I mmap them with the protection PROT_NONE. I don't know if such a reserved range is a problem when vm.overcommit is set to 1.
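
A minimal sketch of this reserve-then-use pattern. The function names are hypothetical, and the mprotect(2) step used to make part of the range writable is an assumption about how the pages get enabled, not a description of the code discussed above:

```c
#define _DEFAULT_SOURCE     /* for MAP_ANON on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserves address space only: PROT_NONE pages cannot be read or written.
 * Returns NULL on failure. */
void *reserve_range(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_NONE, MAP_PRIVATE | MAP_ANON, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Makes the first `bytes` of a reserved range readable and writable.
 * Returns 0 on success, -1 on failure. */
int commit_prefix(void *base, size_t bytes)
{
    assert(base != NULL);
    return mprotect(base, bytes, PROT_READ | PROT_WRITE);
}

/* Gives the whole range back to the system. */
void release_range(void *base, size_t bytes)
{
    assert(base != NULL);
    munmap(base, bytes);
}
```

Growing the data structure later means calling commit_prefix() with a larger size; the base address never changes, so no copying is needed.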

I had the same frustration as you when I discovered that malloc() does not fail because of the overcommit thing, so now I use mmap() instead. It gives you much better control over the memory. mmap() is somewhat portable to other BSDs, to Linux and to macOS, but it is not portable to Windows (unless WSL becomes a thing).

zirias@ · Jan 5, 2022

Ambert said:
Handling dynamic memory allocations with malloc() makes your program vulnerable to memory fragmentation. If this is not acceptable, a solution is to use mmap() instead, to have better control over the layout of the virtual address space of your process. To avoid fragmentation of the virtual address space, I reserve a huge range of it with mmap(), and then I effectively use only a small part of it by writing on some of the pages. If I didn't reserve a huge part of the virtual address space, some functions from another library could surround my contiguous data structure by memory mappings

This actually makes sense. Of course, realloc() would still work, but potentially copy large areas of memory...

Still it feels like a workaround for a flawed design. In a perfect world, reserving (virtual) address-space could be clearly separated from reserving actual memory, so the application has a sane way to react when a memory request can't be fulfilled...

BTW, the service I'm currently building has mostly smaller and transient "allocated objects", so using malloc() isn't a problem for me (realloc() is rarely needed and only for not too large objects as well). But I understand your usecase.
