How to make "impossible" memory allocations fail in a sane way?

Does anyone know since when calls to malloc(3) never fail, and who implemented this OOM Killer? I am curious what stuff this developer is smoking, wherever he lives. This also looks like a big security flaw. I would not be surprised if it could easily be exploited, at least for some denial-of-service attacks.

For example, in October/November 2020 I experienced a major hassle with a tiny AWS EC2 instance which had been running an Apache/PHP/MySQL web service perfectly for years. All of a sudden, Apache kept being killed because, for some reason, MySQL ran out of swap space. I found it stupid enough that it killed Apache and not MySQL.

Does this sound familiar?

I was never able to find out what request actually triggered this; I am not even sure whether there was a trigger at all. Perhaps MySQL is simply a leaking memory hog. Finally, I added a second volume to the EC2 instance, 1 GB for swap only, and that solved the problem.

For my daemons, I use a malloc wrapper. It already has some introspection facilities, like total allocations, and I will add a configurable limit which hopefully keeps my daemons below the radar of the OOM Killer. In case one gets killed anyway, a watchdog will restart FreeBSD, because killing the OOM Killer is the only clean solution.
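A minimal sketch of such a wrapper (my own illustration, not obsigna's actual code; the names xmalloc/xfree and the 256 MB cap are made up, and alignment is simplified for brevity) could track a running total of live allocations and fail anything beyond a configurable limit:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical malloc wrapper: tracks the total of live allocations
 * and returns NULL once a configurable cap would be exceeded, instead
 * of relying on the kernel's overcommit/OOM behaviour. */
static size_t g_total = 0;
static size_t g_limit = (size_t)256 * 1024 * 1024; /* example cap: 256 MB */

void *xmalloc(size_t n) {
    if (n > g_limit - g_total)        /* would exceed the self-imposed cap */
        return NULL;
    /* store the size in a header so xfree() can account for it
     * (8-byte header alignment; a real wrapper would use max_align_t) */
    size_t *p = malloc(sizeof(size_t) + n);
    if (p == NULL)
        return NULL;
    *p = n;
    g_total += n;
    return p + 1;
}

void xfree(void *ptr) {
    if (ptr == NULL)
        return;
    size_t *p = (size_t *)ptr - 1;
    g_total -= *p;
    free(p);
}

size_t xmalloc_total(void) { return g_total; }   /* introspection */
```

Because the cap check happens before calling malloc(), an "impossible" allocation fails immediately and deterministically, rather than succeeding on paper and being punished by the OOM killer later.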
 
OK, the linked commit is from 1994, and the commit message fits its age (keyword: floppy situation).
Various changes to allow operation without any swapspace configured.
Note that this is intended for use only in floppy situations and is done at
the sacrifice of performance in that case (in ther words, this is not the
best solution, but works okay for this exceptional situation).

Now, the question remains when somebody decided to make this the general, i.e. not-only-floppy, behaviour. And I cannot see anything in this commit which lets malloc never fail, which was the actual question.

Not at all; OOM kills can only be triggered by exhausting available memory, which would be a DoS situation even without overcommit.
Well, I still feel uncomfortable. The delayed OOM makes me nervous. And why is there a sysctl setting vm.overcommit, which defaults to 0 (I assume this means inactive), while the system happily overcommits up to hundreds of thousands of gigabytes? Fortunately, trying to allocate a petabyte makes the kernel feel uncomfortable as well, and it eventually bails out.
 
To the panic itself: please do open a PR. A panic like the one you shared is not expected to happen under any memory pressure you can create from userspace. The fault at virtual address 0x18 is clearly a bogus kernel address.
I tried to reproduce the bug in several ways while while true; do sysctl vm.overcommit=1 ; sysctl vm.overcommit=0; done was running in the background on 13.0p4, but was not able to trigger anything.

Are you able to reproduce this crash?
 
And I cannot see anything in this commit which lets malloc never fail, which was the actual question.
No, the question this answers is "who implemented this OOM Killer?". (Not exactly the same thing as overcommit, by the way — it's possible to handle OOM situations by simply crashing, for example.) I'm not going to study jemalloc's sources.
 
To the panic itself: please do open a PR. A panic like the one you shared is not expected to happen under any memory pressure you can create from userspace. The fault at virtual address 0x18 is clearly a bogus kernel address.
I tried to reproduce the bug in several ways while while true; do sysctl vm.overcommit=1 ; sysctl vm.overcommit=0; done was running in the background on 13.0p4, but was not able to trigger anything.

Are you able to reproduce this crash?
Try to run some large process that uses shm.
It never panicked for me on a 1 GB system, and I changed it lots of times.
 
...
And why is there a sysctl setting vm.overcommit, which defaults to 0 (I assume this means inactive), and the system happily does overcommiting up to hundreds of thousands of gigabyte? Fortunately, trying to allocate a petabyte makes the kernel feel uncomfortable as well and it eventually bails out.
Answering myself; reading tuning(7) helps:
The vm.overcommit sysctl defines the overcommit behaviour of the vm
subsystem. The virtual memory system always does accounting of the swap
space reservation, both total for system and per-user. Corresponding
values are available through sysctl vm.swap_total, that gives the total
bytes available for swapping, and vm.swap_reserved, that gives number of
bytes that may be needed to back all currently allocated anonymous
memory.

Setting bit 0 of the vm.overcommit sysctl causes the virtual memory
system to return failure to the process when allocation of memory causes
vm.swap_reserved to exceed vm.swap_total. Bit 1 of the sysctl enforces
RLIMIT_SWAP limit (see getrlimit(2)). Root is exempt from this limit.
Bit 2 allows to count most of the physical memory as allocatable, except
wired and free reserved pages (accounted by vm.stats.vm.v_free_target and
vm.stats.vm.v_wire_count sysctls, respectively).
So, vm.overcommit = 0 means that overcommit is completely active. In order to enforce the actual limits, it must be set to bit0|bit1|bit2 = 7, and we must not allocate the memory as root.
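Restating the bit semantics from the tuning(7) excerpt above, a tiny decoder (the struct and field names are mine, purely illustrative, not a FreeBSD API) makes it obvious why 7 combines all three checks:

```c
#include <stdbool.h>

/* Decode the vm.overcommit bit flags described in tuning(7).
 * Struct and field names are illustrative, not from FreeBSD. */
struct overcommit_policy {
    bool enforce_swap_accounting;  /* bit 0: fail when swap_reserved > swap_total */
    bool enforce_rlimit_swap;      /* bit 1: enforce RLIMIT_SWAP (root is exempt) */
    bool count_physical_memory;    /* bit 2: count most physical RAM as allocatable */
};

struct overcommit_policy decode_overcommit(int value) {
    struct overcommit_policy p;
    p.enforce_swap_accounting = (value & 1) != 0;
    p.enforce_rlimit_swap     = (value & 2) != 0;
    p.count_physical_memory   = (value & 4) != 0;
    return p;
}
```

With value 0, none of the checks are enforced, so the system overcommits freely; with 7, all three are active at once.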

So I tried:
# sysctl vm.overcommit=7

Now I check the setting with my test program from this post: https://forums.freebsd.org/threads/...ocations-fail-in-a-sane-way.83582/post-549632

sudo -u rolf ./oomcheck 1 (trying to allocate 1 GB, does work as expected):
1073741824, 0x0000000801200700

sudo -u rolf ./oomcheck 10 (trying to allocate 10 GB, does not work, which is also the expected behaviour, since this system does not have 10 GB):
10737418240, 0x0000000000000000

The user root can still allocate any amount of memory below one petabyte.
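The linked test program is not reproduced here; the core of such a checker (my reconstruction, not the actual oomcheck source) boils down to allocating and then touching the memory, since with overcommit a malloc() can succeed without the pages ever being backed:

```c
#include <stdlib.h>
#include <string.h>

/* Try to allocate `bytes` of really usable memory: malloc() it and
 * touch every page so the kernel actually has to back the allocation.
 * Returns the pointer, or NULL if the allocation was refused. */
void *try_alloc(size_t bytes) {
    void *p = malloc(bytes);
    if (p != NULL)
        memset(p, 0, bytes);   /* fault in the pages */
    return p;
}
```

A main() around this would parse argv[1] as a size in GiB and print the requested size and the returned pointer, matching the "size, pointer" output shown above. With overcommit enabled, it is the memset(), not the malloc(), that can get the process killed.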
 
obsigna, you keep talking about malloc/jemalloc although it's pretty obvious this has nothing to do with it: malloc is just a userspace allocator and memory manager built on top of what the kernel provides for page mappings (nowadays most likely just mmap). If, and only if, malloc needs to get memory from the kernel and that fails, it will return NULL.

Also, I'm not sure you understand all the implications. After discovering that on my desktop machine an astounding 600 GB of swap is "reserved" (with just 8 GB actually available), I'm very reluctant to disable overcommit. I suspect in my case it's chromium reserving all that memory, but then, what does it help? As discussed in this thread, having contiguous virtual address space is a valid requirement, and the only way to get that with the existing APIs, unfortunately, is to reserve a huge chunk of memory *). Obviously there are programs doing just that, probably well aware that systems will overcommit, so "it isn't a problem". Then of course you can't disable overcommit without breaking these programs (or at least having lots of RAM go to waste). A vicious circle... and a huge pile of suck.

Are you able to reproduce this crash?
I didn't try just yet. It happened on the desktop machine I use for my personal dev stuff, and I really need this machine operational, especially in these times of remote working... I will probably do more tests some weekend.

-----
*) oh well, just thinking about another workaround, you might just mmap a sufficiently large file instead. But that sucks as well...
 
Zirias, you don't seem to understand how it helps not to mix up the different domains. I keep talking about malloc(3) because that's the API for us userspace developers, and I expect the API to work as advertised in the respective man page. Specifically, that means a NULL pointer is returned when a memory allocation cannot be fulfilled, and I mean memory that can really be used, not phantom memory that leads to a crash when the program starts writing to it.

OK, here I learned that funny things may happen in kernel space, to say the least. Be assured that I understand all the implications very well, and be assured that I will find out the pros and cons of setting vm.overcommit to 7 on the systems I run FreeBSD on. Beyond this, I won't turn myself into a kernel hacker; I have better things to do. In case it turns out that vm.overcommit is a dead end, I already have plan B.

So, good luck everybody with whatever your plans A, B or 0 are.
 
I keep on talking about malloc(3) because that’s the API for us user space developers, and I expect that the API works as advertised in the respective man page.
Wait, that's what I was doing initially ;) But you brought jemalloc (a concrete implementation) to the table, suggesting it would be at fault here, which it isn't: any malloc implementation will just use the memory it gets from the kernel ;)
In case it turns out that vm.overcommit is a dead end, I already have plan B.
It probably does, at least if you want to use programs that reserve address space by allocating huge chunks of memory. I don't have a "plan B" :( Well, other than accepting the situation as it is. "Fixing" it would require additional kernel APIs, and userspace applications using them correctly...
 
Well, I actually have a working and already implemented plan C, although for reasons other than the ones discussed here.

A group of my daemons are electrochemical measurement controllers, and the potential/current measurements are done with high-speed PCIe DAQ boards from National Instruments. For the high-speed mode, quite large amounts of system RAM need to be provided for DMA. For this, I wrote a kernel module which allocates 256 MB of RAM for AI-DMA and 48 MB for AO-DMA at boot. This reserved memory is no longer visible to other processes; however, by way of ioctls in my kernel module, the whole chunks may be mapped into user space, and this way my program gets access to the waveforms and measurement data belonging to the DMA channels of the DAQ board.

Of course, this would also work without any DAQ board. We let a kernel module reserve the memory and map it into our own user space by an ioctl. From my experience, other programs don't touch it; otherwise the HS measurements would give strange results, like arbitrary spikes, voids and punches in the curves, and they don't. In this case, we need to write our own allocator, which takes the memory out of said private pool.

This way, our daemons can keep themselves below the radar of the OOM killer. Even if we use, say, 1 GB of preallocated system RAM, the OOM killer won't know this, and if we don't allocate much regular space, then there will always be processes which allocated much more and are therefore subject to being killed before our daemon.
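The custom allocator over such a preallocated pool could be as simple as a bump allocator (a toy sketch under my own naming, not the actual daemon code; the pool base would in practice come from the kernel module's ioctl mapping):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy bump allocator over a fixed pool, e.g. memory mapped in from a
 * kernel module via ioctl. There is no per-allocation free();
 * pool_reset() reclaims everything at once. */
typedef struct {
    uint8_t *base;
    size_t   size;
    size_t   used;
} pool_t;

void pool_init(pool_t *p, void *base, size_t size) {
    p->base = base;
    p->size = size;
    p->used = 0;
}

void *pool_alloc(pool_t *p, size_t n) {
    size_t aligned = (n + 15) & ~(size_t)15;   /* 16-byte alignment */
    if (aligned > p->size - p->used)
        return NULL;          /* pool exhausted: fail cleanly, no overcommit */
    void *r = p->base + p->used;
    p->used += aligned;
    return r;
}

void pool_reset(pool_t *p) { p->used = 0; }
```

Since the pool is fixed at boot, pool_alloc() can never over-promise: it either hands out memory that really exists or returns NULL, which is exactly the sane failure mode this thread is asking malloc for.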
 
obsigna I get you're talking about "wired" memory. That's nice and all, but it doesn't solve the generic problem: You can't attach a kernel module to every daemon ;)
 
obsigna I get you're talking about "wired" memory. That's nice and all, but it doesn't solve the generic problem: You can't attach a kernel module to every daemon ;)
Of course, I don't want to attach it to every daemon, but only to mine. If every daemon used it, what would be the benefit for me? That's my plan C after all, for outpacing a questionable system behaviour.
 
As discussed in this thread, having contiguous virtual address space is a valid requirement, and the only way to get that with the existing APIs, unfortunately, is to reserve a huge chunk of memory *). Obviously there are programs doing just that, probably well aware systems will overcommit, so "it isn't a problem". Then of course, you can't disable overcommit without breaking these programs (or at least have lots of RAM go to waste).

What do you mean by "existing APIs"? Do you include mmap() in those APIs? Or just malloc()? Because I think mmap() does allow the reservation of a range of virtual addresses that does not count as allocated memory when the kernel enforces a strict no overcommit policy. Each virtual memory page of (typically) 4 KiB has a status regarding its access (read, write, execute), and I think only memory pages having the WRITE attribute count as allocated memory under a no overcommit policy. You can temporarily turn off the WRITE attribute of the memory pages that you are not currently using.

You can check that out by running the following test program. It takes a size (in GiB) as argument, and calls mmap() to reserve a range of virtual addresses of that size (first with MAP_GUARD, then with PROT_NONE, then with PROT_READ, then with PROT_WRITE), allowing you to check what kind of reservation has an impact on the amount of reserved swap on your computer.

C:
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <sys/mman.h> // mmap
#include <unistd.h> // getpid

void print_address_space(void) {
    fflush(stdout);
    char command[100];
    snprintf(command, 100, "procstat vm %ld", (long)getpid());
    system(command);
}

void * mmap_wrap(void *addr, size_t len, int prot, int flags, char *taskname) {
    void *addr2 = mmap(addr, len, prot, flags, -1, 0);
    if (addr2 == MAP_FAILED) {
        printf("mmap() failure.\n");
        exit(EXIT_FAILURE);
    }
    printf("%s done (at %p). Map of the virtual address space:\n", taskname, addr2);
    print_address_space();
    printf("Check swap reservation now. Then press Enter to continue.\n");
    getchar();
    return addr2;
}

int main(int argc, char *const argv[]) {
    if (argc < 2) {
        printf("Please provide a size (in GiB) as argument.\n");
        exit(EXIT_FAILURE);
    }
    unsigned long size = strtoul(argv[1], NULL, 10)*1024UL*1024UL*1024UL;
    // MAP_GUARD must not be combined with other mapping flags (see mmap(2))
    void *ptr = mmap_wrap(NULL, size, PROT_NONE,
        MAP_GUARD, "mmap(MAP_GUARD)");
    ptr = mmap_wrap(ptr, size, PROT_NONE,
        MAP_PRIVATE | MAP_ANON | MAP_FIXED, "mmap(PROT_NONE)");
    ptr = mmap_wrap(ptr, size, PROT_READ,
        MAP_PRIVATE | MAP_ANON | MAP_FIXED, "mmap(PROT_READ)");
    ptr = mmap_wrap(ptr, size, PROT_READ | PROT_WRITE | PROT_EXEC,
        MAP_PRIVATE | MAP_ANON | MAP_FIXED, "mmap(PROT_WRITE)");
    return EXIT_SUCCESS;
}

Unfortunately, the simple interface provided by malloc() does not allow it to be implemented in a way that suits your needs: all the reserved pages have to be made writable immediately, because the user can write at any address at any time (an attempt to write a byte in a page marked read-only triggers a segmentation fault; it does not make the page writable). And since malloc() is standard and easy to use, most people just use it instead of mmap().

That being said, this thread of the freebsd-hackers mailing list mentions other valid use-cases for overcommitment. So the simple interface provided by malloc() is not the only culprit.

----------------------------

obsigna -- I think there is an easy way to make a process disappear from the radar of the OOM killer: make your process call the function madvise with MADV_PROTECT as argument (with root privileges).
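In code, that suggestion is essentially a one-liner. The sketch below (function name is mine) passes zero addr/len, which I believe is what the FreeBSD-specific MADV_PROTECT flag expects; the #ifdef guard is my addition for platforms that lack the flag:

```c
#include <sys/mman.h>
#include <stdio.h>

/* Ask the VM system not to kill this process when swap is exhausted
 * (FreeBSD-specific MADV_PROTECT; requires superuser privileges).
 * Returns 0 on success, -1 on failure or if unsupported. */
int oom_protect(void) {
#ifdef MADV_PROTECT
    if (madvise(NULL, 0, MADV_PROTECT) == -1) {
        perror("madvise(MADV_PROTECT)");
        return -1;
    }
    return 0;
#else
    return -1;   /* MADV_PROTECT not available on this platform */
#endif
}
```

Called once at daemon startup (while still root, before dropping privileges), this marks the whole process as off-limits to the OOM killer.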
 
What do you mean by "existing APIs"? Do you include mmap() in those APIs? Or just malloc()? Because I think mmap() does allow the reservation of a range of virtual address space that does not count as allocated memory when the kernel enforces a strict no overcommit policy. [...] You can temporarily turn off the WRITE attribute of the memory pages that you are not currently using.
I'm not sure this helps with the problem at hand: You want to reserve a (potentially huge) chunk of contiguous address space, and then you want to reserve actual memory backing it page-by-page, as needed. I think with mmap(), you can only change the access mode for the whole chunk? Correct me if I'm wrong...

obsigna -- I think there is an easy way to make a process disappear from the radar of the OOM killer: make your process call the function madvise with the MADV_PROTECT behaviour (with root privileges).
Ah, the API that goes with protect(1). Interesting idea! But, of course, still a workaround ;)
 
...

obsigna -- I think there is an easy way to make a process disappear from the radar of the OOM killer: make your process call the function madvise with the MADV_PROTECT behaviour (with root privileges).
For me this is the solution to prevent my daemon from being inadvertently killed, which would be problematic because the DAQ board(s) might be left in an inappropriate, non-idle state with nothing supervising them, which may (remotely) result in the destruction of the electrochemical cell, depending on which method was running.

So, I will implement this together with my plan B, i.e. using my malloc wrapper to impose a reasonable limit on how much memory my daemon may allocate. If that limit were exceeded for some reason (usually a bug), the daemon would quit gracefully and could leave a respective message.

By the way, I never understood why the Linux people like to mmap anything and everything so much. Mapping memory is computationally expensive; I know this because I have already done it step by step in my kernel module. So mmap is not a toll-free bridge to all kinds of storage; the toll is quite expensive.

For example, I have already done the first experiments with vm.overcommit = 7. Once I increased the swap partition to 16 GB, I could run the GNOME 3 desktop and most of its applications without problems. Once I started Firefox, its first tab crashed, because allocating 5 × 2.5 GB of swap space exceeded the limit. And now the best of all: this idiotic OOM killer did not kill Firefox, but reproducibly killed something of the ssh/bash/tty session by which I was logged in from another machine in order to monitor the system. Can it be more stupid?
 
By the way, I never understood why the Linux people like to mmap anything and everything so much. Mapping memory is computationally expensive; I know this because I have already done it step by step in my kernel module. So mmap is not a toll-free bridge to all kinds of storage; the toll is quite expensive.
I would be surprised if any malloc() implementation still used sbrk() (and even then, mapping pages has to happen behind sbrk() as well). When you use malloc(), you already use mmap().
 
I'm not sure this helps with the problem at hand: You want to reserve a (potentially huge) chunk of contiguous address space, and then you want to reserve actual memory backing it page-by-page, as needed. I think with mmap(), you can only change the access mode for the whole chunk? Correct me if I'm wrong...

I think you are wrong: we can change the access protection on a page-by-page basis (although it is better to keep pages with the same access protection together, to save kernel memory). Everything I said in my previous post does not help with your issue, but you repeated several times something I think is wrong, so I took the time to write an explanation (you are the OP).

obsigna -- Next time there is a call for Foundation-supported project ideas, you can suggest to give the admin the ability to rank processes according to their importance, so that the OOM killer will target the low ranking processes first (currently, I think there are only two ranks: untouchable and fair game).
 
you can suggest to give the admin the ability to rank processes according to their importance, so that the OOM killer will target the low ranking processes first (currently, there are only two ranks: untouchable and fair game).
Since the OOM killer only kicks in if something is really wrong, I think it's better if you notice it as soon as possible. Killing lower-ranked processes first would introduce a delay until you notice something's wrong.

I think the current two ranks are the best solution here.
 
I think you are wrong. I think we can change the access protection on a page-by-page basis
If that's indeed possible, could you give an example of how? Because then I think "overcommit" would be unnecessary if all programs were well-behaved... (well, disregarding the fork() issue for now)
 
obsigna-- Next time there is a call for Foundation-supported project ideas, you can suggest to give the admin the ability to rank processes according to their importance, so that the OOM killer will target the low ranking processes first (currently, I think there are only two ranks: untouchable and fair game).
Basically I am happy with the two "ranks", which, as eternal_noob said, should be sufficient. I am unhappy with the poor choices the OOM killer makes within the fair-game rank. It should kill the process which actually caused the OOM, not unrelated processes. Does it make any sense to anyone that Firefox (the culprit) continues running, while one (or all) of sshd/bash/tty gets killed?

[Edit]: The same goes for my other incident in October/November 2020 (see #50). Why did it kill Apache and not the culprit, MySQL? A stupid and cumbersome decision.
 
Does this make any sense to anyone that Firefox (the culprit) continues running, while one of (or all) sshd/bash/tty becomes killed?
Yes, the process currently needing RAM is currently doing some work, so chances are it's more "important" than some other, idle, process that holds a sufficient amount of RAM. Yes, that's a very imperfect heuristic. The OOM killer is a last resort, and you don't want to ever need it...
 
I am sorry I started it, but we should avoid talking about improving FreeBSD, or this useful thread might be locked. The call for project ideas was a special temporary exception.

DutchDaemon said:
As of today, FreeBSD Forums staff will actively close down (and eventually remove) topics that serve no other purpose than to complain that "FreeBSD is not (like) Linux" (or Windows, or MacOS, or any other operating system), or that "FreeBSD does not use systemd", or that "FreeBSD has no default GUI", or that "FreeBSD does not encrypt gremlins", etc. This also includes topics that devolve into that kind of debate.


Note that this is a general user and administrator forum, where the community aims to assist those who want to install, run, or upgrade FreeBSD as-is. Discussions about what FreeBSD needs to be, or needs to add, or needs to lose, are pointless on the forums. We do not maintain the operating system here.

Now, to get back to how to use FreeBSD as-is, here is a test program showing how to change the access protection of the virtual address space, on a page-by-page basis:

C:
#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h> // mmap
#include <unistd.h> // getpid sysconf

void print_address_space(void) {
    fflush(stdout);
    char command[100];
    snprintf(command, 100, "procstat vm %ld", (long)getpid());
    system(command);
}

int main(void) {
    unsigned long page_size = sysconf(_SC_PAGESIZE);
    unsigned long range_size = 100*page_size;
    printf("Page size: %lu bytes.\n\n", page_size);

    printf("Initial address space:\n");
    print_address_space();

    void *ptr = mmap(NULL, range_size, PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANON, -1, 0);
    if (ptr == MAP_FAILED) {
        printf("mmap() failure.\n");
        return EXIT_FAILURE;
    }
    printf("\nAddress space after a range of 100 contiguous pages has been\n"
           "allocated with write+exec protection at %p:\n", ptr);
    print_address_space();

    void *first_page_address = ptr;
    mmap(first_page_address, page_size, PROT_EXEC,
         MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    printf("\nThe first page of the range has been deleted and replaced with\n"
           "a new exec-only page (at %p).\n", first_page_address);

    void *second_page_address = ((char*)first_page_address) + page_size;
    mmap(second_page_address, page_size, PROT_NONE,
         MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    printf("\nThe second page of the range has been deleted and replaced with\n"
           "a new page (PROT_NONE) (at %p).\n", second_page_address);

    void *third_page_address = ((char*)second_page_address) + page_size;
    mmap(third_page_address, page_size, PROT_READ | PROT_WRITE | PROT_EXEC,
         MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
    printf("\nThe third page of the range has been deleted and replaced with\n"
           "a new page (with read+write+exec access) (at %p).\n",
           third_page_address);

    printf("\nFinal address space:\n");
    print_address_space();

    return 0;
}

Edit: I have tested this program with QEMU on FreeBSD-13.0-RELEASE-amd64.qcow2, and it works as expected.

There is a way to change the access protection of a range of virtual memory addresses without removing the data, but the granularity of the change is not guaranteed to be a single page.
 
Ambert Uh-oh... this code doesn't look "pretty" for sure, but I take your word that it works, so it actually fulfills the requirement. Maybe someone should write a better allocator than the standard malloc() using this technique behind the scenes. Used consistently, it would eliminate at least one important reason for overcommit!

In fact, I wasn't aware that you can "re-mmap" just parts of what you mapped before; that's a (pleasant!) surprise.

BTW, I think you misunderstand this "forum rule" a bit. Nobody here has a problem with general OS design discussions, as long as they are not plain requests to change FreeBSD (which would of course make no sense here, and typically come from e.g. systemd fanboys ;))
 
You should not take my word for it. For the moment I don't have access to a FreeBSD installation (I am stuck on a Linux install), so the code was not tested properly; that's why most of my sentences begin with "I think". But you can execute this code on any FreeBSD install, and you will understand how it works just by looking at the program's output. I could have made the program prettier, but then it would not have been a good introduction to mmap(). Edit: I have now tested the code on FreeBSD over QEMU, and it works as expected.

Used consistently, it would eliminate at least one important reason for overcommit!

See one of my previous posts:

Unfortunately, the simple interface provided by malloc() does not allow it to be implemented in a way that suits your needs: all the reserved pages have to be made writable immediately, because the user can write at any address at any time (an attempt to write a byte in a page marked read-only triggers a segmentation fault; it does not make the page writable). And since malloc() is standard and easy to use, most people just use it instead of mmap().

In addition, if you want to write portable code, malloc() is a good bet. Otherwise you have to abstract the memory handling in some kind of library, and use mmap() for Unix systems and VirtualAlloc() for Windows systems (and potentially other low-level memory functions for other operating systems). And VirtualAlloc() cannot do everything mmap() can.

In particular, mmap() can reserve the entire virtual address space without memory overhead (as long as all the pages have the same access protection). But with VirtualAlloc(), reserving a range of virtual addresses has a cost of 1 bit per 64 KiB, and if you write over that memory, the cost becomes 8 bytes per 4 KiB (which is 0.2%), even if you release the written pages by setting their access protection to none. You have to completely release the entire reserved area to get almost all of your memory back. I say "almost" because you can never get back the reservation cost of 1 bit per 64 KiB, unless you terminate the process. That's true for Windows 8 and 10 (x86_64). I don't know about Windows 11.
 