Solved: Troubleshooting "a thread waited too long to allocate a page"

Hi All.

I have had a system problem since FreeBSD 13, as follows (on 12 and earlier I never saw this issue).
Code:
Jan 30 11:22:47 test kernel: [116123] pid 50340 (mysqld), jid 1, uid 88, was killed: a thread waited too long to allocate a page
Jan 30 11:22:47 test kernel: [116134] pid 32665 (php-fpm), jid 2, uid 80, was killed: a thread waited too long to allocate a page
...
Jan 30 11:39:07 test kernel: [117080] swap_pager: cannot allocate bio
...
Jan 30 14:06:11 test kernel: [125979] watchdog pre-timeout, WD_SOFT_LOG
Jan 30 14:06:11 test kernel: [125979] watchdog pre-timeout, WD_SOFT_PRINTF
Jan 30 19:30:33 test kernel: [127986] pid 61858 (pickup), jid 3, uid 125, was killed: failed to reclaim memory
The system freezes until a hardware reset.

Version of FreeBSD.
Code:
FreeBSD epopen.com 13.1-RELEASE-p3 FreeBSD 13.1-RELEASE-p3 releng/13.1-n250169-c3c13035ef2 EpopenKernel_13_1-RELEASE_2022-05-08 amd64

My system has 8GB of physical memory and a 2GB swap partition.
Using (root on) ZFS with ARC.

I have been continuously monitoring system resources with top, trying to find out what exhausts the memory:
Code:
last pid: 13312;  load averages:  0.26,  1.02,  1.00                                                                                                                               up 0+14:53:11  10:23:30
102 processes: 1 running, 100 sleeping, 1 zombie
CPU:  0.8% user,  1.5% nice,  0.7% system,  0.8% interrupt, 96.3% idle
Mem: 228M Active, 1532M Inact, 964M Wired, 40K Buf, 5113M Free
ARC: 214M Total, 68M MFU, 83M MRU, 163K Anon, 2216K Header, 61M Other
     58M Compressed, 197M Uncompressed, 3.38:1 Ratio
Swap: 2048M Total, 2048M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   49 root         52  20    0   103M    53M select   0   2:53   0.00% python3.9
55271 mysql        57  20    0   926M   612M select   1   2:15   0.00% mysqld
   30  11211       10  20    0    79M    32M kqread   1   0:15   0.00% memcached
23190 www           1  24    0   232M    73M accept   0   0:11   0.00% php-fpm
   32 www          54  20    0    97M    34M piperd   0   0:09   0.00% httpd
....

After googling, I found some information about tuning system variables, but I don't really understand them, so here are the current values (a sketch of the suggested changes follows the list).
Code:
vfs.zfs.arc.max: 268435456
vm.pfault_oom_attempts: 3
vm.pageout_oom_seq: 12
vm.pfault_oom_wait: 10
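
From what I found, vm.pageout_oom_seq and vm.pfault_oom_attempts are the knobs people suggest changing when these kill messages appear; a rough sketch of what the advice looks like (example values only, I have not tested them myself):
Code:
# Let the page daemon make more passes before the OOM killer starts killing processes
sysctl vm.pageout_oom_seq=120
# -1 is said to disable the page-fault OOM kill that prints
# "a thread waited too long to allocate a page"
sysctl vm.pfault_oom_attempts=-1
# The same name=value pairs can go into /etc/sysctl.conf to survive a reboot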

Can anyone help me with debugging, or offer suggestions/tips for troubleshooting?

Thanks very much.
 
MySQL can chew through memory - so that might be the culprit.

I'd make a script that appends top's output every minute to a log file and see if you can find the culprit after the crash. Or just keep top up and running and see where there seems to be any pressure.

For a large and busy MySQL database on top of ZFS, 8GB of RAM might not be enough without tuning - especially if you are also running a web server with PHP, and other daemons like memcached. But if all of those aren't very busy and the database is small, then it might be something else.
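
For the logging part, something like this would do (a rough, untested sketch - adjust the log path and process count to taste):
Code:
#!/bin/sh
# Append a batch-mode snapshot of top to a log file once a minute.
LOG=/var/log/top-history.log
while true; do
    date >> "$LOG"
    # -b: batch mode, -d 1: one display, -o res: sort by resident size, show 20 processes
    top -b -d 1 -o res 20 >> "$LOG"
    echo "----" >> "$LOG"
    sleep 60
done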
 
MySQL can chew through memory - so that might be the culprit.

I'd make a script that appends top's output every minute to a log file and see if you can find the culprit after the crash. Or just keep top up and running and see where there seems to be any pressure.

For a large and busy MySQL database on top of ZFS, 8GB of RAM might not be enough without tuning - especially if you are also running a web server with PHP, and other daemons like memcached. But if all of those aren't very busy and the database is small, then it might be something else.
Thank you.

My MySQL isn't busy, and several years ago I configured MySQL to use a minimal amount of memory; it worked fine with FreeBSD 12 and earlier versions.
Therefore I'm trying to find the culprit somewhere in the entire system. :-/

The point of killing processes is to keep the system running, I think.
But the actual result is a system freeze, which doesn't make sense. o_O
 
Yes, you are right - a hard freeze would be unlikely for an out-of-memory issue.

Is this new hardware or old? Is it the same machine you used for 12.x?

If it was working and has stopped, what has changed?

Possibly more likely hardware? Drives all OK?
 
Yes, you are right - a hard freeze would be unlikely for an out-of-memory issue.

Is this new hardware or old? Is it the same machine you used for 12.x?

If it was working and has stopped, what has changed?

Possibly more likely hardware? Drives all OK?
The hardware never changed...
Maybe the new versions of the software (e.g. MySQL and others) consume more memory than before. :-/

I will keep monitoring. :)
 
I'd check the drive(s) with sysutils/smartmontools. Even if it's just to rule them out as a possible cause. Run smartctl -a /dev/da0 (you may need to adjust the device name). It might be too much to copy/paste, so smartctl -a /dev/da0 | nc termbin.com 9999 would upload the output to termbin.
 
[...] The system freezes until a hardware reset.
My system has 8GB of physical memory and a 2GB swap partition.
Using (root on) ZFS with ARC.
What is your ZFS setup? Using mirrors I presume?

[...] The point of killing processes is to keep the system running, I think.
But the actual result is a system freeze, which doesn't make sense. o_O
When the system freezes, can you access it remotely (ssh)?

When the system appears to be frozen, a severely memory-starved ZFS system (especially with a relatively small swap space of 2GB) may take a while to become responsive again.
How much time (approximately) is there between when you notice the system freeze and you do a hardware reset?
 
Why only 2 GB of swap space when running mysql and memcached?

I mean, the error message is not from the OOM killer, but the log reports fit a full swap space very well.
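
If more swap does turn out to be needed, a swap file can be added without repartitioning; roughly like this (the Handbook method - path and size are just examples):
Code:
# Create a 4 GB file to use as swap, readable by root only
dd if=/dev/zero of=/usr/swap0 bs=1m count=4096
chmod 0600 /usr/swap0
# Add this line to /etc/fstab so it is enabled at boot:
#   md99  none  swap  sw,file=/usr/swap0,late  0  0
# Then activate the "late" swap devices now:
swapon -aL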
 
I'd check the drive(s) with sysutils/smartmontools. Even if it's just to rule them out as a possible cause. Run smartctl -a /dev/da0 (you may need to adjust the device name). It might be too much to copy/paste, so smartctl -a /dev/da0 | nc termbin.com 9999 would upload the output to termbin.
Hi SirDice

Please reference https://termbin.com/n2al

I'm surprised by the Power_Cycle_Count and Power-Off_Retract_Count attributes.
Because the drive is used in a server that is powered up 24/7, the power cycle count should be low; could the drive be damaged?

The system doesn't run smartctl frequently due to kern_securelevel="2".

EDIT:
Following richardtoohey2's information, the attribute values are weird,
so I regenerated the report with smartctl -x ... (prints all SMART and non-SMART information about the device).
Please see the new report: https://termbin.com/8cvj7

Thanks a lot.
Neko
 
What is your ZFS setup? Using mirrors I presume?


When the system freezes, can you access it remotely (ssh)?

When the system appears to be frozen, a severely memory-starved ZFS system (especially with a relatively small swap space of 2GB) may take a while to become responsive again.
How much time (approximately) is there between when you notice the system freeze and you do a hardware reset?
What is your ZFS setup? Using mirrors I presume?
>> Only one drive.

When the system freezes, can you access it remotely (ssh)?
>> I cannot access it remotely, therefore I have to ask my brother to press the reset button.

When the system appears to be frozen, a severely memory-starved ZFS system (especially with a relatively small swap space of 2GB) may take a while to become responsive again.
How much time (approximately) is there between when you notice the system freeze and you do a hardware reset?
>> Within the first few hours I could still ping my server, but later I could not (no reply); after approximately 12 hours of continued ping failures, I asked my brother to help with a hard reset.

Thanks a lot.
Neko
 
Intel 330s - looks like they were released 10 years ago - how old are the drive(s)?

Is this a custom kernel?
It was bought in 2014, but it sat in storage (no power) for 2~3 years, as I remember.
Given the Power_Cycle_Count and Power-Off_Retract_Count mentioned above, can the Power_On_Hours_and_Msec value even be trusted?

Yes, I'm using a custom kernel (device ada is included).

Thanks a lot.
 
Why only 2 GB of swap space when running mysql and memcached?

I mean, the error message is not from the OOM killer, but the log reports fit a full swap space very well.
Ans: The 2 GB of swap space is mostly unused (the configuration of all the daemons came from a 256MB RAM machine, with only a few changes made to fit the machine with more memory).

Yes, I also think it is a full-swap-space problem, possibly caused by misbehaving software (a memory leak?).
Therefore I am monitoring to find out which one is the cause.

Thanks a lot.
 
Hi All

I captured the output of top once per minute and found the noteworthy points below.
  1. 2023/2/25 Sat 03:15:34 CST
    last pid: 48; load averages: 1.16, 1.85, 1.28 up 23+16:03:05 03:15:34
    128 processes: 1 running, 119 sleeping, 3 stopped, 5 zombie
    CPU: 0.8% user, 0.5% nice, 0.9% system, 0.7% interrupt, 97.2% idle
    Mem: 247M Active, 1499M Inact, 100M Laundry, 3790M Wired, 104K Buf, 2206M Free
    ARC: 348M Total, 119M MFU, 65M MRU, 4832K Anon, 2624K Header, 157M Other
    39M Compressed, 216M Uncompressed, 5.52:1 Ratio
    Swap: 2048M Total, 625M Used, 1423M Free, 30% Inuse

    PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
    39131 mysql 57 20 0 1132M 510M select 1 86:41 0.00% /usr/local/libexec/mysqld --defaults-extra-file=/usr/local/etc/mysql/my.cnf --basedir=/usr/local --datadir=/home/sqldb/data --plugin-dir=/usr/local/lib/mysql/plugin --user=mysql --log-error=sql.epopen.com.err --open-files-limit=12K --pid-file=/var/run/mysql-server/pid --socket=/tmp/mysql.sock --port=3306
    36 www 1 22 0 232M 70M accept 1 1:35 0.00% php-fpm: pool neko_push_down (php-fpm)
    37 www 1 25 0 232M 66M accept 1 0:24 0.00% php-fpm: pool neko_push_down (php-fpm)
    92231 www 1 20 0 232M 65M accept 1 0:11 0.00% php-fpm: pool neko_push_down (php-fpm)
    96204 root 52 20 0 106M 44M select 0 98:16 0.00% /usr/local/bin/python3.9 /usr/local/bin/fail2ban-server --async -b -s /var/run/fail2ban.sock -p /var/run/fail2ban.pid --loglevel INFO --logtarget /var/log/fail2ban.log --syslogsocket auto
    57515 root 21 31 0 278M 42M uwait 0 0:02 0.00% /usr/local/libexec/git-core/git merge --ff-only FETCH_HEAD
  2. 2023/2/25 Sat 03:16:35 CST
    last pid: 78361; load averages: 1.06, 1.69, 1.25 up 23+16:04:06 03:16:35
    125 processes: 1 running, 116 sleeping, 3 stopped, 5 zombie
    CPU: 0.8% user, 0.5% nice, 0.9% system, 0.7% interrupt, 97.2% idle
    Mem: 235M Active, 1500M Inact, 100M Laundry, 5832M Wired, 104K Buf, 176M Free
    ARC: 375M Total, 143M MFU, 52M MRU, 4807K Anon, 2778K Header, 173M Other
    41M Compressed, 243M Uncompressed, 5.86:1 Ratio
    Swap: 2048M Total, 625M Used, 1423M Free, 30% Inuse

    PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
    39131 mysql 57 20 0 1132M 510M select 1 86:41 0.00% /usr/local/libexec/mysqld --defaults-extra-file=/usr/local/etc/mysql/my.cnf --basedir=/usr/local --datadir=/home/sqldb/data --plugin-dir=/usr/local/lib/mysql/plugin --user=mysql --log-error=sql.epopen.com.err --open-files-limit=12K --pid-file=/var/run/mysql-server/pid --socket=/tmp/mysql.sock --port=3306
    36 www 1 22 0 232M 70M accept 1 1:35 0.00% php-fpm: pool neko_push_down (php-fpm)
    37 www 1 25 0 232M 66M accept 1 0:24 0.00% php-fpm: pool neko_push_down (php-fpm)
    92231 www 1 20 0 232M 65M accept 1 0:11 0.00% php-fpm: pool neko_push_down (php-fpm)
    96204 root 52 20 0 106M 44M select 1 98:16 0.00% /usr/local/bin/python3.9 /usr/local/bin/fail2ban-server --async -b -s /var/run/fail2ban.sock -p /var/run/fail2ban.pid --loglevel INFO --logtarget /var/log/fail2ban.log --syslogsocket auto
    57515 root 19 31 0 278M 42M uwait 0 0:03 0.00% /usr/local/libexec/git-core/git merge --ff-only FETCH_HEAD
  3. 2023/2/25 Sat 03:17:38 CST
    last pid: 48; load averages: 11.30, 3.75, 2.01 up 23+16:05:13 03:17:42
    125 processes: 6 running, 111 sleeping, 3 stopped, 5 zombie
    CPU: 0.8% user, 0.5% nice, 0.9% system, 0.7% interrupt, 97.2% idle
    Mem: 18M Active, 35M Inact, 68M Laundry, 1245M Wired, 104K Buf, 6475M Free
    ARC: 219M Total, 151M MFU, 40M MRU, 4305K Anon, 8139K Header, 17M Other
    151M Compressed, 2801M Uncompressed, 18.58:1 Ratio
    Swap: 2048M Total, 1108M Used, 940M Free, 54% Inuse

    PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
    57515 root 1 20 0 278M 32M RUN 0 0:03 0.00% /usr/local/libexec/git-core/git merge --ff-only FETCH_HEAD
    96204 root 52 20 0 106M 23M select 1 98:16 0.00% /usr/local/bin/python3.9 /usr/local/bin/fail2ban-server --async -b -s /var/run/fail2ban.sock -p /var/run/fail2ban.pid --loglevel INFO --logtarget /var/log/fail2ban.log --syslogsocket auto
    39131 mysql 57 20 0 1132M 19M select 1 86:41 0.00% /usr/local/libexec/mysqld --defaults-extra-file=/usr/local/etc/mysql/my.cnf --basedir=/usr/local --datadir=/home/sqldb/data --plugin-dir=/usr/local/lib/mysql/plugin --user=mysql --log-error=sql.epopen.com.err --open-files-limit=12K --pid-file=/var/run/mysql-server/pid --socket=/tmp/mysql.sock --port=3306
    29 root 1 -52 r0 13M 13M nanslp 1 0:05 0.00% /usr/sbin/watchdogd -t 180 --pretimeout 90 --softtimeout --pretimeout-action log,printf --softtimeout-action log,panic,printf
    36413 nobody 1 20 0 43M 6068K RUN 0 1:45 0.00% postfwd3::policy (perl)
    63872 bind 6 52 0 66M 4128K sigwai 1 0:28 0.00% /usr/local/sbin/named -u bind -c /usr/local/etc/namedb/named.conf
Item 1 -> 2: Wired goes from 3790M to 5832M, so Free drops from 2206M to 176M.
Item 2 -> 3: Swap in use goes from 30% to 54%.

The daemons' memory usage stays the same, but Wired memory usage increases.
Wired memory is used by the kernel itself, though, and I have no idea which consumer it is.
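
For the next occurrence I plan to also log where the wired memory goes, roughly with these commands alongside top (I am not sure they cover every wired consumer):
Code:
# Kernel malloc(9) allocations by type - large wired consumers show up in the MemUse column
vmstat -m
# UMA zone allocator usage - another place where wired memory can hide
vmstat -z
# Total wired page count and ZFS ARC size, to see how much of Wired is ARC vs. something else
sysctl vm.stats.vm.v_wire_count kstat.zfs.misc.arcstats.size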

Thanks a lot.
Neko
 
Was this ever resolved?

I am having the same issue after upgrading from 12.4 to 13.3

User processes which were memory-hungry, but worked fine on the former install, are now being killed off with errors "failed to reclaim memory" and "a thread waited too long to allocate a page".

Occasionally other unrelated/system processes, like getty, or even init, are also killed, leaving the system in an unstable state.

There's 8GB of swap, and no swap errors logged, so it appears it's not just simple OOM.

Did not have this issue until after the upgrade.

Edit: One idea I've had is mmap, as there are multiple processes accessing a couple of read-only files. Has any behaviour, such as default flags, changed there? Is it possible that for some reason the data is being cached individually rather than collectively? Or perhaps, given the more specific memory errors rather than just being absolutely OOM, it's unrelated. (A quick procstat check is sketched after the code below.)

Code:
      bitvector_mptr = mmap(NULL, bitvector_filesize, PROT_READ, MAP_PRIVATE | MAP_NOCORE | MAP_PREFAULT_READ, bitvector_fd, 0);
...
      mptr = mmap(NULL, hashindex_filesize, PROT_READ, MAP_PRIVATE | MAP_NOCORE, fd, 0);
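
To check whether the pages behind those mappings are shared between the processes or resident separately in each one, I could compare the per-process VM maps (a rough check; the PID is a placeholder):
Code:
# List the VM map entries of one of the processes; the file-backed ("vn") entries
# show in the RES column how many pages are resident for that mapping.
procstat -v <pid_of_application_1>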
 
Absolute carnage here... 30 mins after a reboot the system has killed off all applications, as well as the "screens" they run under, and for some reason it also killed unbound.

This never happened under 12.x. The server is currently unusable. I'm not sure what to do next...

Code:
Apr 14 00:06:29 softdev kernel: pid 9238 (<application_1>), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:11:48 softdev kernel: pid 10030 (<application_1>), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:12:29 softdev kernel: pid 9653 (<application_2>), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:12:59 softdev kernel: pid 9654 (<application_1>), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 10106 (sort), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 3321 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 3281 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 10088 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 10084 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 10083 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 10082 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 10091 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 10089 (php7), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 622 (local-unbound), jid 0, uid 59, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 3330 (screen-4.9.1), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 3314 (screen-4.9.1), jid 0, uid 0, was killed: a thread waited too long to allocate a page
Apr 14 00:27:16 softdev kernel: pid 3277 (screen-4.9.1), jid 0, uid 0, was killed: a thread waited too long to allocate a page
 
See if you can get back to 12.x (but it’s EOL now).

Could try 14.0, but that might just cause more issues.

Once you are back on 12.x, set up a mirror test environment where you can upgrade to 13.x and see if you can find the culprit.

What do you mean by “no swap errors”? No swap usage or something else?
 