I'm not sure if this is the right subforum for this, and if it's not, my apologies! I just didn't know where exactly this would fit in.
Here's my problem: I'm using the x265 command-line video encoder on FreeBSD 12.1-RELEASE-p1, and it works perfectly fine unless I specify the --pme parameter, which activates the "parallel motion estimation" feature of the encoder. Once this is active, the kernel load on the machine rises significantly, eating up 30-40% of even an entire 32-core, 64-thread processor. Let me give you a few specs regarding my test system:
- OS: FreeBSD 12.1-RELEASE-p1 running the GENERIC kernel
- CPU: AMD Ryzen Threadripper 3970X (32 cores, 64 threads, baremetal)
- Architecture: amd64
- x265 version: Any from 2.5+48-bd438ce10843 up to 3.2.1+1-b5c86a64bbbe
- Compiler: clang (any version with C++11 support), GCC (any version with C++11 support)
- Assembler: yasm 1.3.0 and nasm 2.13.x or 2.14.x
I don't really know much about debugging or anything, but I at least tried to find some clues using truss -c -D -H -s 256 and truss -D -H -s 256. It seems the system gets stuck doing tons of _umtx_op system calls which take an enormous amount of time:
Code:
syscall                     seconds   calls  errors
thr_new                 0.007680868      71       0
getcontext              0.000003631       1       0
getpid                  0.000005630       2       0
__sysctl                0.000016060       4       0
issetugid               0.000007411       2       0
write                   0.000235482      21       0
thr_self                0.000002880       1       0
sysarch                 0.000003029       1       0
sigprocmask             0.000068142      17       0
sigaction               0.000014991       2       0
rtprio_thread           0.000002880       1       0
readlink                0.000004490       1       1
read                   17.799654291  249306       0
pread                   0.000005131       1       0
openat                  0.000061553      13       3
open                    0.000148284       4       1
munmap                  0.010717130      47       0
mprotect                0.001165332      83       0
mmap                    1.810752957    3062       0
madvise                 0.000003110       1       0
getrlimit               0.000002850       1       0
fstat                   0.000068339      13       0
close                   0.000035990      10       0
_umtx_op            26676.881285215  747621    1076
                      ------------- ------- -------
                    26696.511945676 1000286    1081
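To put that summary into perspective, the share of time and the average time per call can be worked out from the table. The following is only a rough Python sketch, not part of truss or x265: the names SUMMARY, ROW and parse_summary are made up for illustration, and only four rows from the table are pasted in. As far as I understand, the seconds column is time spent inside each system call summed over all threads, so time spent blocked in a wait would count as well; that is an assumption and not something verified here.

```python
import re

# Four rows copied from the truss -c summary (syscall, seconds, calls, errors).
SUMMARY = """\
thr_new 0.007680868 71 0
read 17.799654291 249306 0
mmap 1.810752957 3062 0
_umtx_op 26676.881285215 747621 1076
"""

# One data row: name, fractional seconds, call count, error count.
# The header, the dashed rule and the totals line do not match this pattern.
ROW = re.compile(r"^(\S+)\s+(\d+\.\d+)\s+(\d+)\s+(\d+)$")


def parse_summary(text):
    """Return a list of (name, seconds, calls, errors) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            name, secs, calls, errs = m.groups()
            rows.append((name, float(secs), int(calls), int(errs)))
    return rows


rows = parse_summary(SUMMARY)
total = sum(secs for _, secs, _, _ in rows)

# Print the most expensive system calls first.
for name, secs, calls, errs in sorted(rows, key=lambda r: r[1], reverse=True):
    share = secs / total
    ms_per_call = secs / calls * 1e3
    print(f"{name:10s} {share:8.3%} of time, {ms_per_call:9.3f} ms/call, {errs} errors")
```

With these rows, _umtx_op accounts for more than 99.9% of the summed time and averages roughly 36 ms per call, while read averages well under a tenth of a millisecond per call.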
I assume the 17 seconds in "read" are probably because I'm providing large amounts of uncompressed 8K video data to the encoder via a pipe. But _umtx_op consumed 26676 (CPU?) seconds in just a very short time. I tried to dig further and found lots and lots of those calls in the truss output. Here's a very tiny snippet:
Code:
101861: 0.011574437 _umtx_op(0x800cb3db8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101875: 0.001024685 _umtx_op(0x800d2cbb8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
102685: 0.249927998 _umtx_op(0x800cea650,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd79b9d68) ERR#60 'Operation timed out'
101905: 0.021449979 _umtx_op(0x800cea350,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
101884: 0.008779694 _umtx_op(0x824c3dd00,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
101865: 0.000615929 _umtx_op(0x800d2b7b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101895: 0.008443820 _umtx_op(0x800d527b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
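To see which _umtx_op operations dominate and how many of the waits time out, a full trace could be tallied per operation. This is a hedged Python sketch using only the seven lines quoted in the snippet; TRACE, LINE and tally are hypothetical names, and the regular expression assumes the line layout shown there (thread ID, time delta, call, result). In the snippet, the wait that times out is the one that passes a non-zero size (0x18) and a pointer as its last two arguments, while the other wait passes zeros. That would fit a wait with a timeout simply expiring, though this is an interpretation of the trace and not a confirmed diagnosis.

```python
import re
from collections import Counter

# The seven trace lines quoted in the snippet.
TRACE = """\
101861: 0.011574437 _umtx_op(0x800cb3db8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101875: 0.001024685 _umtx_op(0x800d2cbb8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
102685: 0.249927998 _umtx_op(0x800cea650,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd79b9d68) ERR#60 'Operation timed out'
101905: 0.021449979 _umtx_op(0x800cea350,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x0,0x0) = 0 (0x0)
101884: 0.008779694 _umtx_op(0x824c3dd00,UMTX_OP_MUTEX_WAKE2,0x0,0x0,0x0) = 0 (0x0)
101865: 0.000615929 _umtx_op(0x800d2b7b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
101895: 0.008443820 _umtx_op(0x800d527b8,UMTX_OP_NWAKE_PRIVATE,0x1,0x0,0x0) = 0 (0x0)
"""

# Assumed layout: "<thread id>: <delta> _umtx_op(<addr>,<op>,<args>) <result>"
LINE = re.compile(
    r"^(?P<tid>\d+):\s+(?P<delta>\d+\.\d+)\s+"
    r"_umtx_op\(0x[0-9a-f]+,(?P<op>\w+),[^)]*\)\s+(?P<result>.*)$"
)


def tally(text):
    """Count _umtx_op calls per operation, and separately those ending in ERR#60."""
    ops = Counter()
    timeouts = Counter()
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that is not a _umtx_op line
        ops[m["op"]] += 1
        if m["result"].startswith("ERR#60"):
            timeouts[m["op"]] += 1
    return ops, timeouts


ops, timeouts = tally(TRACE)
for op, count in ops.most_common():
    print(f"{op:28s} {count:4d} calls, {timeouts[op]} timed out")
```

Run against a complete trace file instead of the pasted sample, the same tally would show whether the timeouts are confined to UMTX_OP_WAIT_UINT_PRIVATE and how the wake calls compare in number to the waits.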
I guess that 'Operation timed out' thing might be part of the issue? But I'm unsure; it shows up every 50-100 or so system calls. There are over a million of those calls for less than 10 minutes of total runtime...
This did not happen on FreeBSD 11.1-RELEASE before (using clang to compile x265). It also does not happen on modern Fedora 31 Linux (GCC 9) or on Microsoft Windows 10 1909 (MSVC++ 2017), same source code in every case. I tried several versions of clang as well as GCC on FreeBSD 12.1-RELEASE-p1 to see whether the compiler makes a difference, but it doesn't.
My x265 source code is modified to allow for higher resolutions, however, so just to make sure, I tried the vanilla source code (the much newer version 3.2.1+1 of x265) as well, and the problem stays the same! As soon as I specify --pme, the encoder becomes really slow and gets stuck because the CPU is being eaten up by the kernel load.
I know one might say that this is a problem I should report to the x265 developers, but given that this appears to be FreeBSD-specific somehow, I thought I'd ask here first.
Does anybody have an idea about how I can narrow the problem down further, or what might be causing this behavior?
This isn't a critical thing for me, because I don't need that parameter for production, but there is a specific test I'd like to run, which relies on it. Using that, I'd love to make a comparison between Windows, Linux and FreeBSD in terms of performance, but that won't make much sense with the encoder being in this state on FreeBSD.
I'd be thankful for any ideas!