cp generates core dump within jails

MetaOxy · Jun 13, 2024

Hi there,

I've seen cp dump core several times, each time in the same situation.

I have two jails hosting a "cloud service" ... and in both jails I use the same script to update the software.
When copying the directory containing "Nextcloud" for backup, I've had core dumps several times, in two different jails, during a simple copy.

I don't know exactly how to read the core dump.

At first I thought it was due to the quality of the NVMe drive installed (a suspicious no-name device made in China), but I've since replaced it with a "solid" Samsung device.

But the problem is still there, and I only notice it when copying this directory in two different jails (same purpose, but for different domains).

Code:

root@nextcloud-mlfa:~ # file cp.core
cp.core: ELF 64-bit LSB core file, x86-64, version 1 (FreeBSD), FreeBSD-style, from 'cp -rp /usr/local/www/nextcloud/ /usr/local/www/nextcloud.back/', pid=25229

root@nextcloud-mlfa:~ # gdb -q `which cp` ./cp.core
Reading symbols from /bin/cp...
(No debugging symbols found in /bin/cp)
[New LWP 100498]
Core was generated by `cp -rp /usr/local/www/nextcloud/ /usr/local/www/nextcloud.back/'.
Program terminated with signal SIGBUS, Bus error.
Object-specific hardware error.
#0  0x0000117f170b3360 in fts_read () from /lib/libc.so.7
(gdb) bt
#0  0x0000117f170b3360 in fts_read () from /lib/libc.so.7
#1  0x00001176f53d78a7 in ?? ()
#2  0x00001176f53d76fe in ?? ()
#3  0x0000117f1709eafa in __libc_start1 () from /lib/libc.so.7
#4  0x00001176f53d72dd in ?? ()
(gdb) q

Code:

FreeBSD nextcloud-mlfa 14.0-RELEASE-p6 FreeBSD 14.0-RELEASE-p6 #0: Tue Mar 26 20:26:20 UTC 2024     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64

However, I noticed that the problem started to appear when I changed hardware and migrated my jails from my old i5-based box to an N100 ...

Could this be linked to faulty memory? (The memory shipped with the box is Crucial-branded.)

Thanks for your help.

cy@ · Jun 13, 2024

Looks like you were doing a cp -r (or cp -R). This could be anything from a bug in cp to faulty memory to some corrupted metadata on the filesystem that hadn't panicked the kernel but passed bad data back to fts_read(3). We will only know for sure by inspecting the dump against debugging symbols. If you were running -CURRENT, you'd have the debugging symbols, but obviously you're running either -STABLE or -RELEASE. Without symbols we can't map the offset into fts_read(3) to a source line, so we don't know what caused the bad memory access.

If you can rebuild libc, libsys, and cp itself with debugging symbols you'd be able to tell where in fts_read() the problem was. That would give you an indication whether the problem was a bug or hardware fault.

Generally a bus fault happens in text whereas segfault happens in data.

Try a full fsck of UFS and a zpool scrub of ZFS. If the memory is faulty, this carries some risk, but the alternative is to spend some $$$ (which I, personally, am loath to do).
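
For the ZFS side, the check could look roughly like the sketch below. The pool name zroot is an assumption (use whatever zpool list reports), and the run and check_pool helpers are only illustrative names. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch: scrub a ZFS pool, wait for it to finish, then show the result.
# POOL is an assumption; "zroot" is only a common default name.
POOL="${POOL:-zroot}"

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# check_pool is a hypothetical helper, not a FreeBSD tool.
check_pool() {
    run zpool scrub -w "$1" &&   # -w: block until the scrub completes
    run zpool status -v "$1"     # -v: list files affected by errors, if any
}
```

Calling check_pool "$POOL" as root starts the scrub. Non-zero READ, WRITE or CKSUM counters in the status output would point at storage or memory rather than at cp.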

covacat · Jun 13, 2024

you can get debug symbols from

https://download.freebsd.org/releases/amd64/14.0-RELEASE/base-dbg.txz
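
Once that archive is extracted (for example with tar -C / -xpf base-dbg.txz), give gdb the real /bin/cp rather than the .debug file. gdb should find the separate symbols under /usr/lib/debug by itself. A minimal sketch that writes a gdb command file follows. The write_gdb_script helper and the cp-core.gdb file name are made up for illustration, and JAIL_ROOT assumes the core sits in a Bastille jail called nextcloud-mlfa. Also keep in mind that the archive is built for 14.0-RELEASE, so the symbols may not match a binary that a later patch level replaced.

```shell
#!/bin/sh
# Sketch: generate a gdb command file to inspect cp.core non-interactively.
# JAIL_ROOT is an assumption; adjust it to where the jail actually lives.
JAIL_ROOT="${JAIL_ROOT:-/usr/local/bastille/jails/nextcloud-mlfa/root}"

# write_gdb_script is a hypothetical helper; $1 is the output file.
write_gdb_script() {
    cat > "$1" <<EOF
set sysroot $JAIL_ROOT
set debug-file-directory /usr/lib/debug
file $JAIL_ROOT/bin/cp
core-file $JAIL_ROOT/root/cp.core
bt
frame 0
info registers
x/i \$pc
EOF
}
```

After write_gdb_script cp-core.gdb, running gdb -q -batch -x cp-core.gdb prints the backtrace, the registers and the faulting instruction. set sysroot makes gdb load libc.so.7 from the jail instead of from the host, which matters if the two are at different patch levels.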

MetaOxy · Jun 13, 2024

I did the following; sorry if this is not right, but I'm not very familiar with C and debugging:

I downloaded https://download.freebsd.org/releases/amd64/14.0-RELEASE/base-dbg.txz and extracted it on my host:

Code:

% doas tar -C / -xpf base-dbg.txz

% ll /usr/lib/debug
total 43
drwxr-xr-x  2 root wheel   42B Nov 10  2023 bin
drwxr-xr-x  4 root wheel    6B Jun 13 18:22 boot
drwxr-xr-x  4 root wheel   78B Nov 10  2023 lib
drwxr-xr-x  2 root wheel    4B Nov 10  2023 libexec
drwxr-xr-x  2 root wheel   93B Nov 10  2023 sbin
drwxr-xr-x  8 root wheel    8B Nov 10  2023 usr
% file /usr/lib/debug/bin/cp.debug
/usr/lib/debug/bin/cp.debug: ELF 64-bit LSB shared object, x86-64, version 1 (FreeBSD), no program header, for FreeBSD 14.0 (1400097), FreeBSD-style, with debug_info, not stripped

Then

Code:

% gdb /usr/lib/debug/bin/cp.debug /usr/local/bastille/jails/nextcloud-mlfa/root/root/cp.core
GNU gdb (GDB) 14.1 [GDB v14.1 for FreeBSD]
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd14.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/debug/bin/cp.debug...

warning: core file may not match specified executable file.
[New LWP 100498]
Core was generated by `cp -rp /usr/local/www/nextcloud/ /usr/local/www/nextcloud.back/'.
Program terminated with signal SIGBUS, Bus error.
Object-specific hardware error.
#0 0x0000117f170b3360 in ?? ()
(gdb) bt
#0 0x0000117f170b3360 in ?? ()
#1 0x00001176f53db600 in ?? ()
#2 0x000025b0fccf14c0 in ?? ()
#3 0x00001176f53db658 in to ()
#4 0x000025b0fcc09000 in ?? ()
#5 0x0000117f157a4f20 in ?? ()
#6 0x00001176f53d78a7 in copy (argv=<optimized out>, type=(FILE_TO_DIR | DIR_TO_DNE | unknown: 0x117c), fts_options=-56571648, root_stat=0x25b0fca0c900) at /usr/src/bin/cp/cp.c:312
#7 0x0000000000000019 in ?? ()
#8 0x00000002f53d4e57 in ?? ()
#9 0x0000ffed00000006 in ?? ()
#10 0x0000117f157a4d10 in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)

Does this ring a bell?

#6 0x00001176f53d78a7 in copy (argv=<optimized out>, type=(FILE_TO_DIR | DIR_TO_DNE | unknown: 0x117c), fts_options=-56571648, root_stat=0x25b0fca0c900) at /usr/src/bin/cp/cp.c:312

Also:

I will run a full scrub of the ZFS pool (though the whole system was reinstalled some weeks ago after switching NVMe drives, and the issue was present on the other NVMe as well).
A memory error is also a possibility, so I will run an offline memtest.

BTW, I reproduced the problem again. The trigger really seems to be the migration of the jails to the new machine (I never saw a problem like this on the previous host). This script had been running for months without issue :/ Having said that, I have exactly the same problem in two different jails when I execute the same script and copy/back up the same directory... it's weird. I don't have any problems anywhere else.
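
Since the crash happens in fts_read(), which walks the directory tree, and only this one directory is affected, one way to narrow it down might be to copy each top-level entry separately and see which one fails. This is only a sketch: copy_each is a made-up helper, and the destination path in the usage line below is an arbitrary example.

```shell
#!/bin/sh
# Sketch: copy every top-level entry of a directory on its own and
# report the entries for which cp fails or is killed by a signal.
# copy_each is a hypothetical helper, not part of any FreeBSD tool.
copy_each() {
    src="$1"; dst="$2"; failed=0
    mkdir -p "$dst" || return 2
    for entry in "$src"/* "$src"/.[!.]*; do
        # Skip glob patterns that matched nothing.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        cp -Rp "$entry" "$dst"/
        status=$?
        if [ "$status" -gt 128 ]; then
            # The shell reports death by signal N as exit status 128+N.
            echo "SIGNAL $((status - 128)): $entry"
            failed=1
        elif [ "$status" -ne 0 ]; then
            echo "ERROR $status: $entry"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, copy_each /usr/local/www/nextcloud /usr/local/www/nextcloud.test inside the jail. If the same entry fails on every run, the data or metadata under it is suspect. If the failing entry changes from run to run, that would fit faulty memory better.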
