Indeed it is. To a byte.
It's not a bug, it's a feature ;-)
I appreciate the efforts that the community put into looking at this crash.
I don't think there's any point in going deeper on this topic; all the information is in this thread for future readers.
I ran memtest86 over lunch. No errors showed up. I will run the full test suite with four passes overnight.
jbo@fbsd_beefy01 /u/h/jbo> mcelog --no-dmi --asci --file /var/crash/core.txt.0
Hardware event. This is not a software error.
CPU 8 BANK 0
ADDR 1ffff80bbf800
MCG status:
STATUS 9400004000040150 MCGSTATUS 0
MCGCAP c0c APICID 8 SOCKETID 0
CPUID Vendor Intel Family 6 Model 158 Step 10
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"
I have been running (and still am) sysutils/stress for > 1h and the system is still running/stable.
The CPU temperature never exceeds 56.0C. There's a massive Noctua NH-D15 cooler on there. While I get your comment regarding cleaning sockets, this would at least show that cooling performance is adequate.
Personally I suspect the Quadro in this case. It might not even be the actual hardware, but the GPU-driver combination. As I have written here before, if you could just temporarily swap the GPU for some other model, that would give a good comparison point.
Unfortunately these days I have few GPUs just "lying around". The only real options I'd have would be an old Quadro K2000, a GTX 1080 or a Quadro M1000 if really necessary.
Run make -j12 buildworld just to stress the machine. Build other projects you work on in parallel. Try to crash that machine with whatever workload you can think of that you normally run on that system.
gcc-arm-embedded was invoking the linker starts to tell me that this might not necessarily be a hardware fault. Of course I understand that all the signs are pointing that way, though.
Could you install devel/hwloc2 using your preferred method of installing packages and show the output of lstopo-no-graphics?
Here you go:
jbo@fbsd_beefy01 /u/h/jbo> sudo lstopo-no-graphics
Failed to initialize LevelZero in ze_init(): 2013265921
Machine (62GB total)
Package L#0
NUMANode L#0 (P#0 62GB)
L3 L#0 (12MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#1)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#2)
PU L#3 (P#3)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#4)
PU L#5 (P#5)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#6)
PU L#7 (P#7)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#8)
PU L#9 (P#9)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#10)
PU L#11 (P#11)
HostBridge
PCIBridge
PCI 01:00.0 (VGA)
PCIBridge
PCI 02:00.0 (NVMExp)
PCI 00:17.0 (SATA)
PCIBridge
PCI 03:00.0 (NVMExp)
PCIBridge
PCI 04:00.0 (Ethernet)
PCIBridge
PCIBridge
PCIBridge
PCI 08:00.0 (Ethernet)
PCIBridge
PCI 09:00.0 (Ethernet)
PCI 00:1f.6 (Ethernet)
Oh, in my last post I forgot to mention: you can also read the dmesg from a vmcore by running dmesg -M /var/crash/vmcore.{N}.
Here are the last few messages of each vmcore:
Fatal trap 1: privileged instruction fault while in kernel mode
cpuid = 5; apic id = 05
instruction pointer = 0x20:0xffffffff80f275ed
stack pointer = 0x0:0xfffffe01c65a6840
frame pointer = 0x0:0xfffffe01c65a6930
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 34881 (cc1)
trap number = 1
panic: privileged instruction fault
cpuid = 5
time = 1634053989
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108a67e at trap+0x8e
#5 0xffffffff81061958 at calltrap+0x8
#6 0xffffffff80f2741d at vm_fault_trap+0x6d
#7 0xffffffff8108b3b8 at trap_pfault+0x1f8
#8 0xffffffff8108a9ed at trap+0x3fd
#9 0xffffffff81061958 at calltrap+0x8
Uptime: 3h34m36s
Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0xffffffffffffff83
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff8108b55a
stack pointer = 0x0:0xfffffe02098d1ae0
frame pointer = 0x0:0xfffffe02098d1af0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 13671 (cc1)
trap number = 12
panic: page fault
cpuid = 7
time = 1634057457
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff81061958 at calltrap+0x8
Uptime: 56m52s
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 8
MCA: CPU 8 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80f29480
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 4
MCA: CPU 4 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80f27a80
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 6
MCA: CPU 6 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80c26843
MCA: Bank 4, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x1014690
MCA: Misc 0x1014690
timeout stopping cpus
panic: Unrecoverable machine check exception
cpuid = 3
time = 1634058589
KDB: stack backtrace:
Uptime: 12m39s
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 9
MCA: CPU 9 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80be7ad4
MCA: Bank 0, Status 0x9400004000040150
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 4
MCA: CPU 4 COR (1) ICACHE L0 IRD error
MCA: Address 0x1ffff80f35f89
MCA: Bank 4, Status 0xbe00000000800400
MCA: Global Cap 0x0000000000000c0c, Status 0x0000000000000005
MCA: Vendor "GenuineIntel", ID 0x906ea, APIC ID 3
MCA: CPU 3 UNCOR PCC internal timer error
MCA: Address 0x1014654
MCA: Misc 0x1014654
timeout stopping cpus
panic: Unrecoverable machine check exception
cpuid = 3
time = 1634058971
KDB: stack backtrace:
Uptime: 5m21s
Fatal trap 12: page fault while in kernel mode
cpuid = 7; apic id = 07
fault virtual address = 0xffffffffffffff85
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80d01103
stack pointer = 0x28:0xfffffe01a2b2f660
frame pointer = 0x28:0xfffffe01a2b2f6d0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 4455 (cbsd)
trap number = 12
panic: page fault
cpuid = 7
time = 1634059188
KDB: stack backtrace:
#0 0xffffffff80c574c5 at kdb_backtrace+0x65
#1 0xffffffff80c09ea1 at vpanic+0x181
#2 0xffffffff80c09d13 at panic+0x43
#3 0xffffffff8108b1b7 at trap_fatal+0x387
#4 0xffffffff8108b20f at trap_pfault+0x4f
#5 0xffffffff8108a86d at trap+0x27d
#6 0xffffffff81061958 at calltrap+0x8
#7 0xffffffff80d00c63 at vn_io_fault_doio+0x43
#8 0xffffffff80cfcb5c at vn_io_fault1+0x15c
#9 0xffffffff80cfa234 at vn_io_fault+0x1a4
#10 0xffffffff80c76798 at dofilewrite+0x88
#11 0xffffffff80c7630c at sys_write+0xbc
#12 0xffffffff8108babc at amd64_syscall+0x10c
#13 0xffffffff8106227e at fast_syscall_common+0xf8
Uptime: 2m34s
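For future readers who want to decode the MCA status words in the dumps above by hand: here is a minimal sketch in Python, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM. The function names and dictionary keys are made up for this example and do not come from mcelog or the kernel, and only the two error-code patterns that occur in this thread are handled.

```python
# Sketch: decode the architectural bits of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM; all names are hypothetical.

def decode_mca_status(status: int) -> dict:
    return {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }

def describe_error_code(code: int) -> str:
    """Name the MCA error code; covers only the patterns seen in this thread."""
    if code == 0x0400:
        return "internal timer error"
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL.
    if (code & 0xEF00) == 0x0100:
        tt = {0: "I", 1: "D", 2: "G"}.get((code >> 2) & 3, "?")
        ll = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}[code & 3]
        rrrr = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
                5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
        request = rrrr.get((code >> 4) & 0xF, "?")
        return f"{tt}CACHE {ll} {request} error"
    return f"unrecognized code 0x{code:04x}"

if __name__ == "__main__":
    for word in (0x9400004000040150, 0xBE00000000800400):
        fields = decode_mca_status(word)
        print(hex(word), describe_error_code(fields["mca_error_code"]), fields)
```

Fed the two status words from the dumps, this agrees with what the kernel printed: the Bank 0 records are corrected instruction-cache fetch errors with a count of 1, while the Bank 4 record is uncorrected with PCC set.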
0xffffffff80d01100 <+224>: mov rdi,QWORD PTR [rbp-0x30]
0xffffffff80d01104 <+228>: test rdi,rdi
0xffffffff80d01107 <+231>: je 0xffffffff80d01120 <vn_write+256>
0xffffffff80d01103, which is
(kgdb) x/i 0xffffffff80d01103
0xffffffff80d01103 <vn_write+227>: ror BYTE PTR [rax-0x7b],1
(kgdb)
$rax - 0x7b was 0xffffffffffffff85.
0xffffffff8108b557 <+39>: ret
0xffffffff8108b558 <+40>: mov rdi,rbx
0xffffffff8108b55b <+43>: add rsp,0x8
0xffffffff8108b55f <+47>: pop rbx
0xffffffff8108b55a, which is:
(kgdb) x/12i 0xffffffff8108b55a
0xffffffff8108b55a <trap_check+42>: fisttp WORD PTR [rax-0x7d]
0xffffffff8108b55d <trap_check+45>: (bad)
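Both faulting instruction pointers fall inside an instruction from the listings rather than on an instruction boundary, which is why kgdb decodes nonsense at those addresses. A small sketch of that arithmetic in Python; the helper names are made up for this illustration, and the addresses are copied from the dumps and disassembly in this thread:

```python
# Sketch: check whether a faulting instruction pointer sits on an
# instruction boundary, and recover the base register from a fault address.
# Helper names are hypothetical; addresses come from the dumps in this thread.

MASK64 = (1 << 64) - 1

def enclosing_instruction(ip, starts):
    """Return (start, offset) of the listed instruction that contains ip."""
    start = max(s for s in starts if s <= ip)
    return start, ip - start

def base_register(fault_addr, displacement):
    """Solve fault_addr = base + displacement (mod 2**64) for base."""
    return (fault_addr - displacement) & MASK64

# Instruction start addresses as shown in the two disassembly excerpts.
vn_write_starts = [0xffffffff80d01100, 0xffffffff80d01104, 0xffffffff80d01107]
trap_check_starts = [0xffffffff8108b557, 0xffffffff8108b558,
                     0xffffffff8108b55b, 0xffffffff8108b55f]

if __name__ == "__main__":
    for ip, starts in ((0xffffffff80d01103, vn_write_starts),
                       (0xffffffff8108b55a, trap_check_starts)):
        start, offset = enclosing_instruction(ip, starts)
        print(f"ip {ip:#x} is {offset} byte(s) into the instruction at {start:#x}")
    # Bogus operands were [rax-0x7b] and [rax-0x7d]; solve for rax.
    print(hex(base_register(0xffffffffffffff85, -0x7b)))
    print(hex(base_register(0xffffffffffffff83, -0x7d)))
```

In both cases the computed base is zero, so the fault addresses in the panics are simply the displacement of the misdecoded instruction applied to rax = 0.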
You didn't do any OS upgrade since then, correct? Just so that my VM is still on the same version as yours.
As far as I know I didn't. Here's uname -a from right now:
jbo@fbsd_beefy01 /u/h/j/p/malloy (main)> uname -a
FreeBSD fbsd_beefy01 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021 root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
It wouldn't hurt to have the backtrace for the given crashes (you did paste the bt for crash 0).
Are these the backtraces listed in /var/crash/core.txt.{N}?
It is interesting to know what gcc was doing to rub the CPU the wrong way, but I'd put my wager on a faulty CPU.
Any ideas on how to provoke this? The machine in question has taken quite a beating the last few days, running a multitude of different stress tests, regular workloads, intentionally running poudriere builds alongside my other builds and so on.
stress -m 262144 or something like this. You have plenty of RAM so you need to really stress it. Or use --vm-bytes to allocate larger chunks of memory.
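To size such a run, it can help to work out how much memory each worker should allocate. A rough sketch in Python, assuming the 62GB reported by lstopo and the --vm / --vm-bytes options of stress. The helper name, the 90% target and the choice of one worker per hardware thread are arbitrary picks for illustration, not something recommended in this thread:

```python
# Sketch: build a stress command line that spreads a share of RAM over
# a fixed number of vm workers. stress_command is a hypothetical helper.

def stress_command(total_mib: int, workers: int, percent: int = 90) -> str:
    # Integer arithmetic only, so the result is exact and reproducible.
    per_worker_mib = total_mib * percent // 100 // workers
    return f"stress --vm {workers} --vm-bytes {per_worker_mib}M"

if __name__ == "__main__":
    # 62GB total and 12 hardware threads, as shown by lstopo-no-graphics.
    print(stress_command(62 * 1024, 12))
```

Leaving some headroom (here 10%) is meant to keep the machine from swapping or invoking the OOM killer instead of actually exercising the RAM.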