We have several SuperMicro machines, and all but two are running fine. Those two appear to run FreeBSD 9.3-RELEASE without problems, but we get a lot of MCA errors in the logs. These are most likely memory errors, which would mean the memory needs to be replaced.
On the older machine it looks like banks 7, 9 and 10 are broken. Running sysutils/mcelog doesn't tell me much more than that.
On the newer machine it again looks like banks 7, 9 and 10 are broken (I may not have pasted everything).
One machine is a couple of years old, so it's not entirely unexpected:
Code:
Oct 12 05:18:26 db2.example.com MCA: Bank 7, Status 0xcc0d16c000010091
Oct 12 05:18:26 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:18:26 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 36
Oct 12 05:18:26 db2.example.com MCA: CPU 12 COR (13403) OVER RD channel 1 memory error
Oct 12 05:18:26 db2.example.com MCA: Address 0x343353cec0
Oct 12 05:18:26 db2.example.com MCA: Misc 0x142661286
Oct 12 05:19:53 db2.example.com MCA: Bank 7, Status 0xcc010100000400a1
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 51
Oct 12 05:19:53 db2.example.com MCA: CPU 21 COR (1028) OVER WR channel 1 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x343353ce80
Oct 12 05:19:53 db2.example.com MCA: Misc 0x80289686
Oct 12 05:19:53 db2.example.com MCA: Bank 9, Status 0x8c000050000800c0
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 50
Oct 12 05:19:53 db2.example.com MCA: CPU 20 COR (1) MS channel 0 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x3efbda5500
Oct 12 05:19:53 db2.example.com MCA: Misc 0x90000000000208c
Oct 12 05:19:53 db2.example.com MCA: Bank 7, Status 0xcc010100000400a1
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 37
Oct 12 05:19:53 db2.example.com MCA: CPU 13 COR (1028) OVER WR channel 1 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x343353ce80
Oct 12 05:19:53 db2.example.com MCA: Misc 0x80289686
Oct 12 05:19:53 db2.example.com MCA: Bank 9, Status 0x8c000050000800c0
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 49
Oct 12 05:19:53 db2.example.com MCA: CPU 19 COR (1) MS channel 0 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x3efbda5500
Oct 12 05:19:53 db2.example.com MCA: Misc 0x90000000000208c
Oct 12 05:19:53 db2.example.com MCA: Bank 9, Status 0x8c000050000800c0
Oct 12 05:19:53 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 05:19:53 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 36
Oct 12 05:19:53 db2.example.com MCA: CPU 12 COR (1) MS channel 0 memory error
Oct 12 05:19:53 db2.example.com MCA: Address 0x3efbda5500
Oct 12 05:19:53 db2.example.com MCA: Misc 0x90000000000208c
Oct 12 06:18:26 db2.example.com MCA: Bank 7, Status 0xcc15b98000010091
Oct 12 06:18:26 db2.example.com MCA: Global Cap 0x0000000001000c1b, Status 0x0000000000000000
Oct 12 06:18:26 db2.example.com MCA: Vendor "GenuineIntel", ID 0x306e4, APIC ID 36
Oct 12 06:18:26 db2.example.com MCA: CPU 12 COR (22246) OVER RD channel 1 memory error
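For reference, the Status words in the log can be unpacked by hand, which at least shows where the "COR (n) OVER RD channel 1" text comes from. This is only a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout and the memory controller error code format from the Intel SDM. The function name `decode_mci_status` and the dictionary keys are made up for illustration; they are not part of FreeBSD or sysutils/mcelog.

```python
# Request types for memory controller errors (MMM field), as I understand
# the Intel SDM encoding; values 5-7 are reserved.
REQUEST_TYPES = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into a few of its fields.

    Assumed layout: bit 63 VAL, 62 OVER, 61 UC, 59 MISCV, 58 ADDRV,
    bits 52:38 corrected error count, bits 15:0 MCA error code.
    """
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": mca_code,
    }
    # Memory controller errors are encoded as 000F 0000 1MMM CCCC,
    # where F (bit 12) is the filtering bit and is ignored here.
    if (mca_code & 0xEF80) == 0x0080:
        fields["request"] = REQUEST_TYPES.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        # Channel 15 means "channel not specified".
        fields["channel"] = None if channel == 0xF else channel
    return fields


if __name__ == "__main__":
    # Two Status values copied from the db2 log above.
    for value in (0xCC0D16C000010091, 0x8C000050000800C0):
        print(hex(value), decode_mci_status(value))
```

Under these assumptions the first value decodes to 13403 corrected errors with the overflow bit set, a read request on channel 1, which matches what the kernel printed. It says nothing about which DIMM slot is involved.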
The other machine is fairly new (new enough to still be in warranty):
Code:
Oct 8 15:48:25 db4.example.com MCA: Bank 7, Status 0xcc00008000010090
Oct 8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct 8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 18
Oct 8 15:48:25 db4.example.com MCA: CPU 14 COR (2) OVER RD channel 0 memory error
Oct 8 15:48:25 db4.example.com MCA: Address 0x50ac272080
Oct 8 15:48:25 db4.example.com MCA: Misc 0x1523aba86
Oct 8 15:48:25 db4.example.com MCA: Bank 9, Status 0x8c000051000800c0
Oct 8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct 8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 19
Oct 8 15:48:25 db4.example.com MCA: CPU 15 COR (1) MS channel 0 memory error
Oct 8 15:48:25 db4.example.com MCA: Address 0x50ac272000
Oct 8 15:48:25 db4.example.com MCA: Misc 0x122940200020228c
Oct 8 15:48:25 db4.example.com MCA: Bank 7, Status 0xcc00008000010090
Oct 8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct 8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 17
Oct 8 15:48:25 db4.example.com MCA: CPU 13 COR (2) OVER RD channel 0 memory error
Oct 8 15:48:25 db4.example.com MCA: Address 0x50ac272080
Oct 8 15:48:25 db4.example.com MCA: Misc 0x1523aba86
Oct 8 15:48:25 db4.example.com MCA: Bank 9, Status 0x8c000051000800c0
Oct 8 15:48:25 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct 8 15:48:25 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
Oct 8 15:48:25 db4.example.com MCA: CPU 12 COR (1) MS channel 0 memory error
Oct 8 15:48:25 db4.example.com MCA: Address 0x50ac272000
Oct 8 15:48:25 db4.example.com MCA: Misc 0x122940200020228c
Oct 8 16:41:06 db4.example.com MCA: Bank 7, Status 0x8c00004000010090
Oct 8 16:41:06 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct 8 16:41:06 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 16
Oct 8 16:41:06 db4.example.com MCA: CPU 12 COR (1) RD channel 0 memory error
Oct 8 16:41:06 db4.example.com MCA: Address 0x50ac272080
Oct 8 16:41:06 db4.example.com MCA: Misc 0x150181886
Oct 8 17:30:50 db4.example.com MCA: Bank 9, Status 0x8c000051000800c0
Oct 8 17:30:50 db4.example.com MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Oct 8 17:30:50 db4.example.com MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 19
Now, I can imagine one bank dying. Three in one machine, although not impossible, is unlikely. And the same banks on two different machines, one old and one new? Extremely unlikely.
So I'm wondering if this might be something else. I'd also like to find out which physical bank corresponds to each bank number in the MCA errors; sysutils/mcelog doesn't tell me much more than can already be learned from the log.
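To see whether the reports cluster, the log can be tallied per bank number and physical address. The following is a rough Python sketch that only assumes the "MCA: Bank ..." and "MCA: Address ..." line shapes shown in the logs above. The names `tally_mca` and `SAMPLE` are hypothetical, and the sample is a shortened excerpt of the db4 log.

```python
import re
from collections import Counter

# Shortened excerpt of the db4 log above (only Bank and Address lines).
SAMPLE = """\
Oct 8 15:48:25 db4.example.com MCA: Bank 7, Status 0xcc00008000010090
Oct 8 15:48:25 db4.example.com MCA: Address 0x50ac272080
Oct 8 15:48:25 db4.example.com MCA: Bank 9, Status 0x8c000051000800c0
Oct 8 15:48:25 db4.example.com MCA: Address 0x50ac272000
Oct 8 15:48:25 db4.example.com MCA: Bank 7, Status 0xcc00008000010090
Oct 8 15:48:25 db4.example.com MCA: Address 0x50ac272080
Oct 8 16:41:06 db4.example.com MCA: Bank 7, Status 0x8c00004000010090
Oct 8 16:41:06 db4.example.com MCA: Address 0x50ac272080
"""

BANK_RE = re.compile(r"MCA: Bank (\d+), Status 0x[0-9a-fA-F]+")
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def tally_mca(lines):
    """Count MCA records per (bank, address) pair.

    A record is taken to be a "Bank" line followed later by an "Address"
    line; records without an address line are not counted.
    """
    counts = Counter()
    bank = None
    for line in lines:
        match = BANK_RE.search(line)
        if match:
            bank = int(match.group(1))
            continue
        match = ADDR_RE.search(line)
        if match and bank is not None:
            counts[(bank, int(match.group(1), 16))] += 1
            bank = None  # wait for the next "Bank" line
    return counts


if __name__ == "__main__":
    for (bank, addr), n in sorted(tally_mca(SAMPLE.splitlines()).items()):
        print(f"bank {bank}  address {addr:#x}  reports {n}")
```

In the excerpt, banks 7 and 9 report addresses only 0x80 bytes apart, so the two bank numbers may be describing the same spot in memory rather than separate failures. That is a guess from the numbers, not something the log states.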