CentOS 6.2 kernel panics (dead or frozen systems) on new hardware
Hi everyone,
We received a new cluster a few days ago: about 50 servers, each with two 16-core AMD Opteron 6272 CPUs, running CentOS 6.2 x86_64. The systems randomly hit kernel panics or freeze. My colleague sent me some logs, which are attached below.
log:
*****************
2012/7/11
++++++++++++++++++++++++++++++
c57 ,c45,c56,c7,c87,c107,c54,c69,c21,c105,c96,c80,c39,c105,c21,c6,c13,c104,c18,c19,c9,
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
[ic56][Hardware Error]: MC4_STATUS[Over|CE|-|-|AddrV|-|-|CECC]: 0xd430c000ff080813
[ic56][Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[ic56]EDAC amd64 MC0: CE ERROR_ADDRESS= 0x29aee9d40
[ic56]EDAC MC0: CE page 0x29aee9, offset 0xd40, grain 0, syndrome 0xff61, row 3, channel 1, label "": amd64_edac
++++++++++++++
IB reboot c33
+++++++++++++
dead c31 (2.45h kernel panic detected hard lockup cpu 27) c94 c95 ( 2.50h kernel panic detected hard lockup cpu) c98 (3h kernel panic detected hard lockup cpu22) c73 c15 c66
+++++++++++++++++++++++++++
c3 2 h
c2 3.27 h
BUG: soft lockup - CPU#6 stuck for 67s!
[IMB-MPI1:17787] [c3]Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_connt rack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT CpuSpy(P)(U) xt_CHECKSUM iptable_mangle knem(U) bridge autofs4 s unrpc 8021q garp stp llc iptable_filter ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) vhos t_net macvtap macvlan tun kvm_amd kvm sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci dm_mirro r dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] [c3]CPU 6 [c3]Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_connt rack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT CpuSpy(P)(U) xt_CHECKSUM iptable_mangle knem(U) bridge autofs4 s unrpc 8021q garp stp llc iptable_filter ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) vhos t_net macvtap macvlan tun kvm_amd kvm sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci dm_mirro r dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] [c3] [c3]Pid: 17787, comm: IMB-MPI1 Tainted: P ---------------- 2.6.32-220.13.1.el6.x86_64 #1 MICRO-STAR I NTERNATIONAL CO., LTD MS-91F2/MS-91F2 [c3]RIP: 0010:[<ffffffff814ef83e>] [<ffffffff814ef83e>] _spin_lock+0x1e/0x30 [c3]RSP: 0018:ffff8809732459f8 EFLAGS: 00000293 [c3]RAX: 0000000000000129 RBX: ffff8809732459f8 RCX: 0000000000000000 [c3]RDX: 0000000000000127 RSI: ffff880973245a78 RDI: 
ffffffff81fab308 [c3]RBP: ffffffff8100bc0e R08: ffffffff81c002c0 R09: 0000000000000000 [c3]R10: 0000000000000010 R11: 0000000000001000 R12: 0000000000000000 [c3]R13: 0000000000001000 R14: ffffffff8113bd57 R15: ffff880973245a28 [c3]FS: 00002b26791b4480(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000 [c3]CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [c3]CR2: 00002b267c31e230 CR3: 000000015c096000 CR4: 00000000000406e0 [c3]DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [c3]DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [c3]Process IMB-MPI1 (pid: 17787, threadinfo ffff880973244000, task ffff880973243540) [c3]Stack: [c3] ffff880973245a58 ffffffff81148684 ffff88000004c218 ffff880973245a38 [c3]<0> ffff880973245a18 ffff880973245a18 ffff880415c980c0 0000000000000001 [c3]<0> 0000000000000030 ffff8800fc8c5000 ffff880c2e572cb0 ffff880c2e572d58 [c3]Call Trace: [c3] [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0 [c3] [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180 [c3] [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530 [c3] [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130 [c3] [<ffffffff81072cf0>] ? __is_ram+0x0/0x10 [c3] [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40 [c3] [<ffffffff81046502>] ? reserve_memtype+0x492/0x590 [c3] [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40 [c3] [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0 [c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0 [c3] [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90 [c3] [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0 [c3] [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190 [c3] [<ffffffff81177f58>] ? alloc_file+0x98/0xe0 [c3] [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200 [c3] [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib] [c3] [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs] [c3] [<ffffffff811421c0>] ? mmap_region+0x400/0x590 [c3] [<ffffffff8114268a>] ? 
do_mmap_pgoff+0x33a/0x380 [c3] [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0 [c3] [<ffffffff8117724c>] ? sys_write+0x7c/0x90 [<ffffffff81010469>] ? sys_mmap+0x29/0x30 [c3] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b [c3]Code: 00 00 00 01 74 05 e8 62 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c 1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 [c3]Call Trace: [c3] [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0 [c3] [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180 [c3] [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530 [c3] [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130 [c3] [<ffffffff81072cf0>] ? __is_ram+0x0/0x10 [c3] [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40 [c3] [<ffffffff81046502>] ? reserve_memtype+0x492/0x590 [c3] [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40 [c3] [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0 [c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0 [c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0 [c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0 [c3] [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90 [c3] [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0 [c3] [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190 [c3] [<ffffffff81177f58>] ? alloc_file+0x98/0xe0 [c3] [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200 [c3] [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib] [c3] [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs] [c3] [<ffffffff811421c0>] ? mmap_region+0x400/0x590 [c3] [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380 [c3] [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0 [c3] [<ffffffff8117724c>] ? sys_write+0x7c/0x90 [c3] [<ffffffff81010469>] ? sys_mmap+0x29/0x30 [c3] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b [c3]BUG: soft lockup - CPU#16 stuck for 67s! 
[IMB-MPI1:16438] kernel:Code: e8 5e 83 44 00 0f ae f0 48 8b 7b 30 ff 15 49 ba 9e 00 80 7d c7 00 0f 84 9f fe ff ff f6 43 20 01 0f 84 95 fe ff ff 0f 1f 44 00 00 <f3> 90 f6 43 20 01 75 f8 e9 83 fe ff ff 0f 1f 00 4c 89 ea 4c 89 [root@compute-0-3 ~]# Message from syslogd@compute-0-3 at Jul 11 20:25:18 ... kernel:Stack: Message from syslogd@compute-0-3 at Jul 11 20:25:18 ... kernel:Call Trace: l 11 20:26:42 compute-0-3 kernel: [<ffffffff8117724c>] ? sys_write+0x7c/0x90 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81010469>] ? sys_mmap+0x29/0x30 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b Jul 11 20:26:42 compute-0-3 kernel: Code: 00 00 00 01 74 05 e8 62 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89 Jul 11 20:26:42 compute-0-3 kernel: Call Trace: Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81072cf0>] ? __is_ram+0x0/0x10 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81046502>] ? reserve_memtype+0x492/0x590 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81138554>] ? 
remap_pfn_range+0xa4/0x4a0 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81177f58>] ? alloc_file+0x98/0xe0 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib] Jul 11 20:26:42 compute-0-3 kernel: [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs] Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff811421c0>] ? mmap_region+0x400/0x590 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8117724c>] ? sys_write+0x7c/0x90 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81010469>] ? sys_mmap+0x29/0x30 Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b *************************** c24 c56 c103,c67 dead kernel error ************************************ 2012.7.12 kernel dead c42,c29 c61,c28 ic55,ic74 c15,c67,c66 ******************** after flash new bios ic103,ic15,ic24,ic28,ic31,ic41,ic56,ic61,ic66,ic67,ic73,ic87 ************************* dead ic16 dead ic40,ic41,ic67 dead ic24,ic37,ic79,ic88,ic95,c27,c63 dead ic60 2700s dead ic98 548s dead ic42 1000s dead ic60,ic30,ic67 4528s dead ic94 3734s dead ic73 1s dead ic16 3000s dead ic91 1272s dead ic66 dead ic77 ****************************************** ic47,ic56,ic63,ic81,ic95 ######dead 2012/7/14###################### ic17,ic18,ic2,ic26,ic42,ic43,ic5,ic56,ic66,ic70,ic73,ic9,ic91,ic95 change os ic66,ic67,ic56,ic73,ic16,ic41,ic24,ic95,ic60,ic87 dead ic48 ic29 300G 60*60 $$$$$$$$$$$$$$$$$$$$$$$ machine check ic57,ic17,ic45 dead ic29, ic79,ic2, ic74,ic29 ic39,ic14,ic34 Message from syslogd@compute-0-0 at Jul 15 06:52:27 ... 
kernel:[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00c00098080813 Message from syslogd@compute-0-0 at Jul 15 06:52:27 ... kernel:[Hardware Error]: Northbridge Error (node 3): DRAM ECC error detected on the NB. Message from syslogd@compute-0-0 at Jul 15 06:52:27 ... kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout) BUG: soft lockup - CPU#18 stuck for 67s! [khugepaged:343] Modules linked in: knem(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler target_core_iblock target_core_file target_core_pscsi 8021q target_core_mod garp stp configfs llc sunrpc cachefiles fscache(T) ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) uinput sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class pata_acpi ata_generic pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] CPU 18 Modules linked in: knem(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler target_core_iblock target_core_file target_core_pscsi 8021q target_core_mod garp stp configfs llc sunrpc cachefiles fscache(T) ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) uinput sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class pata_acpi ata_generic pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: 
scsi_wait_scan]
Pid: 343, comm: khugepaged Tainted: G ---------------- T 2.6.32-220.el6.x86_64 #1 MICRO-STAR INTERNATIONAL CO., LTD MS-91F2/MS-91F2
RIP: 0010:[<ffffffff81047d2a>] [<ffffffff81047d2a>] flush_tlb_others_ipi+0x11a/0x130
RSP: 0000:ffff8804144e3d00 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8804144e3d40 RCX: 0000000000000030
RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffffffff81e168d8
RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff880000000030 R12: ffff8804144e3cf0
R13: ffffffff8100bc0e R14: ffff880c16684000 R15: 00000000ffffffff
FS: 00002af6049842e0(0000) GS:ffff88082e440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002859008 CR3: 0000000c13cad000 CR4: 00000000000406e0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ic15
[ic15]mpt2sas0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)
=====================================================
I think this is a hardware problem, either with the CPU or the motherboard, or possibly a problem with the kernel version; I am not sure. We asked AMD for help, but they told us it is not their business. Maybe someone here can help us or give me some useful advice. Thanks |
Can you give us your full dmesg output? Curiously, you have two tainted modules, one of which has proprietary code, whereas the other is fully GPL compliant (http://www.novell.com/support/kb/doc.php?id=3582750). It appears that khugepaged is the culprit; maybe you can disable khugepaged (after a reboot, obviously) and see what happens:
Code:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled |
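Before and after writing to that file, it may help to confirm which setting is actually active, since a value written through sysfs does not survive a reboot. The following is only a minimal sketch: the function name `thp_state` and its optional root argument are hypothetical (the argument exists so the function can be pointed at a scratch directory), and it assumes the kernel marks the active choice with square brackets in the `enabled` file.

```shell
#!/bin/sh
# Print the active transparent hugepage setting.
# $1 (optional, hypothetical test hook): sysfs root, defaults to /sys.
thp_state() {
    root="${1:-/sys}"
    # RHEL/CentOS 6 kernels use the redhat_ prefixed directory; the second
    # path is the name used by mainline kernels.
    for dir in "$root/kernel/mm/redhat_transparent_hugepage" \
               "$root/kernel/mm/transparent_hugepage"; do
        if [ -r "$dir/enabled" ]; then
            # Assumption: the active value is the bracketed one, e.g. "[never]".
            sed -n 's/.*\[\(.*\)\].*/\1/p' "$dir/enabled"
            return 0
        fi
    done
    echo "unknown"
    return 1
}

thp_state || true
```

If the function prints `never` after the echo command above, the setting took effect for the running kernel.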
Quote:
The system has been reinstalled, so I will get the dmesg output later. We have installed the proprietary MLNX OFED driver. Which two modules do you mean? PS: each system has 64GB of memory. |
1 Attachment(s)
Hi,
The log from the kdump file is attached. |
It appears that you have a DRAM ECC error. Run memtest86 first and see whether it still shows memory errors. If not, your dump file shows errors in other modules as well.
|
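As a complement to memtest86, the EDAC driver keeps running totals of corrected and uncorrected memory errors in sysfs, which can be read without taking a node offline. The sketch below assumes the `amd64_edac_mod` module shown in the module list is loaded and exposes the usual `ce_count` and `ue_count` files; the function name `edac_totals` and its optional directory argument are hypothetical and exist only for illustration and testing.

```shell
#!/bin/sh
# Sum corrected (ce_count) and uncorrected (ue_count) ECC error counters
# across all EDAC memory controllers.
# $1 (optional, hypothetical test hook): EDAC memory controller directory.
edac_totals() {
    base="${1:-/sys/devices/system/edac/mc}"
    ce=0
    ue=0
    for mc in "$base"/mc*; do
        # Skip the unexpanded glob or controllers without both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "corrected=$ce uncorrected=$ue"
}

edac_totals
```

A corrected count that keeps growing on one node while other nodes stay at zero would point at a specific DIMM or slot rather than at the kernel.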
Quote:
If I add 'acpi=off noapic' to the kernel command line in GRUB, kernel panics occur much less often, but some servers still panic. |
Hi, how did you check the DRAM ECC error? Did you actually run memtest86?
|
Quote:
I think there may be a problem with the motherboard, perhaps with ACPI or some other hardware problem. |
Hi dumashu, I understand that your colleague has been checking the ECC errors for days. But how?
|
What kind of hardware are you running on? You might want to make sure the firmware and BIOS are up to date. We had random lockups and issues like the ones you are describing on our Dell systems until we turned off C-states and made sure that all the firmware was updated.
|
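When comparing nodes that panic with nodes that do not, it can be useful to confirm which boot parameters each node is really running with, for example the 'acpi=off noapic' options mentioned earlier in the thread. The sketch below is a hedged illustration: `has_kparam` is a hypothetical helper, and `processor.max_cstate=1` is only one possible way to limit C-states from the kernel side (the post above does not say how C-states were turned off; a BIOS setting is just as likely).

```shell
#!/bin/sh
# Return success if the exact parameter $1 appears on the kernel command line.
# $2 (optional, hypothetical test hook): file to read instead of /proc/cmdline.
has_kparam() {
    file="${2:-/proc/cmdline}"
    # One parameter per line, then match the whole line as a fixed string.
    tr ' ' '\n' < "$file" | grep -qxF -- "$1"
}

# Report the state of the parameters discussed in this thread.
# processor.max_cstate=1 is an assumption, not something the posters used.
for p in acpi=off noapic processor.max_cstate=1; do
    if has_kparam "$p"; then
        echo "$p: set"
    else
        echo "$p: not set"
    fi
done
```

Running this on every node (for instance through the cluster's parallel shell) shows quickly whether a workaround was applied consistently.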
Quote:
The ECC errors did not cause a kernel panic or freeze when they occurred, and we can find many ECC error entries in /var/log/messages. |
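Since the corrected errors are already in the system log, one way to see whether they cluster on a particular DIMM is to tally the `EDAC MCn: CE page ...` lines by memory controller, row and channel. This is a sketch only: `ce_by_dimm` is a hypothetical helper, and it assumes the log lines have the same layout as the `EDAC MC0: CE page 0x29aee9, ... row 3, channel 1` entry posted at the top of the thread.

```shell
#!/bin/sh
# Count corrected-error reports per memory controller, row and channel,
# most frequent first.
# $1 (optional): log file to scan, defaults to /var/log/messages.
ce_by_dimm() {
    sed -n 's/.*EDAC \(MC[0-9]*\): CE page.*row \([0-9]*\), channel \([0-9]*\).*/\1 row=\2 channel=\3/p' \
        "${1:-/var/log/messages}" |
    sort | uniq -c | sort -rn |
    awk '{print $1, $2, $3, $4}'   # strip the padding added by uniq -c
}

if [ -r /var/log/messages ]; then
    ce_by_dimm
fi
```

Each output line has the form `count MCn row=R channel=C`. A single row and channel dominating the list on one node suggests swapping that DIMM first.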
Quote:
The BIOS firmware was already updated, but we still think there is a problem with the motherboard. Thanks for your suggestion; we will try that. |