LinuxQuestions.org


dumashu 07-14-2012 08:39 PM

CentOS 6.2 system kernel panic(dead or freeze) on new hardware
 
Hi everyone,

We set up a new cluster a few days ago: about 50 servers, each with two 16-core AMD 6272 CPUs, running CentOS 6.2 x64. The systems kernel panic or freeze at random.
Some logs that my colleague sent me are attached below.

log:

*****************
2012/7/11
++++++++++++++++++++++++++++++
c57 ,c45,c56,c7,c87,c107,c54,c69,c21,c105,c96,c80,c39,c105,c21,c6,c13,c104,c18,c19,c9,
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
[ic56][Hardware Error]: MC4_STATUS[Over|CE|-|-|AddrV|-|-|CECC]: 0xd430c000ff080813
[ic56][Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[ic56]EDAC amd64 MC0: CE ERROR_ADDRESS= 0x29aee9d40
[ic56]EDAC MC0: CE page 0x29aee9, offset 0xd40, grain 0, syndrome 0xff61, row 3, channel 1, label "": amd64_edac

++++++++++++++

IB reboot c33

+++++++++++++

dead
c31 (2.45h kernel panic detected hard lockup cpu 27)
c94 c95 ( 2.50h kernel panic detected hard lockup cpu)
c98 (3h kernel panic detected hard lockup cpu22)
c73
c15
c66


+++++++++++++++++++++++++++

c3 2 h
c2 3.27 h
BUG: soft lockup - CPU#6 stuck for 67s! [IMB-MPI1:17787]
[c3]Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT CpuSpy(P)(U) xt_CHECKSUM iptable_mangle knem(U) bridge autofs4 sunrpc 8021q garp stp llc iptable_filter ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) vhost_net macvtap macvlan tun kvm_amd kvm sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[c3]CPU 6
[c3]Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT CpuSpy(P)(U) xt_CHECKSUM iptable_mangle knem(U) bridge autofs4 sunrpc 8021q garp stp llc iptable_filter ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) vhost_net macvtap macvlan tun kvm_amd kvm sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[c3]
[c3]Pid: 17787, comm: IMB-MPI1 Tainted: P ---------------- 2.6.32-220.13.1.el6.x86_64 #1 MICRO-STAR INTERNATIONAL CO., LTD MS-91F2/MS-91F2
[c3]RIP: 0010:[<ffffffff814ef83e>] [<ffffffff814ef83e>] _spin_lock+0x1e/0x30
[c3]RSP: 0018:ffff8809732459f8 EFLAGS: 00000293
[c3]RAX: 0000000000000129 RBX: ffff8809732459f8 RCX: 0000000000000000
[c3]RDX: 0000000000000127 RSI: ffff880973245a78 RDI: ffffffff81fab308
[c3]RBP: ffffffff8100bc0e R08: ffffffff81c002c0 R09: 0000000000000000
[c3]R10: 0000000000000010 R11: 0000000000001000 R12: 0000000000000000
[c3]R13: 0000000000001000 R14: ffffffff8113bd57 R15: ffff880973245a28
[c3]FS: 00002b26791b4480(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
[c3]CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[c3]CR2: 00002b267c31e230 CR3: 000000015c096000 CR4: 00000000000406e0
[c3]DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[c3]DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[c3]Process IMB-MPI1 (pid: 17787, threadinfo ffff880973244000, task ffff880973243540)
[c3]Stack:
[c3] ffff880973245a58 ffffffff81148684 ffff88000004c218 ffff880973245a38
[c3]<0> ffff880973245a18 ffff880973245a18 ffff880415c980c0 0000000000000001
[c3]<0> 0000000000000030 ffff8800fc8c5000 ffff880c2e572cb0 ffff880c2e572d58
[c3]Call Trace:
[c3] [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0
[c3] [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180
[c3] [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530
[c3] [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130
[c3] [<ffffffff81072cf0>] ? __is_ram+0x0/0x10
[c3] [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40
[c3] [<ffffffff81046502>] ? reserve_memtype+0x492/0x590
[c3] [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40
[c3] [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90
[c3] [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0
[c3] [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190
[c3] [<ffffffff81177f58>] ? alloc_file+0x98/0xe0
[c3] [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200
[c3] [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib]
[c3] [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs]
[c3] [<ffffffff811421c0>] ? mmap_region+0x400/0x590
[c3] [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380
[c3] [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0
[c3] [<ffffffff8117724c>] ? sys_write+0x7c/0x90
[c3] [<ffffffff81010469>] ? sys_mmap+0x29/0x30
[c3] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
[c3]Code: 00 00 00 01 74 05 e8 62 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
[c3]Call Trace:
[c3] [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0
[c3] [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180
[c3] [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530
[c3] [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130
[c3] [<ffffffff81072cf0>] ? __is_ram+0x0/0x10
[c3] [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40
[c3] [<ffffffff81046502>] ? reserve_memtype+0x492/0x590
[c3] [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40
[c3] [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90
[c3] [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0
[c3] [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190
[c3] [<ffffffff81177f58>] ? alloc_file+0x98/0xe0
[c3] [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200
[c3] [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib]
[c3] [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs]
[c3] [<ffffffff811421c0>] ? mmap_region+0x400/0x590
[c3] [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380
[c3] [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0
[c3] [<ffffffff8117724c>] ? sys_write+0x7c/0x90
[c3] [<ffffffff81010469>] ? sys_mmap+0x29/0x30
[c3] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
[c3]BUG: soft lockup - CPU#16 stuck for 67s! [IMB-MPI1:16438]

kernel:Code: e8 5e 83 44 00 0f ae f0 48 8b 7b 30 ff 15 49 ba 9e 00 80 7d c7 00 0f 84 9f fe ff ff f6 43 20 01 0f 84 95 fe ff ff 0f 1f 44 00 00 <f3> 90 f6 43 20 01 75 f8 e9 83 fe ff ff 0f 1f 00 4c 89 ea 4c 89

[root@compute-0-3 ~]#
Message from syslogd@compute-0-3 at Jul 11 20:25:18 ...
kernel:Stack:

Message from syslogd@compute-0-3 at Jul 11 20:25:18 ...

kernel:Call Trace:


Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8117724c>] ? sys_write+0x7c/0x90

Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81010469>] ? sys_mmap+0x29/0x30
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Jul 11 20:26:42 compute-0-3 kernel: Code: 00 00 00 01 74 05 e8 62 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Jul 11 20:26:42 compute-0-3 kernel: Call Trace:
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81072cf0>] ? __is_ram+0x0/0x10
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81046502>] ? reserve_memtype+0x492/0x590
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81177f58>] ? alloc_file+0x98/0xe0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib]
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs]
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff811421c0>] ? mmap_region+0x400/0x590
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8117724c>] ? sys_write+0x7c/0x90
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81010469>] ? sys_mmap+0x29/0x30
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b


***************************

c24 c56 c103,c67 dead kernel error



************************************

2012.7.12


kernel dead c42,c29


c61,c28

ic55,ic74

c15,c67,c66


********************

after flash new bios
ic103,ic15,ic24,ic28,ic31,ic41,ic56,ic61,ic66,ic67,ic73,ic87

*************************

dead ic16
dead ic40,ic41,ic67
dead ic24,ic37,ic79,ic88,ic95,c27,c63
dead ic60
2700s dead ic98
548s dead ic42
1000s dead ic60,ic30,ic67
4528s dead ic94
3734s dead ic73
1s dead ic16
3000s dead ic91
1272s dead ic66
dead ic77

******************************************

ic47,ic56,ic63,ic81,ic95

######dead 2012/7/14######################
ic17,ic18,ic2,ic26,ic42,ic43,ic5,ic56,ic66,ic70,ic73,ic9,ic91,ic95


change os

ic66,ic67,ic56,ic73,ic16,ic41,ic24,ic95,ic60,ic87


dead

ic48
ic29

300G 60*60


$$$$$$$$$$$$$$$$$$$$$$$

machine check
ic57,ic17,ic45

dead

ic29,
ic79,ic2,
ic74,ic29
ic39,ic14,ic34

Message from syslogd@compute-0-0 at Jul 15 06:52:27 ...

kernel:[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00c00098080813

Message from syslogd@compute-0-0 at Jul 15 06:52:27 ...

kernel:[Hardware Error]: Northbridge Error (node 3): DRAM ECC error detected on the NB.

Message from syslogd@compute-0-0 at Jul 15 06:52:27 ...

kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)



BUG: soft lockup - CPU#18 stuck for 67s! [khugepaged:343]

Modules linked in: knem(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler target_core_iblock target_core_file target_core_pscsi 8021q target_core_mod garp stp configfs llc sunrpc cachefiles fscache(T) ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) uinput sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class pata_acpi ata_generic pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 18
Modules linked in: knem(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler target_core_iblock target_core_file target_core_pscsi 8021q target_core_mod garp stp configfs llc sunrpc cachefiles fscache(T) ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) uinput sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class pata_acpi ata_generic pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 343, comm: khugepaged Tainted: G ---------------- T 2.6.32-220.el6.x86_64 #1 MICRO-STAR INTERNATIONAL CO., LTD MS-91F2/MS-91F2

RIP: 0010:[<ffffffff81047d2a>] [<ffffffff81047d2a>] flush_tlb_others_ipi+0x11a/0x130
RSP: 0000:ffff8804144e3d00 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8804144e3d40 RCX: 0000000000000030
RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffffffff81e168d8
RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff880000000030 R12: ffff8804144e3cf0
R13: ffffffff8100bc0e R14: ffff880c16684000 R15: 00000000ffffffff
FS: 00002af6049842e0(0000) GS:ffff88082e440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002859008 CR3: 0000000c13cad000 CR4: 00000000000406e0

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

ic15
[ic15]mpt2sas0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)


=====================================================

I think this is a hardware problem with the CPU or the motherboard, or possibly a kernel version issue, but I am not sure. We asked AMD for help, but they told us it is not their business.

Maybe someone here can help us or give me some useful advice.
Thanks

rch 07-15-2012 01:57 PM

Can you give us your full dmesg output? Curiously, you have two tainted modules, one of which has proprietary code, whereas the other one is fully GPL compliant (http://www.novell.com/support/kb/doc.php?id=3582750). It appears that khugepaged is the culprit; maybe you can disable khugepaged (after a reboot, obviously) and see what happens:
Code:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
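
Before and after writing to that file, it can help to confirm which setting is actually active. The kernel marks the current choice with square brackets when the file is read. The following is only a minimal sketch: the function name active_setting is made up for illustration, and the default path is the RHEL/CentOS 6 one quoted above.

```shell
#!/bin/sh
# Print the active value from a sysfs-style multiple-choice file, where the
# kernel marks the current setting with square brackets, e.g. "[always] never".
# active_setting is a hypothetical helper name, not an existing tool.
active_setting() {
    # $1: file to read; defaults to the THP control file named in this thread
    file="${1:-/sys/kernel/mm/redhat_transparent_hugepage/enabled}"
    # Keep only the text between the first "[" and the following "]"
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}
```

Calling active_setting with no argument on a node reads the default path. Note that a value written with echo does not survive a reboot, so it has to be reapplied at boot if the test is to continue.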

dumashu 07-15-2012 08:21 PM

Quote:

Originally Posted by rch (Post 4728859)
Can you give us your full dmesg output? Curiously, you have two tainted modules, one of which has proprietary code, whereas the other one is fully GPL compliant (http://www.novell.com/support/kb/doc.php?id=3582750). It appears that khugepaged is the culprit; maybe you can disable khugepaged (after a reboot, obviously) and see what happens:
Code:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

Hi rch,

The system has been reinstalled. I will get the dmesg output later.

We have installed the proprietary MLNX OFED driver.

Which two modules do you mean?

PS. Each system has 64 GB of memory.

dumashu 07-16-2012 03:33 AM

1 Attachment(s)
Hi,

The log from the kdump file is attached.

rch 07-17-2012 04:26 PM

It appears that you have a DRAM ECC error. Run memtest86 first and see whether it still shows memory errors. If not, your dump file shows errors in other modules as well.

dumashu 07-17-2012 08:08 PM

Quote:

Originally Posted by rch (Post 4730995)
It appears that you have a DRAM ECC error. Run memtest86 first and see whether it still shows memory errors. If not, your dump file shows errors in other modules as well.

We have checked the ECC errors; they do not make the system kernel panic.

If I add 'acpi=off noapic' to the kernel command line in grub, kernel panics occur much less often, but some servers still panic.
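
With about 50 nodes it is easy to lose track of which ones actually booted with the extra parameters. A minimal sketch for checking one node, assuming the parameters are compared against /proc/cmdline; the function name has_boot_params is hypothetical:

```shell
#!/bin/sh
# Succeed (exit status 0) only if every listed parameter appears as a whole
# word in the given kernel command line file.
# has_boot_params is a hypothetical helper name used for illustration.
has_boot_params() {
    # $1: command line file (for example /proc/cmdline); rest: parameters
    cmdline_file="$1"
    shift
    for param in "$@"; do
        # -F: fixed string, -w: whole-word match, -q: no output
        grep -qwF -- "$param" "$cmdline_file" || return 1
    done
    return 0
}
```

For example, `has_boot_params /proc/cmdline acpi=off noapic && echo "workaround active"` prints the message only when both parameters are present on the running kernel.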

rch 07-18-2012 12:20 PM

Hi, how did you check the DRAM ECC error? Did you actually run memtest86?

dumashu 07-18-2012 08:11 PM

Quote:

Originally Posted by rch (Post 4731850)
Hi, how did you check the DRAM ECC error? Did you actually run memtest86?

My colleague has been checking the ECC errors for days. When they occurred, the system did not kernel panic.

I think there may be a problem with the motherboard, perhaps ACPI or a hardware fault.

rch 07-19-2012 11:39 AM

Hi dumashu, I understand that your colleague has checked the ECC errors for days. But how?

grim76 07-19-2012 09:10 PM

What kind of hardware are you running on? You might want to make sure the firmware and BIOS are up to date. We had some random lockups and issues like the ones you are describing on our Dell systems until we turned off C-states and made sure that all the firmware was updated.

dumashu 07-19-2012 10:27 PM

Quote:

Originally Posted by rch (Post 4732861)
Hi dumashu, I understand that your colleague has checked the ECC errors for days. But how?


The ECC errors did not make the system kernel panic or freeze when they occurred, and we can find many ECC error entries in /var/log/messages.

dumashu 07-19-2012 10:33 PM

Quote:

Originally Posted by grim76 (Post 4733269)
What kind of hardware are you running on? You might want to make sure the firmware and BIOS are up to date. We had some random lockups and issues like the ones you are describing on our Dell systems until we turned off C-states and made sure that all the firmware was updated.

The hardware is very new; the motherboard is made by Micro-Star.
The BIOS firmware has already been updated, but we still suspect a problem with the motherboard.
Thanks for your suggestion, we will try that.

rch 07-20-2012 07:03 PM

Quote:

Originally Posted by dumashu (Post 4733300)
The ECC errors did not make the system kernel panic or freeze when they occurred, and we can find many ECC error entries in /var/log/messages.

You can find many ECC errors in the log, and you say that it is not a memory problem? Run memtest86 and let it check the memory; your memory is probably under warranty and can be replaced. Download a memtest86 ISO from http://www.memtest86.com/, burn it to a CD, and then run the memory test offline. This is the best advice that I can give you. Also, there is a program called mcelog that checks and reports on hardware and memory errors.
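
Before pulling DIMMs, it may also help to see whether the corrected errors cluster on one northbridge node. The following is a rough sketch, assuming the log lines have the same format as the "Northbridge Error (node N): DRAM ECC error detected on the NB." entries quoted earlier in this thread; the function name ecc_errors_by_node is made up for illustration.

```shell
#!/bin/sh
# Count "DRAM ECC error detected" entries per northbridge node in a log file.
# Prints lines of the form "<count> node <n>", highest count first.
# ecc_errors_by_node is a hypothetical helper name, not an existing tool.
ecc_errors_by_node() {
    # $1: log file to scan; defaults to the syslog file mentioned in the thread
    log="${1:-/var/log/messages}"
    grep 'DRAM ECC error detected' "$log" \
        | sed -n 's/.*Northbridge Error (node \([0-9][0-9]*\)).*/node \1/p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2, $3 }'   # strip the padding added by uniq -c
}
```

Run on each affected host, this gives a quick tally such as "2 node 0" followed by "1 node 3". A node that dominates the count points at the memory attached to that CPU die, which narrows down what to test offline first.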

dumashu 07-21-2012 09:43 AM

Quote:

Originally Posted by rch (Post 4734153)
You can find many ECC errors in the log, and you say that it is not a memory problem? Run memtest86 and let it check the memory; your memory is probably under warranty and can be replaced. Download a memtest86 ISO from http://www.memtest86.com/, burn it to a CD, and then run the memory test offline. This is the best advice that I can give you. Also, there is a program called mcelog that checks and reports on hardware and memory errors.

Hi rch, thanks for your advice. We will check the memory and motherboard for further analysis.


All times are GMT -5.