LinuxQuestions.org
Old 07-14-2012, 08:39 PM   #1
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Rep: Reputation: Disabled
CentOS 6.2 system kernel panic (dead or freeze) on new hardware


Hi everyone,

We set up a new cluster a few days ago: about 50 servers, each with two AMD 6272 16-core CPUs, running CentOS 6.2 x64. The systems kernel panic or freeze at random.
Some logs that my colleague sent me are attached below.

log:

*****************
2012/7/11
++++++++++++++++++++++++++++++
c57 ,c45,c56,c7,c87,c107,c54,c69,c21,c105,c96,c80,c39,c105,c21,c6,c13,c104,c18,c19,c9,
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
[ic56][Hardware Error]: MC4_STATUS[Over|CE|-|-|AddrV|-|-|CECC]: 0xd430c000ff080813
[ic56][Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
[ic56]EDAC amd64 MC0: CE ERROR_ADDRESS= 0x29aee9d40
[ic56]EDAC MC0: CE page 0x29aee9, offset 0xd40, grain 0, syndrome 0xff61, row 3, channel 1, label "": amd64_edac

++++++++++++++

IB reboot c33

+++++++++++++

dead
c31 (2.45h kernel panic detected hard lockup cpu 27)
c94 c95 ( 2.50h kernel panic detected hard lockup cpu)
c98 (3h kernel panic detected hard lockup cpu22)
c73
c15
c66


+++++++++++++++++++++++++++

c3 2 h
c2 3.27 h
BUG: soft lockup - CPU#6 stuck for 67s! [IMB-MPI1:17787]
[c3]Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_connt
rack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT CpuSpy(P)(U) xt_CHECKSUM iptable_mangle knem(U) bridge autofs4 s
unrpc 8021q garp stp llc iptable_filter ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U)
ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) vhos
t_net macvtap macvlan tun kvm_amd kvm sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp
ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci dm_mirro
r dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[c3]CPU 6
[c3]Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_connt
rack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT CpuSpy(P)(U) xt_CHECKSUM iptable_mangle knem(U) bridge autofs4 s
unrpc 8021q garp stp llc iptable_filter ip_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U)
ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) vhos
t_net macvtap macvlan tun kvm_amd kvm sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp
ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class ata_generic pata_acpi pata_atiixp ahci dm_mirro
r dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[c3]
[c3]Pid: 17787, comm: IMB-MPI1 Tainted: P ---------------- 2.6.32-220.13.1.el6.x86_64 #1 MICRO-STAR I
NTERNATIONAL CO., LTD MS-91F2/MS-91F2
[c3]RIP: 0010:[<ffffffff814ef83e>] [<ffffffff814ef83e>] _spin_lock+0x1e/0x30
[c3]RSP: 0018:ffff8809732459f8 EFLAGS: 00000293
[c3]RAX: 0000000000000129 RBX: ffff8809732459f8 RCX: 0000000000000000
[c3]RDX: 0000000000000127 RSI: ffff880973245a78 RDI: ffffffff81fab308
[c3]RBP: ffffffff8100bc0e R08: ffffffff81c002c0 R09: 0000000000000000
[c3]R10: 0000000000000010 R11: 0000000000001000 R12: 0000000000000000
[c3]R13: 0000000000001000 R14: ffffffff8113bd57 R15: ffff880973245a28
[c3]FS: 00002b26791b4480(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
[c3]CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[c3]CR2: 00002b267c31e230 CR3: 000000015c096000 CR4: 00000000000406e0
[c3]DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[c3]DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[c3]Process IMB-MPI1 (pid: 17787, threadinfo ffff880973244000, task ffff880973243540)
[c3]Stack:
[c3] ffff880973245a58 ffffffff81148684 ffff88000004c218 ffff880973245a38
[c3]<0> ffff880973245a18 ffff880973245a18 ffff880415c980c0 0000000000000001
[c3]<0> 0000000000000030 ffff8800fc8c5000 ffff880c2e572cb0 ffff880c2e572d58
[c3]Call Trace:
[c3] [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0
[c3] [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180
[c3] [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530
[c3] [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130
[c3] [<ffffffff81072cf0>] ? __is_ram+0x0/0x10
[c3] [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40
[c3] [<ffffffff81046502>] ? reserve_memtype+0x492/0x590
[c3] [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40
[c3] [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90
[c3] [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0
[c3] [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190
[c3] [<ffffffff81177f58>] ? alloc_file+0x98/0xe0
[c3] [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200
[c3] [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib]
[c3] [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs]
[c3] [<ffffffff811421c0>] ? mmap_region+0x400/0x590
[c3] [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380
[c3] [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0
[c3] [<ffffffff8117724c>] ? sys_write+0x7c/0x90
[<ffffffff81010469>] ? sys_mmap+0x29/0x30
[c3] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
[c3]Code: 00 00 00 01 74 05 e8 62 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c
1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
[c3]Call Trace:
[c3] [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0
[c3] [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180
[c3] [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530
[c3] [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130
[c3] [<ffffffff81072cf0>] ? __is_ram+0x0/0x10
[c3] [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40
[c3] [<ffffffff81046502>] ? reserve_memtype+0x492/0x590
[c3] [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40
[c3] [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
[c3] [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90
[c3] [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0
[c3] [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190
[c3] [<ffffffff81177f58>] ? alloc_file+0x98/0xe0
[c3] [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200
[c3] [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib]
[c3] [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs]
[c3] [<ffffffff811421c0>] ? mmap_region+0x400/0x590
[c3] [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380
[c3] [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0
[c3] [<ffffffff8117724c>] ? sys_write+0x7c/0x90
[c3] [<ffffffff81010469>] ? sys_mmap+0x29/0x30
[c3] [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
[c3]BUG: soft lockup - CPU#16 stuck for 67s! [IMB-MPI1:16438]

kernel:Code: e8 5e 83 44 00 0f ae f0 48 8b 7b 30 ff 15 49 ba 9e 00 80 7d c7 00 0f 84 9f fe ff ff f6 43 20 01 0f 84 95 fe ff ff 0f 1f 44 00 00 <f3> 90 f6 43 20 01 75 f8 e9 83 fe ff ff 0f 1f 00 4c 89 ea 4c 89

[root@compute-0-3 ~]#
Message from syslogd@compute-0-3 at Jul 11 20:25:18 ...
kernel:Stack:

Message from syslogd@compute-0-3 at Jul 11 20:25:18 ...

kernel:Call Trace:


Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8117724c>] ? sys_write+0x7c/0x90

Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81010469>] ? sys_mmap+0x29/0x30
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
Jul 11 20:26:42 compute-0-3 kernel: Code: 00 00 00 01 74 05 e8 62 79 d8 ff c9 c3 55 48 89 e5 0f 1f 44 00 00 b8 00 00 01 00 f0 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 <0f> b7 17 eb f5 83 3f 00 75 f4 eb df c9 c3 0f 1f 40 00 55 48 89
Jul 11 20:26:42 compute-0-3 kernel: Call Trace:
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81148684>] ? __purge_vmap_area_lazy+0x174/0x1e0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8114a23d>] ? vm_unmap_aliases+0x16d/0x180
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff810446be>] ? change_page_attr_set_clr+0xbe/0x530
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81073974>] ? walk_system_ram_range+0x64/0x130
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81072cf0>] ? __is_ram+0x0/0x10
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8104539f>] ? _set_memory_uc+0x2f/0x40
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81046502>] ? reserve_memtype+0x492/0x590
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff810434eb>] ? ioremap_change_attr+0x2b/0x40
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81045d56>] ? kernel_map_sync_memtype+0x86/0xf0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8104674d>] ? reserve_pfn_range+0x14d/0x1e0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81046863>] ? track_pfn_vma_new+0x83/0x90
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81138554>] ? remap_pfn_range+0xa4/0x4a0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8115f7e2>] ? kmem_cache_alloc+0x182/0x190
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81177f58>] ? alloc_file+0x98/0xe0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff811baa88>] ? anon_inode_getfile+0x128/0x200
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffffa0234ab3>] ? mlx4_ib_mmap+0x83/0x100 [mlx4_ib]
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffffa026502c>] ? ib_uverbs_mmap+0x2c/0x30 [ib_uverbs]
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff811421c0>] ? mmap_region+0x400/0x590
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8114268a>] ? do_mmap_pgoff+0x33a/0x380
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81132120>] ? sys_mmap_pgoff+0x200/0x2d0
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8117724c>] ? sys_write+0x7c/0x90
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff81010469>] ? sys_mmap+0x29/0x30
Jul 11 20:26:42 compute-0-3 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b


***************************

c24 c56 c103,c67 dead kernel error



************************************

2012.7.12


kernel dead c42,c29


c61,c28

ic55,ic74

c15,c67,c66


********************

after flash new bios
ic103,ic15,ic24,ic28,ic31,ic41,ic56,ic61,ic66,ic67,ic73,ic87

*************************

dead ic16
dead ic40,ic41,ic67
dead ic24,ic37,ic79,ic88,ic95,c27,c63
dead ic60
2700s dead ic98
548s dead ic42
1000s dead ic60,ic30,ic67
4528s dead ic94
3734s dead ic73
1s dead ic16
3000s dead ic91
1272s dead ic66
dead ic77

******************************************

ic47,ic56,ic63,ic81,ic95

######dead 2012/7/14######################
ic17,ic18,ic2,ic26,ic42,ic43,ic5,ic56,ic66,ic70,ic73,ic9,ic91,ic95


change os

ic66,ic67,ic56,ic73,ic16,ic41,ic24,ic95,ic60,ic87


dead

ic48
ic29

300G 60*60


$$$$$$$$$$$$$$$$$$$$$$$

machine check
ic57,ic17,ic45

dead

ic29,
ic79,ic2,
ic74,ic29
ic39,ic14,ic34

Message from syslogd@compute-0-0 at Jul 15 06:52:27 ...

kernel:[Hardware Error]: MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00c00098080813

Message from syslogd@compute-0-0 at Jul 15 06:52:27 ...

kernel:[Hardware Error]: Northbridge Error (node 3): DRAM ECC error detected on the NB.

Message from syslogd@compute-0-0 at Jul 15 06:52:27 ...

kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)



BUG: soft lockup - CPU#18 stuck for 67s! [khugepaged:343]

Modules linked in: knem(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler target_core_iblock target_core_file target_core_pscsi 8021q target_core_mod garp stp configfs llc sunrpc cachefiles fscache(T) ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) uinput sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class pata_acpi ata_generic pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 18
Modules linked in: knem(U) autofs4 ipmi_devintf ipmi_si ipmi_msghandler target_core_iblock target_core_file target_core_pscsi 8021q target_core_mod garp stp configfs llc sunrpc cachefiles fscache(T) ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables rdma_ucm(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ib_cm(U) ib_sa(U) ipv6 ib_uverbs(U) ib_umad(U) iw_nes(U) libcrc32c iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) uinput sg igb dca microcode amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif mpt2sas scsi_transport_sas raid_class pata_acpi ata_generic pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 343, comm: khugepaged Tainted: G ---------------- T 2.6.32-220.el6.x86_64 #1 MICRO-STAR INTERNATIONAL CO., LTD MS-91F2/MS-91F2

RIP: 0010:[<ffffffff81047d2a>] [<ffffffff81047d2a>] flush_tlb_others_ipi+0x11a/0x130
RSP: 0000:ffff8804144e3d00 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8804144e3d40 RCX: 0000000000000030
RDX: 0000000000000000 RSI: 0000000000000030 RDI: ffffffff81e168d8
RBP: ffffffff8100bc0e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff880000000030 R12: ffff8804144e3cf0
R13: ffffffff8100bc0e R14: ffff880c16684000 R15: 00000000ffffffff
FS: 00002af6049842e0(0000) GS:ffff88082e440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002859008 CR3: 0000000c13cad000 CR4: 00000000000406e0

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

ic15
[ic15]mpt2sas0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)


=====================================================

I think this is a hardware problem, possibly with the CPU or the motherboard, or a kernel version problem; I am not sure. We asked AMD for help, but they told us it is not their business.

Maybe someone here can help us or give me some useful advice.
Thanks

Last edited by dumashu; 07-14-2012 at 09:26 PM.
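
The MC4_STATUS values in the logs above can be cross-checked by hand. The following is only a rough sketch, not an official decoder: it assumes the usual x86 machine-check bit layout (Val=63, Over=62, UC=61, En=60, MiscV=59, AddrV=58, PCC=57) plus the AMD-specific CECC=46 and UECC=45 bits, and it assumes 4 KiB pages when rebuilding the EDAC error address from the page and offset. The function names are made up for this example.

```shell
#!/bin/sh
# Decode selected flag bits of a machine-check status value such as the
# MC4_STATUS numbers in the log. Bit positions are assumptions stated above.
mc_status_flags() {
    # Only the upper 32 bits carry the flags we look at, so split them off;
    # this keeps the arithmetic inside the signed 64-bit range of any shell.
    hex=$(printf '%16s' "${1#0x}" | tr ' ' '0')
    hi=$(( 0x$(printf '%s' "$hex" | cut -c1-8) ))
    out=""
    for pair in Val:63 Over:62 UC:61 En:60 MiscV:59 AddrV:58 PCC:57 CECC:46 UECC:45; do
        name=${pair%%:*}
        bit=${pair##*:}
        if [ $(( (hi >> (bit - 32)) & 1 )) -eq 1 ]; then
            out="${out:+$out|}$name"
        fi
    done
    printf '%s\n' "$out"
}

# Rebuild a physical address from the EDAC "page" and "offset" fields,
# assuming 4 KiB pages (page number shifted left by 12 bits).
edac_address() {
    printf '0x%x\n' $(( ($1 << 12) | $2 ))
}

mc_status_flags 0xd430c000ff080813   # prints Val|Over|En|AddrV|CECC
mc_status_flags 0x9c00c00098080813   # prints Val|En|MiscV|AddrV|CECC
edac_address 0x29aee9 0xd40          # prints 0x29aee9d40
```

Both status values have UC clear and CECC set, which agrees with the "CE" and "CECC" fields the kernel printed, and the rebuilt address matches the ERROR_ADDRESS line in the log.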
 
Old 07-15-2012, 01:57 PM   #2
rch
Member
 
Registered: Feb 2003
Location: Santa Clara,CA
Distribution: Mandriva
Posts: 909

Rep: Reputation: 48
Can you give us your full dmesg output? Curiously, you have two tainted modules, one of which has proprietary code, whereas the other one is fully GPL compliant (http://www.novell.com/support/kb/doc.php?id=3582750). It appears that khugepaged is the culprit; maybe you can disable khugepaged (after a reboot, obviously) and see what happens:
Code:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
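
Before and after writing to that file it may help to confirm which setting is active. The sketch below assumes the file lists the available modes with the active one in square brackets (for example "[always] never"); the helper name thp_selected is made up for this example. Note that a value written with echo does not survive a reboot.

```shell
#!/bin/sh
# Print the token enclosed in [brackets], i.e. the currently active mode.
thp_selected() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Path used on RHEL/CentOS 6 kernels, as in the post above.
THP=/sys/kernel/mm/redhat_transparent_hugepage

# Only report if the file exists on this machine.
if [ -r "$THP/enabled" ]; then
    echo "THP mode: $(thp_selected "$(cat "$THP/enabled")")"
fi
```

If the reported mode is still "always" after the echo, the write did not take effect (for example because it was not run as root).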
 
Old 07-15-2012, 08:21 PM   #3
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by rch View Post
Can you give us your full dmesg output? Curiously, you have two tainted modules, one of which has proprietary code, whereas the other one is fully GPL compliant (http://www.novell.com/support/kb/doc.php?id=3582750). It appears that khugepaged is the culprit; maybe you can disable khugepaged (after a reboot, obviously) and see what happens:
Code:
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
Hi rch,

The system has been reinstalled. I will get it later.

We have installed the proprietary MLNX OFED driver.

Which two modules do you mean?

PS. Each system has 64 GB of memory.
 
Old 07-16-2012, 03:33 AM   #4
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Hi,

The log from the kdump file is attached.
Attached Files
File Type: log crash.log (108.1 KB, 3 views)

Last edited by dumashu; 07-16-2012 at 03:36 AM.
 
Old 07-17-2012, 04:26 PM   #5
rch
Member
 
Registered: Feb 2003
Location: Santa Clara,CA
Distribution: Mandriva
Posts: 909

Rep: Reputation: 48
It appears that you have a DRAM ECC error. Run memtest86 first and see if it still shows memory errors. If not, your dump file shows errors in other modules as well.
 
Old 07-17-2012, 08:08 PM   #6
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by rch View Post
It appears that you have a DRAM ECC error. Run memtest86 first and see if it still shows memory errors. If not, your dump file shows errors in other modules as well.
We have checked the ECC errors; they do not cause the kernel panic.

If I add 'acpi=off noapic' to the kernel command line in GRUB, kernel panics occur much less often, but some servers still panic.
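
For anyone who wants to try the same parameters on many nodes, here is a minimal sketch of appending them to every kernel line of a GRUB legacy style configuration. It assumes the stanzas look like those in /boot/grub/grub.conf on CentOS 6, with lines beginning with "kernel"; the function name add_kernel_args is invented for this example, and it only filters text from standard input to standard output so the result can be reviewed before replacing the real file.

```shell
#!/bin/sh
# Append extra parameters to every "kernel" line read from stdin.
# Other lines (title, root, initrd, comments) pass through unchanged.
add_kernel_args() {
    awk -v args="$1" '
        /^[ \t]*kernel[ \t]/ { $0 = $0 " " args }
        { print }
    '
}

# Example usage (writes a candidate file instead of editing in place):
#   add_kernel_args "acpi=off noapic" < /boot/grub/grub.conf > /tmp/grub.conf.new
# After the reboot, /proc/cmdline shows which parameters the running
# kernel actually received.
```

Writing to a separate file first is a deliberate choice here: a broken grub.conf on a remote compute node is harder to recover from than a lockup.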
 
Old 07-18-2012, 12:20 PM   #7
rch
Member
 
Registered: Feb 2003
Location: Santa Clara,CA
Distribution: Mandriva
Posts: 909

Rep: Reputation: 48
Hi, how did you check the DRAM ECC error? Did you actually run memtest86?
 
Old 07-18-2012, 08:11 PM   #8
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by rch View Post
Hi, how did you check the DRAM ECC error? Did you actually run memtest86?
My colleague has been checking the ECC errors for days. When they occurred, the system did not kernel panic.

I think there may be a problem with the motherboard, perhaps ACPI or some other hardware problem.
 
Old 07-19-2012, 11:39 AM   #9
rch
Member
 
Registered: Feb 2003
Location: Santa Clara,CA
Distribution: Mandriva
Posts: 909

Rep: Reputation: 48
Hi dumashu, I understand that your colleague has checked the ECC errors for days. But how?
 
Old 07-19-2012, 09:10 PM   #10
grim76
Member
 
Registered: Jun 2007
Distribution: Debian, SLES, Ubuntu
Posts: 281

Rep: Reputation: 47
What kind of hardware are you running on? You might want to make sure the firmware and BIOS are up to date. We had random lockups and issues like the ones you are describing on our Dell systems until we turned off C-states and made sure that all the firmware was updated.
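
To see which idle states a node currently exposes before changing anything in the BIOS, something like the sketch below may help. It assumes the kernel provides the cpuidle sysfs directory for cpu0 (this may not be present on every kernel or platform); the function name list_cstates is made up here.

```shell
#!/bin/sh
# List the idle states (C-states) found in a cpuidle sysfs directory.
# The directory defaults to cpu0; pass another path as the first argument.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        # Skip silently if the directory or the name file is missing.
        [ -r "$state/name" ] || continue
        printf '%s %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}

# Example usage:
#   list_cstates
```

If only a polling or C1 state is listed after the BIOS change, the deeper C-states are effectively off. The kernel parameter processor.max_cstate=1 is another way to limit them, though whether it helps on this board is untested here.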
 
Old 07-19-2012, 10:27 PM   #11
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by rch View Post
Hi dumashu, I understand that your colleague has checked the ECC errors for days. But how?

The ECC errors did not make the system kernel panic or freeze when they happened, and we can find many ECC error entries in /var/log/messages.
 
Old 07-19-2012, 10:33 PM   #12
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by grim76 View Post
What kind of hardware are you running on? You might want to make sure the firmware and BIOS are up to date. We had random lockups and issues like the ones you are describing on our Dell systems until we turned off C-states and made sure that all the firmware was updated.
The hardware is very new; the motherboard is made by Micro-Star.
The BIOS firmware has already been updated, but we still suspect a problem with the motherboard.
Thanks for your suggestion; we will try that.
 
Old 07-20-2012, 07:03 PM   #13
rch
Member
 
Registered: Feb 2003
Location: Santa Clara,CA
Distribution: Mandriva
Posts: 909

Rep: Reputation: 48
Quote:
Originally Posted by dumashu View Post
The ECC errors did not make the system kernel panic or freeze when they happened, and we can find many ECC error entries in /var/log/messages.
You can find many ECC errors in the log, and you say that it is not a memory problem? Run memtest86 and let it check the memory; your memory is probably under warranty and can be replaced. Download a memtest86 ISO from http://www.memtest86.com/, burn it to a CD, and then run the memory test offline. This is the best advice that I can give you. Also, there is a program called mcelog that checks and reports on hardware and memory errors.
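
To get a feel for how many ECC events each NUMA node has logged before pulling DIMMs, the messages can be tallied. This is just a sketch that assumes the log lines look like the "Northbridge Error (node N): DRAM ECC error detected" lines posted earlier in this thread; count_ecc_by_node is a made-up name.

```shell
#!/bin/sh
# Count "DRAM ECC error detected" messages per northbridge node.
# Reads syslog text on stdin and prints one "node N: count" line per node.
count_ecc_by_node() {
    grep 'DRAM ECC error detected' |
        sed -n 's/.*Northbridge Error (node \([0-9][0-9]*\)).*/\1/p' |
        sort -n | uniq -c |
        awk '{ print "node " $2 ": " $1 }'
}

# Example usage on a compute node:
#   count_ecc_by_node < /var/log/messages
# The EDAC driver may also expose running totals in files such as
# /sys/devices/system/edac/mc/mc0/ce_count (assumed present when the
# amd64_edac module is loaded).
```

A node whose count keeps climbing points at the DIMMs attached to that CPU, which narrows down what to test with memtest86 first.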
 
Old 07-21-2012, 09:43 AM   #14
dumashu
LQ Newbie
 
Registered: Jul 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by rch View Post
You can find many ECC errors in the log, and you say that it is not a memory problem? Run memtest86 and let it check the memory; your memory is probably under warranty and can be replaced. Download a memtest86 ISO from http://www.memtest86.com/, burn it to a CD, and then run the memory test offline. This is the best advice that I can give you. Also, there is a program called mcelog that checks and reports on hardware and memory errors.
Hi rch, thanks for your advice. We will check the memory and motherboard for more analysis.
 
  

