LinuxQuestions.org
Old 01-09-2011, 04:52 PM   #1
olego
Member
 
Registered: Sep 2008
Location: Kaliningrad, Russia
Distribution: Slackware
Posts: 35

Rep: Reputation: 1
mcelog: HARDWARE ERROR. This is *NOT* a software problem!


Hello, guys!

I have my desktop with following hardware:
1. MB: ASUS P5QL SE/EPU
2. RAM: 2 x 2GB Corsair PC2-8500 (1066 MHz)
3. CPU: Intel Dual-Core E6500
4. GPU: nVidia GeForce 9400 GT with binary driver
5. Net: D-Link DWA-520 with madwifi driver

This machine has been running for more than two years without any problems, but during the last two or three months I have been getting hard hangs once or twice a week. It runs 32 bit Slackware-current with a custom-compiled vanilla kernel plus two additional patches, BFS and TuxOnIce. The hangs usually occur when there is no user activity: only rtorrent is running and two KDE4 sessions are open (with firefox, okular, claws-mail, goldendict, virtualbox and other memory-hungry apps).

Here is an excerpt from my syslog:
Quote:
Jan 6 20:47:13 oleg2 mcelog: failed to prefill DIMM database from DMI data
Jan 6 20:47:13 oleg2 mcelog: Kernel does not support page offline interface
Jan 6 20:47:13 oleg2 mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Jan 6 20:47:13 oleg2 mcelog: Please contact your hardware vendor
Jan 6 20:47:13 oleg2 mcelog: MCE 0
Jan 6 20:47:13 oleg2 mcelog: CPU 0 BANK 0
Jan 6 20:47:13 oleg2 mcelog: TIME 1294339633 Thu Jan 6 20:47:13 2011
Jan 6 20:47:13 oleg2 mcelog: MCG status:
Jan 6 20:47:13 oleg2 mcelog: MCi status:
Jan 6 20:47:13 oleg2 mcelog: Error overflow
Jan 6 20:47:13 oleg2 mcelog: Uncorrected error
Jan 6 20:47:13 oleg2 mcelog: Error enabled
Jan 6 20:47:13 oleg2 mcelog: Processor context corrupt
Jan 6 20:47:13 oleg2 mcelog: MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-timeout Error
Jan 6 20:47:13 oleg2 mcelog: BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
Jan 6 20:47:13 oleg2 mcelog: timeout BINIT (ROB timeout). No micro-instruction retired for some time
Jan 6 20:47:13 oleg2 mcelog: failure that caused IERR
Jan 6 20:47:13 oleg2 mcelog: STATUS f200084000000800 MCGSTATUS 0
Jan 6 20:47:13 oleg2 mcelog: MCGCAP 806 APICID 0 SOCKETID 0
Jan 6 20:47:13 oleg2 mcelog: CPUID Vendor Intel Family 6 Model 23
Jan 6 20:47:13 oleg2 mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Jan 6 20:47:13 oleg2 mcelog: Please contact your hardware vendor
Jan 6 20:47:13 oleg2 mcelog: MCE 1
Jan 6 20:47:13 oleg2 mcelog: CPU 0 BANK 5
Jan 6 20:47:13 oleg2 mcelog: TIME 1294339633 Thu Jan 6 20:47:13 2011
Jan 6 20:47:13 oleg2 mcelog: MCG status:
Jan 6 20:47:13 oleg2 mcelog: MCi status:
Jan 6 20:47:13 oleg2 mcelog: Error overflow
Jan 6 20:47:13 oleg2 mcelog: Uncorrected error
Jan 6 20:47:13 oleg2 mcelog: Error enabled
Jan 6 20:47:13 oleg2 mcelog: Processor context corrupt
Jan 6 20:47:13 oleg2 mcelog: MCA: BUS Level-3 Generic Generic Other-transaction Request-timeout Error
Jan 6 20:47:13 oleg2 mcelog: BQ_DCU_READ_TYPE BQ_ERR_AERR2_TYPE BQ_ERR_AERR2_TYPE
Jan 6 20:47:13 oleg2 mcelog: received parity error on response transaction
Jan 6 20:47:13 oleg2 mcelog: MCE driven MCE is observed
Jan 6 20:47:13 oleg2 mcelog: STATUS f200001034000e0f MCGSTATUS 0
Jan 6 20:47:13 oleg2 mcelog: MCGCAP 806 APICID 0 SOCKETID 0
Jan 6 20:47:13 oleg2 mcelog: CPUID Vendor Intel Family 6 Model 23
Jan 6 20:47:13 oleg2 mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Jan 6 20:47:13 oleg2 mcelog: Please contact your hardware vendor
Jan 6 20:47:13 oleg2 mcelog: MCE 2
Jan 6 20:47:13 oleg2 mcelog: CPU 1 BANK 5
Jan 6 20:47:13 oleg2 mcelog: TIME 1294339633 Thu Jan 6 20:47:13 2011
Jan 6 20:47:13 oleg2 mcelog: MCG status:
Jan 6 20:47:13 oleg2 mcelog: MCi status:
Jan 6 20:47:13 oleg2 mcelog: Error overflow
Jan 6 20:47:13 oleg2 mcelog: Uncorrected error
Jan 6 20:47:13 oleg2 mcelog: Error enabled
Jan 6 20:47:13 oleg2 mcelog: Processor context corrupt
Jan 6 20:47:13 oleg2 mcelog: MCA: BUS Level-3 Generic Generic Other-transaction Request-timeout Error
Jan 6 20:47:13 oleg2 mcelog: BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
Jan 6 20:47:13 oleg2 mcelog: received parity error on response transaction
Jan 6 20:47:13 oleg2 mcelog: MCE driven
Jan 6 20:47:13 oleg2 mcelog: STATUS f200001010000e0f MCGSTATUS 0
Jan 6 20:47:13 oleg2 mcelog: MCGCAP 806 APICID 1 SOCKETID 0
Jan 6 20:47:13 oleg2 mcelog: CPUID Vendor Intel Family 6 Model 23
I would just like a clue: what should I replace first, the memory or the CPU? The price is almost the same, around a hundred bucks. I ran memtest86 3.5a and got 1983 memory errors, but I have had several false positives with earlier versions of memtest86, so I don't trust it 100%. My plan is to update the BIOS firmware, move to the new kernel (I'm waiting for 2.6.37), and only then replace the RAM.
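The STATUS values in the log can also be decoded by hand, which helps when comparing the three records. Below is a minimal sketch, assuming the architectural layout of the IA32_MCi_STATUS register described in the Intel SDM (flag bits 63 down to 57, MCA error code in bits 15..0, model-specific code in bits 31..16). The names `STATUS_FLAGS` and `decode_status` are made up for illustration and are not part of mcelog.

```python
# Architectural flag bits of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # error overflow: an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register holds valid data
    (58, "ADDRV"),  # ADDR register holds a valid address
    (57, "PCC"),    # processor context corrupt
]


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and raw code fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,            # bits 15..0
        "model_code": (status >> 16) & 0xFFFF,  # bits 31..16
    }


# The three STATUS values quoted in the syslog excerpt above.
for raw in ("f200084000000800", "f200001034000e0f", "f200001010000e0f"):
    decoded = decode_status(int(raw, 16))
    print(raw, decoded["flags"],
          hex(decoded["mca_code"]), hex(decoded["model_code"]))
```

All three values carry the same flag set (VAL, OVER, UC, EN, PCC), which matches the "Error overflow / Uncorrected error / Error enabled / Processor context corrupt" lines that mcelog printed. ADDRV is clear in each record, so the log gives no failing address to narrow things down. The sketch deliberately leaves the MCA error code as a raw number rather than trying to reproduce mcelog's text decoding.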
 
Old 01-09-2011, 05:51 PM   #2
stress_junkie
Senior Member
 
Registered: Dec 2005
Location: Massachusetts, USA
Distribution: Ubuntu 10.04 and CentOS 5.5
Posts: 3,873

Rep: Reputation: 332
Corsair doesn't list that exact model of motherboard in its memory configurator.
http://www.corsair.com/learn_n_explore/

I couldn't quickly find specs on the motherboard at the Asus web site. I would guess that the memory doesn't match the memory controller on the motherboard in some way. I don't know in what way. The memory specs are fast enough that the speed should not be a problem. If the memory sticks are correctly seated then it doesn't make sense to me.

If I were you I would go to the Corsair web site and see if they have some kind of warning about using their memory on that motherboard.
 
Old 01-09-2011, 07:36 PM   #3
rolf
Member
 
Registered: Jul 2001
Location: Oakland, CA
Distribution: Mandriva 2010.2 x86_64
Posts: 186

Rep: Reputation: 33
Just grasping at straws but further to the Corsair factor, I've got
  1. Asus P5Q Deluxe
  2. Evga/nvidia 9500 GT
  3. Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
  4. Corsair CM2X1024-8500C5D, PC2 1066MHz, 2x1G
I had freezes, went to the Corsair website forums, found a lot of complaints about this memory. I've since changed to some ADATA but, if memory serves, the SPD of the Corsair modules was not 1066MHz but I set it in BIOS to that speed. I think it is being overhyped as a sales tactic. Another user error, in my case, was to have the memory voltage set lower than these modules are spec'ed at, so that might be something to look at. Good luck.
 
Old 01-10-2011, 07:39 AM   #4
olego
Member
 
Registered: Sep 2008
Location: Kaliningrad, Russia
Distribution: Slackware
Posts: 35

Original Poster
Rep: Reputation: 1
ASUS web-site - http://www.asus.com/product.aspx?P_ID=7vTrhQ6JvyI0MitJ
The exact model is ASUS P5QL/EPU

ASUS claims that this board supports 4 x DIMM, max. 16 GB, DDR2 1066/800/667 memory, with dual channel memory architecture.

Anyway, I updated the BIOS to the latest version, 0408. No effect: I still get errors in memtest86.
Secondly, I changed the memory type in the BIOS to DDR2-800. No effect either: the errors are still there.

Here are the errors I got from memtest86: http://s2.itrash.ru/idb/1502/oP1106974ab.JPG

BTW, memtest86 detected my chipset incorrectly: this board has the Intel P43 chipset, not P45.

I believe Corsair memory is rather good. I could try buying Goodram PC2-6400 (800 MHz), but I think it's worse than Corsair.
 
Old 01-10-2011, 09:48 AM   #5
rolf
Member
 
Registered: Jul 2001
Location: Oakland, CA
Distribution: Mandriva 2010.2 x86_64
Posts: 186

Rep: Reputation: 33
I, also, have consistently bought Corsair because of its good reputation. Part of that is the lifetime warranty. http://forum.corsair.com/forums/forumdisplay.php?f=145
I've followed the Corsair RMA procedure a couple of times to replace failing memory, fairly painlessly.
 
Old 01-10-2011, 02:49 PM   #6
onebuck
Moderator
 
Registered: Jan 2005
Location: Midwest USA, Central Illinois
Distribution: Slackware®
Posts: 11,649
Blog Entries: 10

Rep: Reputation: 1576
Hi,

Check it with;
Quote:
memtest86+ <- 'memory tester which is based on memtest86 v3.0, and provides an up-to-date version of this useful tool, which aims to be as reliable as the original. It has been fixed to work on AMD64 systems, and also properly detects all current CPUs and motherboard chipsets. The project supports ECC polling for AMD64, i875P, and E7205, and displays some useful settings for the most popular chipsets'
 
Old 01-10-2011, 04:25 PM   #7
olego
Member
 
Registered: Sep 2008
Location: Kaliningrad, Russia
Distribution: Slackware
Posts: 35

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by onebuck
Hi,

Check it with memtest86+
OK, I built it and started the test 5 minutes ago. I'll let you know the results.
The test told me that I have 2 Corsair CM2X2048-8500C5D modules (frankly, I had forgotten the exact memory model). This model isn't listed on Corsair's web site, but the same model with -6400C5 is! From googling, the correct kit name is TWIN2X4096-8500C5D.

memtest shows my timings as 5-5-5-18, but I found that they should be 5-5-5-15-2T.

UPD: 2 passes of memtest86+ finished with NO errors at all. Very strange behaviour.

Last edited by olego; 01-11-2011 at 05:09 AM.
 
Old 01-13-2011, 01:09 AM   #8
olego
Member
 
Registered: Sep 2008
Location: Kaliningrad, Russia
Distribution: Slackware
Posts: 35

Original Poster
Rep: Reputation: 1
Hello,

I opened the case of my computer, just to be sure that my memory is 1066 MHz. Yes, the exact model is Corsair XMS2; on the module itself I found the following label: "CM2X2048-8500-C5D 5-5-5-15 2.1V". In the BIOS I set the memory voltage to 2.1V and set the correct timings, 5-5-5-15. I got exactly the same results: memtest86+ gives me NO errors at all, but memtest86 3.5a gives me a lot of errors in the same test #7 (random number sequence). All other tests give me NO errors. So my question is: is this memory bad or good? Which test is more reliable, memtest86 or memtest86+?
 
Old 01-13-2011, 01:28 AM   #9
EDDY1
LQ Addict
 
Registered: Mar 2010
Location: Oakland,Ca
Distribution: wins7, Debian wheezy
Posts: 6,495

Rep: Reputation: 614
Newbie question,
Isn't system 64 bit?
Doesn't 32 bit os only recognize 3 Gig or is it just that 64 bit os doesn't show benefit until 4 Gigs ram?
 
Old 01-13-2011, 01:54 AM   #10
olego
Member
 
Registered: Sep 2008
Location: Kaliningrad, Russia
Distribution: Slackware
Posts: 35

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by EDDY1
Isn't system 64 bit?
It isn't. This particular machine has 32 bit Linux

Quote:
Originally Posted by EDDY1
Doesn't 32 bit os only recognize 3 Gig or is it just that 64 bit os doesn't show benefit until 4 Gigs ram?
No, it doesn't. A 32 bit OS can see the whole memory through Physical Address Extension (PAE). This is slower than native 64 bit, but at the moment I'm reluctant to upgrade this system to 64 bit. My new laptop has 4GB of RAM and 64 bit Linux (multilib) on it, and I have no problems with it. There are only three 32 bit apps on my system: OpenOffice, skype and Visual Slickedit. All the others are native 64 bit apps. It works great, so I think there is no real reason to keep a 32 bit OS, because 64 bit OSes are pretty stable and fast now.
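To put numbers on the PAE point, here is a small sketch, assuming the classic case where PAE widens physical addresses from 32 to 36 bits (newer CPUs may support more). Each process still gets a 32 bit virtual address space; only the amount of physical RAM the kernel can reach changes. The helper names and the sample `flags` line are hypothetical; on a real machine the text would come from /proc/cpuinfo.

```python
def addressable_gib(address_bits: int) -> int:
    """GiB of physical memory reachable with the given number of address bits."""
    return (1 << address_bits) // (1 << 30)


def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears in the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return flag in value.split()
    return False


# Hypothetical sample, shaped like one line of /proc/cpuinfo.
sample = "flags\t\t: fpu vme de pse tsc msr pae mce cx8 apic"

print("32 bit physical addressing:", addressable_gib(32), "GiB")
print("36 bit PAE addressing:", addressable_gib(36), "GiB")
print("CPU advertises PAE:", has_cpu_flag(sample, "pae"))
```

The kernel also has to be built with PAE (high memory) support for the extra range to be used; having the CPU flag alone is not enough.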
 
Old 01-13-2011, 02:30 AM   #11
EDDY1
LQ Addict
 
Registered: Mar 2010
Location: Oakland,Ca
Distribution: wins7, Debian wheezy
Posts: 6,495

Rep: Reputation: 614
Sorry, like I said, newbie question.
But I thought it might be the problem.
 
Old 01-17-2011, 03:12 AM   #12
olego
Member
 
Registered: Sep 2008
Location: Kaliningrad, Russia
Distribution: Slackware
Posts: 35

Original Poster
Rep: Reputation: 1
On the Corsair forum I was advised to use memtest86+ and not to worry about the errors produced by memtest86.

So I raised the DRAM voltage from 1.8V to 2.1V and set the timings manually according to the datasheet (5-5-5-15 instead of 5-5-5-18), and so far so good: about two weeks without hangs.
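For comparing the two timing sets, the cycle counts can be converted to nanoseconds. This is a rough sketch, assuming the usual reading of "5-5-5-15" as CL-tRCD-tRP-tRAS and that a DDR2-N module runs its I/O clock at N/2 MHz (two transfers per clock). The function names are invented for this example, and nothing here shows which of the two changes (voltage or timings) actually stopped the hangs.

```python
def cycles_to_ns(cycles: int, data_rate_mts: float) -> float:
    """Convert a timing given in clock cycles to nanoseconds.

    DDR memory transfers twice per clock, so the clock in MHz is
    data_rate / 2 and one cycle lasts 2000 / data_rate nanoseconds.
    """
    return cycles * 2000.0 / data_rate_mts


def timings_in_ns(timings, data_rate_mts):
    """Map a (CL, tRCD, tRP, tRAS) tuple of cycle counts to nanoseconds."""
    names = ("CL", "tRCD", "tRP", "tRAS")
    return {name: round(cycles_to_ns(c, data_rate_mts), 2)
            for name, c in zip(names, timings)}


# Label timings vs. what memtest reported, both at DDR2-1066.
print("label  5-5-5-15:", timings_in_ns((5, 5, 5, 15), 1066))
print("auto   5-5-5-18:", timings_in_ns((5, 5, 5, 18), 1066))
# The same label timings at the DDR2-800 setting tried earlier.
print("DDR2-800 5-5-5-15:", timings_in_ns((5, 5, 5, 15), 800))
```

Only tRAS differs between the two sets, so the first three values come out identical at the same speed.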
 
  


All times are GMT -5. The time now is 05:39 PM.
