LinuxQuestions.org
02-18-2008, 02:17 PM   #1
hphinizy
LQ Newbie
 
Registered: Jun 2007
Posts: 5

Rep: Reputation: 0
NMI received + "Problem with RAM chips"


This is an Oracle DB server that was recently under heavy load due to a refresh of some DB data. While this refresh was going on, the system behaved erratically. While perusing the kernel logs I came across the following, which shows up on different dates over the last six months:

Feb 16 06:23:21 prod-01 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
Feb 16 06:23:21 prod-01 kernel: You probably have a hardware problem with your RAM chips
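
For anyone who wants to tally how often these reports appear, here is a minimal sketch in C. It assumes classic syslog formatting (a 6-character "Mon DD" date prefix, as in the two lines above). The function names are made up for illustration, and the log file you feed it (for example /var/log/messages) is an assumption about your setup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if a syslog line looks like the kernel's NMI report, else 0. */
int is_nmi_line(const char *line)
{
    assert(line != NULL);
    return strstr(line, "kernel:") != NULL &&
           strstr(line, "NMI received") != NULL;
}

/* Copies the leading "Mon DD" date of a classic syslog line into out.
 * Returns 0 on success, -1 if the line is too short to hold a date. */
int syslog_date(const char *line, char out[7])
{
    assert(line != NULL && out != NULL);
    if (strlen(line) < 6)
        return -1;
    memcpy(out, line, 6);
    out[6] = '\0';
    return 0;
}

/* Reads lines from in, writes the date of every NMI report to report,
 * and returns how many reports were seen. */
int count_nmi_lines(FILE *in, FILE *report)
{
    char line[1024];
    char date[7];
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (is_nmi_line(line) && syslog_date(line, date) == 0) {
            fprintf(report, "%s\n", date);
            count++;
        }
    }
    return count;
}
```

Calling count_nmi_lines(stdin, stdout) from a small main() and redirecting the log file into it prints one date per NMI report and returns the total. Only the "NMI received" line is counted, so the follow-up "RAM chips" line does not double the tally.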

After the DB refresh was complete, the server did not recover adequately. I took the server down to reseat the RAM, and when it came back up it behaved fine; however, I am still getting the NMI errors. (I have also swapped the RAM with RAM from the staging system; the NMI errors persist.) Memtest86 ran for 24 hours with no errors.

I wonder if you folks can shed some light on this?

Thanks,

H
 
02-19-2008, 12:05 AM   #2
dkm999
Member
 
Registered: Nov 2006
Location: Seattle, WA
Distribution: Fedora
Posts: 407

Rep: Reputation: 35
Aren't hardware diagnostics wonderful? A quick Google search for NMI "dazed and confused" turned up several possibly relevant threads, some dating back a while. The thrust of these seems to be that, under normal operation, an NMI (non-maskable interrupt) is not supposed to happen. When it does, the most likely cause, in the opinion of whoever wrote this code, is that a parity or ECC error has been detected by the main memory controller.

Naturally, that is not the only source of such interrupts. In fact, in lots of machines today, memory data is not even checked. (If your machine has x64 memory, there are no spare bits and parity is not checked; if it has x72 memory, the extra 8 bits per 64-bit word are probably used for ECC checking.) One thread indicated that a bug in an Ethernet driver caused this message to appear. Without knowing a great deal more about your hardware and what versions of everything you are running, it is pretty hard to be specific about the cause of this problem in your case. But there is quite a bit of info on the 'Net about how to eliminate some of the possibilities; if you can narrow it down more, I'm pretty sure someone reading this list will have a suggestion about further experiments that you could do.
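
To illustrate what 8 extra bits per 64-bit word can buy you, here is a rough sketch in C of the simplest scheme, one even-parity bit per data byte. This is only an illustration: real ECC controllers use a stronger code that can also correct single-bit errors, and all the names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Parity of one byte: 1 if the byte has an odd number of set bits. */
static uint8_t byte_parity(uint8_t b)
{
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1u;
}

/* One parity bit per data byte: 64 data bits -> 8 check bits (64 + 8 = 72). */
uint8_t check_bits(uint64_t word)
{
    uint8_t bits = 0;

    for (int i = 0; i < 8; i++) {
        uint8_t byte = (uint8_t)(word >> (8 * i));
        bits |= (uint8_t)(byte_parity(byte) << i);
    }
    return bits;
}

/* Mask of byte lanes whose stored check bit no longer matches the data.
 * Zero means no error was detected (not that none occurred). */
uint8_t failed_lanes(uint64_t word, uint8_t stored_bits)
{
    return check_bits(word) ^ stored_bits;
}

/* Usage example: flip one bit and confirm the affected lane is flagged. */
void parity_demo(void)
{
    uint64_t data = 0x0123456789abcdefULL;
    uint8_t stored = check_bits(data);

    assert(failed_lanes(data, stored) == 0);

    uint64_t corrupted = data ^ (1ULL << 17);  /* bit 17 is in byte lane 2 */
    assert(failed_lanes(corrupted, stored) == (1u << 2));
}
```

Note that two flipped bits in the same byte cancel out and go unnoticed with plain parity, which is one reason a clean memory test does not prove the hardware is healthy.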

Of course, since this is a production server, I guess the very first thing you are going to have to do is replace the whole machine with a spare, and then start running experiments on the suspect unit after it is out of the critical path.

Good luck.
 
02-19-2008, 10:37 AM   #3
slacksite
LQ Newbie
 
Registered: Feb 2008
Posts: 12

Rep: Reputation: 0
Looking at the kernel source confirms that this message comes from the memory parity error handler:

static __kprobes void
mem_parity_error(unsigned char reason, struct pt_regs *regs)
{
	printk(KERN_EMERG "Uhhuh. NMI received for unknown reason %02x.\n",
	       reason);
	printk(KERN_EMERG "You have some hardware problem, likely on the PCI bus.\n");

#if defined(CONFIG_EDAC)
	if (edac_handler_set()) {
		edac_atomic_assert_error();
		return;
	}
#endif

	if (panic_on_unrecovered_nmi)
		panic("NMI: Not continuing");

	printk(KERN_EMERG "Dazed and confused, but trying to continue\n");

	/* Clear and disable the memory parity error line. */
	reason = (reason & 0xf) | 4;
	outb(reason, 0x61);
}
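
As a rough illustration of the last two statements, the sketch below reproduces the arithmetic in user space without touching any hardware. The bit meanings (0x80 for a memory parity report, 0x40 for an I/O check report, bit 2 to clear and disable the parity line) follow the usual PC convention for port 0x61 as far as I know; the helper names are hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the byte read from I/O port 0x61 on PC hardware. */
#define NMI_REASON_MEM_PARITY 0x80u  /* bit 7: memory parity error reported */
#define NMI_REASON_IOCHK      0x40u  /* bit 6: I/O check error reported */
#define NMI_ACK_KEEP_MASK     0x0fu  /* low nibble holds the control bits */
#define NMI_ACK_MEM_PARITY    0x04u  /* bit 2: clear and disable parity line */

enum nmi_source {
    NMI_SRC_UNKNOWN,
    NMI_SRC_MEM_PARITY,
    NMI_SRC_IOCHK
};

/* Hypothetical helper: classify a reason byte by its status bits. */
enum nmi_source nmi_source_of(uint8_t reason)
{
    if (reason & NMI_REASON_MEM_PARITY)
        return NMI_SRC_MEM_PARITY;
    if (reason & NMI_REASON_IOCHK)
        return NMI_SRC_IOCHK;
    return NMI_SRC_UNKNOWN;
}

/* Same arithmetic as "reason = (reason & 0xf) | 4": drop the read-only
 * status bits in the high nibble, keep the control bits, set bit 2.
 * The result is the value the handler writes back to port 0x61. */
uint8_t nmi_parity_ack(uint8_t reason)
{
    uint8_t value = (uint8_t)((reason & NMI_ACK_KEEP_MASK) | NMI_ACK_MEM_PARITY);

    assert((value & 0xf0u) == 0);  /* no status bits are ever written back */
    return value;
}
```

For example, a reason byte of 0xb1 classifies as a memory parity report, and nmi_parity_ack(0xb1) yields 0x05. Because the handler disables the parity line after the first report, later errors of the same kind may not raise another NMI until the line is re-enabled.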
 
09-24-2008, 02:35 AM   #4
vimal
Red Hat India
 
Registered: Nov 2004
Location: Kerala/Pune,india
Distribution: RedHat, Fedora
Posts: 260

Rep: Reputation: 34
Hello,

An NMI is raised when there is some sort of error with the hardware, most often the main memory. Check the hardware for faulty components. Running memtest would probably reveal the fault, but I have seen cases where memtest found nothing and the NMI errors persisted.

Vimal
 
03-18-2009, 02:44 AM   #5
gsnravi
LQ Newbie
 
Registered: Feb 2009
Location: hyderabad
Posts: 5

Rep: Reputation: 0
Hi everyone,

I have this problem on my servers, which run Linux. I ran all the tests, including memory tests for a number of loops, and none of them showed a hardware error. I also checked the Linux logs and was unable to find the error there. It does not seem to be a firmware issue either, since I updated the firmware as well.

I don't know how to prove that this error is a hardware issue, and the vendor requires failure logs before replacing parts.

If anyone knows the solution, please share it.

Ravi
 
03-18-2009, 04:00 AM   #6
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 61
I also have this log message on an old Dell server, and I can't relate any problem I ever had to this log… Is it simply “fake”?

Yves.
 
03-19-2009, 01:37 AM   #7
gsnravi
LQ Newbie
 
Registered: Feb 2009
Location: hyderabad
Posts: 5

Rep: Reputation: 0
No, it is not simply fake. It is a serious issue: the same system works fine under Windows with no other health issues, but under Linux it reports health issues. There is some hardware issue related to the RAM, but it cannot be proven at the hardware level.

Ravi
 
  


All times are GMT -5.
