This is an Oracle DB server that was recently under heavy load due to a refresh of some DB data. While this refresh was going on, the system behaved erratically. While perusing the kernel logs I came across this, which shows up on different dates over the last six months:
Feb 16 06:23:21 prod-01 kernel: Uhhuh. NMI received. Dazed and confused, but trying to continue
Feb 16 06:23:21 prod-01 kernel: You probably have a hardware problem with your RAM chips
After the DB refresh was complete, the server did not recover adequately. I took the server down to reseat the RAM, and when it came back the server behaved fine; however, I am still getting the NMI errors. (I have also swapped the RAM with RAM from the stage system; the NMI errors persist.) Memtest86 ran for 24 hours with no errors.
I wonder if you folks can shed some light on this?
Aren't hardware diagnostics wonderful? My quick Google search of NMI "dazed and confused" turned up several possibly relevant threads dating back some time. The thrust of these seems to be that, under normal operation, an NMI (non-maskable interrupt) is not supposed to happen. When it does, the most likely cause, in the opinion of whoever wrote this code, is that the main memory controller has detected a parity or ECC error.
Naturally, that is not the only source of such interrupts. In fact, in lots of machines today, memory data is not even checked. (If your machine has x64 memory, it does not check parity; if it has x72 memory, it probably has ECC checking.) One thread indicated that a bug in an ethernet driver caused this message to appear. Without knowing a great deal more about your hardware and what versions of everything you are running, it is pretty hard to be specific about the cause of the problem in your case. But there is quite a bit of info on the 'Net about how to eliminate some of the possibilities; if you can narrow it down more, I'm pretty sure someone reading this list will have a suggestion about further experiments that you could do.
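One cheap way to start narrowing things down is to tally on which days the NMI message appears, so the dates can be compared against known load events such as the DB refresh. The following is only a rough sketch: it assumes classic syslog-style lines like the two quoted in the original post, and the function names and the default path `/var/log/messages` are illustrative assumptions, not anything the poster mentioned.

```python
import re

# Matches syslog-style kernel lines such as:
#   Feb 16 06:23:21 prod-01 kernel: Uhhuh. NMI received. Dazed and confused, ...
# The exact wording after "NMI received" varies between kernel versions,
# so only that fragment is matched.
NMI_LINE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:.*NMI received"
)


def count_nmi_by_day(lines):
    """Return a dict mapping (month, day) to the number of NMI messages."""
    counts = {}
    for line in lines:
        match = NMI_LINE.match(line)
        if match:
            key = (match.group(1), int(match.group(2)))
            counts[key] = counts.get(key, 0) + 1
    return counts


def scan_log(path="/var/log/messages"):
    """Scan one log file; the default path is an assumption, adjust as needed."""
    with open(path, errors="replace") as handle:
        return count_nmi_by_day(handle)


if __name__ == "__main__":
    for (month, day), hits in scan_log().items():
        print(f"{month} {day:2d}: {hits} NMI message(s)")
```

If the dates line up with heavy I/O or network activity rather than being spread at random, that would at least hint at something other than the RAM itself.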
Of course, since this is a production server, I guess the very first thing you will have to do is replace the whole machine with a spare, and then start running experiments on the suspect unit once it is out of the critical path.
The kernel prints this NMI message when the hardware reports some sort of error, most often in main memory. You should check the hardware for faulty components. Running memtest would probably reveal the fault, but I have seen cases where memtest found nothing and the NMI errors still persisted.
I have this problem with my servers too; the OS is Linux. I ran all the tests, including memory tests with a number of loops, and none of them reported a hardware error. I also checked the Linux logs and could not find a corresponding hardware error there. It is not a firmware issue either, since I have already updated the firmware.
I don't know how to prove that this error is related to a hardware fault, because the vendor requires logs showing the failure before replacing parts.
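One possible source of evidence, if the kernel has an EDAC driver loaded for the memory controller, is the per-controller error counters that EDAC exposes under `/sys/devices/system/edac/mc/`. The sketch below just reads the `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files; the function name is made up for illustration, and on a system without EDAC support the directory will simply be missing and the result empty.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    An empty dict means no EDAC memory controllers were found, which says
    nothing about the health of the RAM; it only means no driver is reporting.
    """
    counts = {}
    root = Path(base)
    if not root.is_dir():
        return counts
    for controller in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((controller / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[controller.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found (driver not loaded?)")
    for controller, entry in found.items():
        print(controller, entry)
```

Non-zero counters, recorded together with the dates of the NMI messages, may be the kind of log a vendor will accept; whether they do is up to the vendor.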
No, it is not simply a false alarm. It is a serious issue, because the same system runs fine under Windows with no other health problems, yet under Linux it reports health issues. There appears to be a hardware problem related to the RAM, but I am not able to prove it at the hardware level.