RHEL 5.4 (Tikanga) on "HP Proliant 380 G6" down, "kernel: Uhhuh. NMI received. ...."

pkhera_2001 · 08-23-2010, 01:39 AM

Hi!

We are facing an issue with an HP ProLiant 380 G6 server with the following configuration:
2 x Intel Xeon Quad Core 2.40 GHz (E5530)
16 GB RAM, 1333 MHz
SAS Storage
Smart Array 4i Plus Controller.

OS Details:
RHEL 5.4 Tikanga
Kernel : 2.6.18-164.el5 SMP
gcc 4.1.2 20080704

We have set up JBoss with a few internal applications on this server. These applications are configured to use very few resources (2 GB of RAM). An Apache server is also running on it, again for internal use only, so there is very little traffic and load on the server.

Problem:
The server goes into an unresponsive (stalled) state with the following error on the attached console:
kernel: Uhhuh. NMI received for unknown reason b0,
kernel: You probably have a hardware problem with your RAM chips
Dazed and confused, but trying to continue

Once this error appears on the locally attached console, the server becomes unresponsive: it still replies to ping requests but does not allow users to log in.

I ran diagnostics on the server as suggested by HP support, but they could not find anything faulty, and HP support suggested consulting the OS vendor (Red Hat in my case).

Has anybody else on this forum encountered such an issue? Any input and help will be highly appreciated.

Warm Regards,
Parveen Khera

born4linux · 08-23-2010, 02:14 AM

Before you try to call RH, apply updates on the server.

TB0ne · 08-23-2010, 09:39 AM

Did you pay attention to the error???

Quote:

Originally Posted by pkhera_2001

kernel: You probably have a hardware problem with your RAM chips

HP diagnostics run basic tests and don't extensively test your RAM chips. As was suggested, apply any updates (since you're paying for RHEL, that's easily done through the Red Hat Network). And did you consult with Red Hat, as was suggested?

After that, try replacing the RAM chips, or running extensive memory tests. The error and the conditions that cause it are very clear.

pkhera_2001 · 08-23-2010, 11:31 AM

Hi, thanks for your replies.

I already ran thorough diagnostics, which took more than 8 hours, and submitted the report to HP. HP Support still said that there is nothing wrong with the hardware/RAM and suggested contacting Red Hat Support.

I have now updated the OS through RHN as suggested above. It is running with the new updates but still on the older kernel (2.6.18-164.el5 SMP); the newly installed kernel is 2.6.18-196.el5 SMP. I plan to boot the server with the new kernel if the box goes down again with the same error.

If the machine still goes down after the kernel update, I'll contact Red Hat support and follow up with them as well.

Thanks for all your input. I am still looking for guidance from the experts.

Warm Regards,
Parveen Khera

pkhera_2001 · 08-25-2010, 01:18 AM

Hi!

The server went down again yesterday and I booted it with the new kernel; it has been up for about 11 hours 30 minutes since then.

Let us see whether it goes down again; if it does, I'll try to reach RH.

While troubleshooting, I found that the load average in the top output was very high (18 during the last outage, whereas the server has only 8 cores). RAM usage was also higher than normal, but that should not bring the system down, as 6 GB of memory was still free. There were also 4 zombie processes.

I would appreciate input on how to monitor the server's current state; at the moment I am monitoring it with a shell script that uses the top utility.
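As a rough illustration of the kind of monitoring described here, the sketch below records the three things mentioned in this post: load relative to core count, free memory, and zombie processes. It assumes Python 3 on a Linux host with /proc mounted (RHEL 5.4 itself ships a much older Python, so this would need adapting there), and the function names are made up for this example.

```python
import os


def parse_meminfo(text):
    """Return numeric /proc/meminfo fields as a dict (values as listed, usually kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def load_per_core(load_1min, cores):
    """Load average divided by core count; sustained values above 1.0 mean queued work."""
    return load_1min / cores


def count_zombies(proc_root="/proc"):
    """Count processes whose state field in <proc_root>/<pid>/stat is 'Z'."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                stat = handle.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The state is the first field after the closing parenthesis of the command name.
        after_name = stat.rsplit(")", 1)
        if len(after_name) == 2 and after_name[1].split()[:1] == ["Z"]:
            zombies += 1
    return zombies


def snapshot():
    """Collect one monitoring sample from the running system."""
    load_1min = os.getloadavg()[0]
    cores = os.cpu_count() or 1
    with open("/proc/meminfo") as handle:
        mem = parse_meminfo(handle.read())
    return {
        "load_1min": load_1min,
        "load_per_core": load_per_core(load_1min, cores),
        "mem_free_kb": mem.get("MemFree"),
        "zombies": count_zombies(),
    }
```

With the figures from this post, `load_per_core(18, 8)` gives 2.25, i.e. more than twice as much runnable work as the 8 cores can service. Appending `snapshot()` output to a file every minute from cron would leave a record of the last state before a hang.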

Can anybody tell me whether taking a kernel dump would help, and how to set one up? Any online reference or how-to link would be welcome.
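On RHEL 5 a kernel dump is normally captured with kdump, which only works if memory for the capture kernel was reserved at boot with a `crashkernel=` parameter on the kernel command line. The sketch below checks for that parameter. The helper names are invented for this example, and the `crashkernel=128M@16M` value in the usage note is the one commonly shown in RHEL 5 material; treat it as an assumption and confirm the right size for this machine with Red Hat's documentation.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def kdump_memory_reserved(cmdline):
    """True if the command line reserves memory for a kdump capture kernel."""
    return bool(parse_cmdline(cmdline).get("crashkernel"))


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel."""
    with open(path) as handle:
        return handle.read().strip()
```

For example, `kdump_memory_reserved("ro root=/dev/VolGroup00/LogVol00 rhgb quiet crashkernel=128M@16M")` is true, while the same line without the `crashkernel=` token is not. If `kdump_memory_reserved(read_running_cmdline())` is false, the kdump service has no reserved memory to boot the capture kernel into, so a crash would leave no vmcore behind.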

Warm Regards,

redhawk1973 · 08-27-2010, 02:17 PM

I hope this helps

http://magazine.redhat.com/2007/08/1...dump-analysis/

Soadyheid · 08-27-2010, 06:03 PM

Have you tried updating the firmware on the System? Bootable CD available at:

http://h20000.www2.hp.com/bizsupport...&cc=us&mode=3&

You can also use it to make a bootable USB stick if the system doesn't have a CD/DVD-ROM (I know the DL360 G6 hasn't a CD/DVD, not sure of the DL380 G6). HP have a USB key utility to create one if required. I would have thought that HP's SmartStart Diags would have been OK for testing the RAM. (Currently V8.70, at least that's what I've got.) The status page usually tells you the state of each DIMM (it picked out some dodgy DIMMs on my ML370 G4), and running a load of test passes should exercise the RAM enough.

Hope that's helpful

Play Bonny!

pkhera_2001 · 08-30-2010, 03:42 AM

redhawk1973, thanks for sharing the link above.

Soadyheid, thanks to you too for the link to download the HP firmware ISO CD.

Current status: the server was updated and booted with the new kernel (2.6.18-196.el5 SMP), and it has been up for 5 days 14 hours.

I am not sure whether the new kernel has resolved the issue, because last time, when I changed the NMI watchdog setting to 0, the server worked fine for 10 days. So I will keep monitoring this server for a few more days (at least 5) and, if there is still an issue, I'll take further steps (firmware upgrade, kernel dump ...).

Thanks so much for your input.
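The reasoning here (a 10-day run already happened once before a crash, so 5 days 14 hours proves little) can be written down as a small check. This is only a sketch: the function names and the `margin` factor are hypothetical choices for illustration, not an established rule.

```python
def uptime_days(proc_uptime_text):
    """Convert the first field of /proc/uptime (seconds since boot) to days."""
    return float(proc_uptime_text.split()[0]) / 86400.0


def exceeds_previous_streak(current_days, previous_best_days, margin=1.5):
    """True once uptime is comfortably beyond the longest crash-free run seen before.

    margin is an arbitrary safety factor chosen for this example.
    """
    return current_days >= previous_best_days * margin
```

5 days 14 hours is 482400 seconds, so `uptime_days("482400.00 3800000.00")` is about 5.58, and `exceeds_previous_streak(5.58, 10)` is false: the current run is still shorter than the earlier 10-day run that ended in a crash.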

TB0ne · 08-30-2010, 09:48 AM

All this is great and all, and it's good you're getting things updated. But are you thinking about what's going on? You say the server WAS up for 10 days, then crashed, etc. and you're going through all these server/kernel updates. The error clearly states that it's detecting bad RAM. You could have a problem with the RAM on the system that's intermittent, or it could be as simple as the DIMMs having worked loose.

All the kernel patches in the world won't fix bad hardware. Since you've changed lots before, and STILL had a crash, I'd be willing to bet it crashes again. RAM is cheap, and if this is a production server, it's a good precaution to replace it if you get errors about it.

pkhera_2001 · 08-30-2010, 11:22 PM

TBone,

It is correct that the error mentions bad RAM, but as I already mentioned, I ran both the quick and the full diagnostics with the diagnostic disk provided by HP, and no issues were found.

Regarding replacing the RAM, I am afraid of losing the HP warranty if I do it myself, so I am following the methods suggested by HP Support and the experts at LQ.

Yes, this is a production server, and I would like to change the RAM chip if it is found to be faulty.

Thanks,

TB0ne · 08-31-2010, 08:48 AM

Right...again, if the problem is INTERMITTENT, the 'quick and full' diagnostics provided by HP may not catch it, if it doesn't act up when you're running them.

And if it's all under warranty and support, they will replace the RAM for FREE, if you ask them to, especially if you provide that log indicating an error in the RAM.

Soadyheid · 08-31-2010, 06:12 PM

@ TBOne

Quote:

kernel: Uhhuh. NMI received for unknown reason b0,
kernel: You probably have a hardware problem with your RAM chips

The error doesn't "Clearly" indicate that it's detecting bad RAM... The person who coded this particular part of the kernel thought that "You probably have a hardware problem with your RAM chips" due to the Non Maskable Interrupt, so he's "probably" guessing.
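To make the "probably guessing" point concrete, the sketch below pulls the reason byte out of the console line and lists which of its upper bits are set. It assumes the byte printed by the kernel is the value of the legacy NMI status port (0x61) and that the bits follow the conventional PC layout given in the table; both are assumptions to check against the chipset documentation, and the names are invented for this example.

```python
import re

# Assumed meaning of the upper bits of the NMI status byte (conventional PC layout).
NMI_REASON_BITS = [
    (0x80, "system error line (memory parity or SERR#)"),
    (0x40, "I/O channel check (IOCHK#)"),
    (0x20, "timer 2 output (status only)"),
    (0x10, "refresh toggle (status only)"),
]

# Tolerates the "recieved" misspelling used when the message was typed into the thread.
REASON_RE = re.compile(r"NMI rec(?:ei|ie)ved for unknown reason ([0-9a-fA-F]{2})")


def extract_reason(log_line):
    """Return the reason byte from a console line as an int, or None if absent."""
    match = REASON_RE.search(log_line)
    return int(match.group(1), 16) if match else None


def decode_reason(reason):
    """List the labels of the upper bits that are set in the reason byte."""
    return [label for mask, label in NMI_REASON_BITS if reason & mask]
```

For the line in this thread, `extract_reason(...)` yields 0xb0, and `decode_reason(0xb0)` reports the system error bit plus the two status-only bits. Under these assumptions the kernel knows only that a system error line was asserted; the "RAM chips" wording is its guess at the most likely source, which is why a bus or firmware fault cannot be ruled out from the message alone.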

My favourite is a message I got on an old SparcStation aeons ago which hung after panicking, "I've had a problem, I'm just waiting till it goes away"

@ pkhera 2001
Have you checked the Insight Management log which has nothing whatsoever to do with any operating system running on the system? What HP Diagnostics have you been running? You can download HP's latest SmartStart Diagnostic CD at:
http://h18000.www1.hp.com/products/s...ement-100.html

I've been using Version 8.40 32bit which was supplied with a DL360 G6 so it should suit you.
The system survey shows which DIMM slots are populated, the DIMM's capacity and speed. You then get, per DIMM, the memory type, description, spare part No for ordering if required.
You're also told the Correctable error threshold status, Uncorrectable error status, Correctable error threshold count and Uncorrectable error count.
Go to "Test" and check "All memory tests" and it'll run the Address test, Read test, March test, Noise test and Walk test. A single loop on my ML370 G4 with 8 GB memory takes about a couple of minutes. If you've got time, stick in 1,000 passes or so and let 'er run. Between the status and diag I'd think you should get some sort of indication.
As mentioned earlier, you can also check the logs from here... Go to "Logs" and select the Insight Management Log tab. You'll get a line by line entry of what's happened to the box since it was installed (unless someone has cleared it!!). It lists Severity (Caution, Failed, Repaired, etc.), Class (POST, Power Supply, Network, etc.), Last update, Initial update, Count and Description. Common stuff is loss of redundant power, Network line down, Cache battery recharging.

Enough for now,

Play Bonny!

pkhera_2001 · 09-01-2010, 03:37 AM

Hi!

TB0ne, yes, you are correct that HP should replace these chips. I have already shared the crash error logs with HP Support, and they sent me an online link to HP's SmartStart disk, which I used to run the diagnostics.

Soadyheid:

Quote:

The error doesn't "Clearly" indicate that it's detecting bad RAM... The person who coded this particular part of the kernel thought that "You probably have a hardware problem with your RAM chips" due to the Non Maskable Interrupt, so he's "probably" guessing.

Yes, this seems to be the reason why HP Support suggested logging a bug with the OS vendor after analyzing the diagnostic logs I generated with HP's SmartStart disk. I also looked for errors in the logs but could not find any.

Update on the server in question: it has been up for 7 days and 14 hours.

Thanks,