LinuxQuestions.org
Old 08-23-2010, 02:39 AM   #1
pkhera_2001
Member
 
Registered: Mar 2006
Location: New Delhi, India
Distribution: Fedora, CentOS, RHEL, Ubuntu
Posts: 67

Rep: Reputation: 18
RHEL 5.4 (Tikanga) on "HP Proliant 380 G6" down, "kernel: Uhhuh. NMI received. ...."


Hi!

We are facing an issue with an HP Proliant 380 G6 server with the following configuration:
2 X Intel Xeon Quad Core 2.40 GHz (E5530)
16 GB RAM, 1333 MHz
SAS Storage
Smart Array 4i Plus Controller.

OS Details:
RHEL 5.4 Tikanga
Kernel : 2.6.18-164.el5 SMP
gcc 4.1.2 20080704

We have set up JBoss with a few internal applications on this server, and these applications are configured to use very few resources (i.e. 2 GB of RAM). An Apache server is also running on this server, again for internal use only, so there is very little traffic and load on the server.

Problem:
The server goes into an unresponsive (stalled) state with the following error on the attached console:
kernel: Uhhuh. NMI received for unknown reason b0,
kernel: You probably have a hardware problem with your RAM chips
Dazed and confused, but trying to continue

With this error on the locally attached console, the server becomes unresponsive: it still replies to ping requests but does not allow any user to log in.
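For anyone trying to read the "reason b0" value: on PC-compatible hardware the kernel takes this byte from I/O port 0x61, and the upper four bits are status flags. Below is a minimal sketch that decodes them, assuming the conventional port 0x61 bit layout; the function name and the description strings are my own, not kernel identifiers. Bit 7 (0x80) is the one historically tied to memory parity, which is presumably why the kernel prints the RAM hint, but on current servers the same bit can also report a PCI SERR# from another device.

```python
# Status bits of I/O port 0x61 ("system control port B"), from which the
# kernel takes the NMI "reason" byte. Layout assumed from the conventional
# PC/AT definition; the description strings are illustrative only.
NMI_REASON_BITS = {
    0x80: "memory parity error or PCI SERR# reported",
    0x40: "I/O channel check (IOCHK) reported",
    0x20: "timer channel 2 output high",
    0x10: "refresh toggle bit high",
}


def decode_nmi_reason(reason):
    """Return descriptions of the status bits set in an NMI reason byte."""
    return [text for bit, text in NMI_REASON_BITS.items() if reason & bit]


if __name__ == "__main__":
    # 0xb0 is the value from the console message in this thread.
    for description in decode_nmi_reason(0xB0):
        print(description)
```

For 0xb0 the two error-relevant bits are 0x80 (set) and 0x40 (clear), so the message points at a parity/SERR# style event rather than an I/O channel check; it does not by itself identify a DIMM.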

I ran diagnostics on the server as suggested by HP support, but they could not find anything faulty, and HP support suggested consulting the OS vendor (Red Hat in my case).

Has anybody else on this forum encountered such an issue? Any input and help will be highly appreciated.

Warm Regards,
Parveen Khera
 
Old 08-23-2010, 03:14 AM   #2
born4linux
Senior Member
 
Registered: Sep 2002
Location: Philippines
Distribution: Slackware, RHEL&variants, AIX, SuSE
Posts: 1,127

Rep: Reputation: 49
Before you try calling RH, apply updates on the server.
 
Old 08-23-2010, 10:39 AM   #3
TB0ne
Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 15,102

Rep: Reputation: 2719
Did you pay attention to the error???
Quote:
Originally Posted by pkhera_2001
kernel: You probably have a hardware problem with your RAM chips
HP diagnostics run basic tests, and don't extensively test your RAM chips. As was suggested, apply any updates (since you're paying for RHEL, that's easily done through the Red Hat Network), and did you consult with Red Hat, as was suggested?

After that, try replacing the RAM chips, or running extensive memory tests. The error and the conditions that cause it are very clear.
 
Old 08-23-2010, 12:31 PM   #4
pkhera_2001
Member
 
Registered: Mar 2006
Location: New Delhi, India
Distribution: Fedora, CentOS, RHEL, Ubuntu
Posts: 67

Original Poster
Rep: Reputation: 18
Hi, thanks for your replies.

I already ran thorough diagnostics, which took more than 8 hours, and submitted the report to HP, but HP Support still says there is nothing wrong with the hardware/RAM and suggested contacting Red Hat Support.

I have now updated the OS through RHN as suggested above. It is running with the new updates but still on the older kernel (2.6.18-164.el5 SMP); the newly installed kernel is 2.6.18-196.el5 SMP. I plan to boot the server into the new kernel if the box goes down again with the same error.

If the machine still goes down after the kernel update, I'll contact Red Hat support and follow up with them as well.

Thanks for all your input; I am still looking for guidance from the experts.

Warm Regards,
Parveen Khera
 
Old 08-25-2010, 02:18 AM   #5
pkhera_2001
Member
 
Registered: Mar 2006
Location: New Delhi, India
Distribution: Fedora, CentOS, RHEL, Ubuntu
Posts: 67

Original Poster
Rep: Reputation: 18
Hi!

The server went down again yesterday and I booted it with the new kernel; it has now been up for about 11 hours 30 minutes.

Let us see if it goes down again; if it does, I'll try to reach RH.

While troubleshooting, I found that the server load shown in the top output was very high (18 during the last outage, whereas the server has only 8 cores). RAM usage was also higher than normal, but that should not bring the system down, as 6 GB of memory was still free on the machine. There were also 4 zombie processes.

I would appreciate some input on monitoring the server's current state; at the moment I am monitoring it with a shell script and the top utility.
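As one possible way to log the two figures mentioned above (load average and zombie count) without scraping top, here is a minimal sketch that reads them straight from /proc. The function names are my own, and it assumes a Python 3 interpreter is available on the box, which is not the case on a stock RHEL 5 install.

```python
import glob


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def process_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so the line is split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_zombies(stat_lines):
    """Count the stat lines whose process state is 'Z' (zombie)."""
    return sum(1 for line in stat_lines if process_state(line) == "Z")


def snapshot(proc="/proc"):
    """Return ((load1, load5, load15), zombie_count) for the running system."""
    with open(proc + "/loadavg") as handle:
        load = parse_loadavg(handle.read())
    lines = []
    for path in glob.glob(proc + "/[0-9]*/stat"):
        try:
            with open(path) as handle:
                lines.append(handle.read())
        except OSError:
            pass  # the process exited between the glob and the open
    return load, count_zombies(lines)
```

Calling snapshot() from cron every minute and appending the result to a file would leave a record of how the load climbed before a hang. A 1-minute load of 18 on 8 cores is 2.25 runnable or blocked tasks per core, which is high but not in itself fatal.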

Can anybody tell me whether taking a kernel dump will help, and how to set one up? Any online reference or how-to link?
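On RHEL 5 the usual route to a kernel dump is kdump (from the kexec-tools package), which needs memory reserved at boot through a crashkernel= kernel parameter before the kdump service can capture anything. A small sketch for checking whether the running kernel was booted with such a reservation follows; the function name is my own and the 128M@16M value in the example is only an illustration of the parameter's format, not a recommendation for this machine.

```python
def crashkernel_setting(cmdline):
    """Return the value of crashkernel= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("crashkernel="):
            return token.split("=", 1)[1]
    return None


if __name__ == "__main__":
    # Illustrative command line; on a live system read /proc/cmdline instead.
    example = "ro root=/dev/VolGroup00/LogVol00 rhgb quiet crashkernel=128M@16M"
    print(crashkernel_setting(example))
```

Note that kdump only fires when the kernel actually panics. A box that merely stalls, as described here, would need something to force a panic (for example a SysRq crash trigger) before a dump is written.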

Warm Regards,

Last edited by pkhera_2001; 08-25-2010 at 02:44 AM.
 
Old 08-27-2010, 03:17 PM   #6
redhawk1973
Member
 
Registered: Jul 2003
Location: Woodbridge VA
Distribution: Red Hat, Suse, AIX, Fedora, Cent OS, Ubuntu, Mint Linux
Posts: 58

Rep: Reputation: 18
I hope this helps

http://magazine.redhat.com/2007/08/1...dump-analysis/
 
Old 08-27-2010, 07:03 PM   #7
Soadyheid
Member
 
Registered: Aug 2010
Location: Near Edinburgh, Scotland
Posts: 835

Rep: Reputation: 150
Have you tried updating the firmware on the System? Bootable CD available at:

http://h20000.www2.hp.com/bizsupport...&cc=us&mode=3&

You can also use it to make a bootable USB stick if the system doesn't have a CD/DVD-ROM (I know the DL360 G6 hasn't a CD/DVD, not sure of the DL380 G6). HP have a USB key utility to create one if required. I would have thought that HP's SmartStart Diags would have been OK for testing the RAM. (Currently V8.70, least that's what I've got.) The status page usually tells you the state of each DIMM (it picked out some dodgy DIMMs on my ML370 G4) and running a load of test passes should exercise the RAM enough.

Hope that's helpful

Play Bonny!
 
Old 08-30-2010, 04:42 AM   #8
pkhera_2001
Member
 
Registered: Mar 2006
Location: New Delhi, India
Distribution: Fedora, CentOS, RHEL, Ubuntu
Posts: 67

Original Poster
Rep: Reputation: 18
redhawk1973, thanks for sharing above listed link.

Soadyheid, thanks to you also for providing the link to download the HP firmware ISO CD.

I would like to share the current status: the server was updated and booted with the new kernel, 2.6.18-196.el5 SMP, and has now been up for 5 days 14 hours.

I am not sure whether the new kernel has resolved the issue, because last time, when I changed the NMI watchdog setting to 0, the server worked fine for 10 days. So I will keep monitoring this server for a few more days (at least 5 more), and if there is still an issue I'll take further steps (firmware upgrade, kernel dump, ...).
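One thing that may help separate watchdog NMIs from hardware-raised ones is the NMI line in /proc/interrupts. A rough sketch for pulling out the per-CPU counters is below; the function name is my own, and the reading of the numbers is an assumption: with the NMI watchdog disabled the counters should stay close to flat, so a jump between two samples would hint at an NMI coming from somewhere else.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # newer kernels append a text description
                counts.append(int(field))
            return counts
    return []  # no NMI line found


if __name__ == "__main__":
    # Made-up sample in the /proc/interrupts layout, two CPUs shown.
    sample = (
        "           CPU0       CPU1\n"
        "  0:  123456789          0    IO-APIC-edge  timer\n"
        "NMI:          0          3\n"
    )
    print(nmi_counts(sample))
```

Sampling this alongside the load figures gives a timeline to show HP or Red Hat if the box stalls again.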

Thanks so much for providing your input.
 
Old 08-30-2010, 10:48 AM   #9
TB0ne
Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 15,102

Rep: Reputation: 2719
All this is great and all, and it's good you're getting things updated. But are you thinking about what's going on? You say the server WAS up for 10 days, then crashed, etc. and you're going through all these server/kernel updates. The error clearly states that it's detecting bad RAM. You could have a problem with the RAM on the system that's intermittent, or it could be as simple as the DIMMs having worked loose.

All the kernel patches in the world won't fix bad hardware. Since you've changed lots before, and STILL had a crash, I'd be willing to bet it crashes again. RAM is cheap, and if this is a production server, it's a good precaution to replace it if you get errors about it.
 
Old 08-31-2010, 12:22 AM   #10
pkhera_2001
Member
 
Registered: Mar 2006
Location: New Delhi, India
Distribution: Fedora, CentOS, RHEL, Ubuntu
Posts: 67

Original Poster
Rep: Reputation: 18
TBone,

It is correct that the error mentions bad RAM, but as I already mentioned, I ran both quick and full diagnostics with the diagnostic disk provided by HP and no issues were found.

Regarding replacing the RAM, I am afraid of losing the HP warranty if I do it myself, so I am following the methods suggested by HP Support and the experts at LQ.

Yes, this is a production server, and I would like to replace the RAM chip if it is found to be faulty.

Thanks,
 
Old 08-31-2010, 09:48 AM   #11
TB0ne
Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 15,102

Rep: Reputation: 2719
Right...again, if the problem is INTERMITTENT, the 'quick and full' diagnostics provided by HP may not catch it, if it doesn't act up when you're running them.

And if it's all under warranty and support, they will replace the RAM for FREE if you ask them to, especially if you provide that log indicating an error in the RAM.
 
Old 08-31-2010, 07:12 PM   #12
Soadyheid
Member
 
Registered: Aug 2010
Location: Near Edinburgh, Scotland
Posts: 835

Rep: Reputation: 150
@ TBOne
Quote:
kernel: Uhhuh. NMI received for unknown reason b0,
kernel: You probably have a hardware problem with your RAM chips
The error doesn't "Clearly" indicate that it's detecting bad RAM... The person who coded this particular part of the kernel thought that "You probably have a hardware problem with your RAM chips" due to the Non-Maskable Interrupt, so he's "probably" guessing.

My favourite is a message I got on an old SparcStation aeons ago which hung after panicking, "I've had a problem, I'm just waiting till it goes away"

@ pkhera_2001
Have you checked the Insight Management log which has nothing whatsoever to do with any operating system running on the system? What HP Diagnostics have you been running? You can download HP's latest SmartStart Diagnostic CD at:
http://h18000.www1.hp.com/products/s...ement-100.html

I've been using Version 8.40 32bit which was supplied with a DL360 G6 so it should suit you.
The system survey shows which DIMM slots are populated, the DIMM's capacity and speed. You then get, per DIMM, the memory type, description, spare part No for ordering if required.
You're also told the Correctable error threshold status, Uncorrectable error status, Correctable error threshold count and Uncorrectable error count.
Go to "Test" and check "All memory tests" and it'll run the Address test, Read test, March test, Noise test and Walk test. A single loop on my ML370 G4 with 8 GB memory takes about a couple of minutes. If you've got time, stick in 1,000 passes or so and let 'er run. Between the status and diag I'd think you should get some sort of indication.
As mentioned earlier, you can also check the logs from here... Go to "Logs" and select the Insight Management Log tab. You'll get a line by line entry of what's happened to the box since it was installed (unless someone has cleared it!!). It lists Severity (Caution, Failed, Repaired, etc.), Class (POST, Power Supply, Network, etc.), Last update, Initial update, Count and Description. Common stuff is loss of redundant power, Network line down, Cache battery recharging.
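The correctable/uncorrectable counts above come from HP's own tools. From inside Linux, the nearest equivalent is the EDAC sysfs tree, provided an EDAC driver for the memory controller is loaded, which may well not be the case for this chipset on a RHEL 5.4 kernel. A minimal sketch, assuming the standard /sys/devices/system/edac/mc layout and using a function name of my own:

```python
import glob
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: (ce_count, ue_count)} for each EDAC memory controller.

    An empty dict means no EDAC driver exposes counters under `base`.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(base, "mc[0-9]*"))):
        try:
            with open(os.path.join(mc_dir, "ce_count")) as handle:
                corrected = int(handle.read().strip())
            with open(os.path.join(mc_dir, "ue_count")) as handle:
                uncorrected = int(handle.read().strip())
        except (OSError, ValueError):
            continue  # skip controllers with missing or unreadable counters
        counts[os.path.basename(mc_dir)] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    print(read_edac_counts())
```

A non-zero corrected-error count that keeps growing would be worth attaching to the HP support case even when the offline diagnostics pass.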

Enough for now,

Play Bonny!
 
Old 09-01-2010, 04:37 AM   #13
pkhera_2001
Member
 
Registered: Mar 2006
Location: New Delhi, India
Distribution: Fedora, CentOS, RHEL, Ubuntu
Posts: 67

Original Poster
Rep: Reputation: 18
Hi!

TB0ne, yes, you are correct that HP should replace these chips. I have already shared the crash error logs with HP Support, and they gave me an online link to HP's SmartStart disk, which I used to run the diagnostics.

Soadyheid:
Quote:
The error doesn't "Clearly" indicate that it's detecting bad RAM... The person who coded this particular part of the kernel thought that "You probably have a hardware problem with your RAM chips" due to the Non Maskable Interrupt, so he's "probably" guessing.
Yes, this seems to be why HP Support suggested logging a bug with the OS vendor after analyzing the diagnostic logs I generated with HP's SmartStart disk. I also looked for errors in the logs but could not find any.

Update on the server in question: it has now been up for 7 days and 14 hours.

Thanks,
 
  


All times are GMT -5. The time now is 02:23 AM.
