LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 06-22-2010, 05:26 AM   #1
EricTRA
LQ Guru
 
Registered: May 2009
Location: Gibraltar, Gibraltar
Distribution: Fedora 20 with Awesome WM
Posts: 6,805
Blog Entries: 1

Rep: Reputation: 1297
Need help in troubleshooting/resolving random kernel panic on multiple servers


Hi all,

I'm having a strange problem on some of our Debian servers. It all started about three weeks ago when we moved our virtual environment (VMWare ESX3) from a SAN to a NAS (NetApp). At first I thought it had to do with that move but since the other 9 servers are working perfectly I eliminated that idea.

For over a year all 12 Debian 5 servers have been working great without any notable failures. All servers are (were) up to date with the latest patches.

About three weeks ago I started having kernel panics with the following message on three of our servers:
Code:
Code: Bad EIP value
EIP [00000000] 0x0 SS:ESP 0068:f6d7da18

Kernel panic - not syncing: Fatal exception in interrupt
and other times it looks like just a dump of hexadecimal data.

The only thing that sets those 3 servers apart is that they have several mounted shares connecting to the NAS using CIFS. So I was thinking that it might have to do with some SMB-related update.

I restored an image from a month ago, before the trouble began, copied over the data and MySQL databases, and configured the 'old environment with recent data' exactly as before, with MySQL master-master replication, document synchronization and load balancing. I did this last night (there was no other way, since it's a production environment). So far neither of the two 'restored' servers has had a kernel panic. The one that has not been restored panics at random, roughly every hour and a half. The following package versions differ between the failing server and the (for now) working server(s):
Code:
Package         FAILING             WORKING
samba-common    2:3.2.5-4lenny12    2:3.2.5-4lenny9
smbclient       2:3.2.5-4lenny12    2:3.2.5-4lenny9
smbfs           2:3.2.5-4lenny12    2:3.2.5-4lenny9
Does anybody know if there is a reported bug in one of the recent versions of the above-mentioned packages?
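For anyone wanting to repeat this comparison across more packages, here is a minimal Python sketch. It assumes package listings were captured on each host with something like `dpkg-query -W` (one "name version" pair per line). The function names and the `unchanged-example` package are made up for illustration; only the three Samba versions come from the table above.

```python
def parse_versions(listing):
    """Map package name -> version from whitespace-separated 'name version' lines."""
    versions = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            versions[parts[0]] = parts[1]
    return versions


def version_differences(failing_listing, working_listing):
    """Return {package: (failing_version, working_version)} for packages that differ."""
    failing = parse_versions(failing_listing)
    working = parse_versions(working_listing)
    return {
        pkg: (failing.get(pkg), working.get(pkg))
        for pkg in sorted(set(failing) | set(working))
        if failing.get(pkg) != working.get(pkg)
    }


# Versions taken from the table above; 'unchanged-example' is a hypothetical
# package added only to show that identical versions are filtered out.
FAILING = """samba-common\t2:3.2.5-4lenny12
smbclient\t2:3.2.5-4lenny12
smbfs\t2:3.2.5-4lenny12
unchanged-example\t1.0"""

WORKING = """samba-common\t2:3.2.5-4lenny9
smbclient\t2:3.2.5-4lenny9
smbfs\t2:3.2.5-4lenny9
unchanged-example\t1.0"""

for pkg, (bad, good) in version_differences(FAILING, WORKING).items():
    print("%-15s failing=%s working=%s" % (pkg, bad, good))
```

Comparing full listings rather than only the suspected packages may reveal other differences (a kernel image update, for example) that changed in the same window.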

Could someone help me out with this troubleshooting? I don't see anything in the logs about the kernel panic. When it happens the server just freezes, showing the kernel panic, and all I can do is restart. At this moment it only happens on my Nagios server (which also serves as a mail relay server).

While typing this message that one server crashed again. I'm attaching screenshots of two panics, one from yesterday and the other one from two minutes ago.

I've set up a test server with Debian Squeeze to see if those versions cause the same problem. To test it I have all the network mounts configured on that server too and am running a script that 'browses' through the folders in a similar way to my Nagios scripts. The Debian Squeeze server has the following versions of the same software:
Code:
smbclient  2:3.4.8~dfsg-1
samba-common  2:3.4.8~dfsg-1
smbfs 2:4.5-2
I'm not sure that this (Samba-related difference) is causing the kernel panic, but I've almost excluded every other option I can think of.

Any help is greatly appreciated.

Kind regards,

Eric

Last edited by EricTRA; 10-12-2010 at 03:22 AM.
 
Old 07-01-2010, 01:01 AM   #2
EricTRA
Hello all,

Is there really nobody that can give me some pointers, ideas, input on this problem?

Kind regards,

Eric
 
Old 07-01-2010, 05:23 AM   #3
halvy
Member
 
Registered: Aug 2005
Location: Anchorage, Alaska (soon EU, hopefully)
Distribution: Anything NOT SystemD (ie. M$) related.
Posts: 918

Rep: Reputation: 42
You said you 'restored' 2 of the servers that were having problems (after the move..) and they were working fine now..

Why not do the same to the 3rd.. and call it even.

*Sometimes* it just takes too long to hunt down an (undocumented) bug, when there are literally thousands of programs & programmers who have had input into your system.

I am not trying to discourage you, and surely you are free to hack away until you find the culprit.

I am the same way.. and one of the hardest things (for me) is to not let the constant flow of bugs and nuances, get in the way of my daily work and project goals (and time tables).
 
Old 07-02-2010, 12:42 AM   #4
EricTRA
Hi,

Thank you for your reply. I actually restored the 3 'failing' servers from backed-up images and haven't done any update/upgrade on any of them, to avoid having the same problem. The main disadvantage is that the installed software on those servers is no longer up to date. For my Nagios server this isn't a problem since it's only internally accessible. The other two servers, however, are accessible from the Internet, behind firewall and Squid servers, but I still don't feel very comfortable with the servers not being updated.

I've set up a 'test' server with Debian Squeeze and the same software and CIFS mounts. I'm running a script that reads/writes to some test folders on the mount points to see whether it fails. It's been running for over two weeks now without any problems. Hence, my next question: Is Debian Squeeze considered stable enough already to run in a production environment, or should I stick with what I have and run servers that aren't updated (which I'd rather not)?
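The actual test script isn't shown in this thread. As a rough idea of what such a read/write exerciser could look like, here is a Python sketch under my own assumptions: the function name, the probe file name `.cifs_probe`, and the demo directory are all hypothetical, and the real mount points would have to be substituted.

```python
import os
import tempfile


def exercise_mount(path, payload=b"cifs-test\n"):
    """Browse `path` recursively, then write, read back and delete a probe file.

    Returns the number of directory entries seen while browsing.
    """
    seen = 0
    for _root, dirs, files in os.walk(path):
        seen += len(dirs) + len(files)

    probe = os.path.join(path, ".cifs_probe")  # hypothetical probe file name
    with open(probe, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push the write out to the share
    with open(probe, "rb") as fh:
        if fh.read() != payload:
            raise IOError("read-back mismatch on %s" % path)
    os.remove(probe)
    return seen


# Demo against a local temporary directory; on the test server this would be
# a loop over the real CIFS mount points instead.
demo_dir = tempfile.mkdtemp()
print("entries seen:", exercise_mount(demo_dir))
```

Running this in a loop with a short sleep between passes, from cron or a shell loop, would roughly approximate the periodic access pattern of monitoring checks.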

Kind regards,

Eric
 
Old 07-02-2010, 03:01 AM   #5
halvy
Concerning any updates.. or new programs/libs you've installed relating to this problem-- have you investigated any bug (reports) related to them?

I cannot tell you which is better.. although if your current system is ok.. in your eyes.. then I would trust that, as long as you are verifying things properly.
 
Old 07-02-2010, 04:21 AM   #6
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Neither google://site:bugs.debian.org+smbfs+CIFS+panic nor google://site:lkml.org+CIFS+smbfs+panic shows many leads. I'm wondering what kernel you run and which Kconfig CIFS_.* options are enabled. You could try running the latest kernel version to see if that fixes things. You could also rebuild the current kernel with CIFS_DEBUG2 enabled (or run a debug kernel?) and (patch and) use netdump to send debugging information to a remote netdump server.
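One way to list the CIFS-related Kconfig options is to read the kernel config file that Debian installs as /boot/config-<kernel release>. The Python sketch below is only an illustration: the function names are made up, and the sample config text is hypothetical rather than taken from any of the servers in this thread.

```python
import os
import platform


def cifs_options(config_text):
    """Return {option: value} for CONFIG_CIFS* lines; unset options map to 'n'."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_CIFS") and line.endswith(" is not set"):
            # Disabled options appear as "# CONFIG_FOO is not set".
            options[line[2:].split()[0]] = "n"
        elif line.startswith("CONFIG_CIFS") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def running_kernel_config_path():
    """Path where Debian normally stores the running kernel's config."""
    return "/boot/config-" + platform.release()


# Hypothetical sample, used when the real config file is not present.
SAMPLE = """CONFIG_CIFS=m
# CONFIG_CIFS_DEBUG2 is not set
CONFIG_EXT3_FS=y
"""

path = running_kernel_config_path()
if os.path.exists(path):
    with open(path) as fh:
        text = fh.read()
else:
    text = SAMPLE

for name, value in sorted(cifs_options(text).items()):
    print(name, value)
```

An option reported as `n` (such as CIFS_DEBUG2 in the sample) would need a kernel rebuild to turn on, which is the step suggested above.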
 
Old 07-02-2010, 07:29 AM   #7
EricTRA
Hi guys,

Thanks for your time and input. Those 3 servers are running kernel 2.6.26-2-686 on Debian 5.0.4, with no updates or upgrades installed since the restore. I get the occasional CIFS VFS error, but other than that those systems are running stable.

On the test server that I mentioned in a previous post I'm running kernel 2.6.32-5-686 (squeeze/sid) with all updates and possible upgrades installed.

@unSpawn: thanks for the hint. I'll check what's enabled on Monday and post here. Next step would be to install all missing updates/upgrades on an exact copy of a server, reconfigure the kernel as you indicated and see where that takes me.

Again, thanks a lot for your time and hints.

Kind regards,

Eric
 
Old 08-20-2010, 03:25 AM   #8
EricTRA
Hi,

I upgraded the missing packages after quite some time (more than one update/upgrade went by in the meantime) and it has all been working like a charm for two weeks now. I'll keep my fingers crossed and hope for the best. Thanks for the help, guys.

Kind regards,

Eric
 
  

