[SOLVED] Need help in troubleshooting/resolving random kernel panic on multiple servers

EricTRA · 06-22-2010, 05:26 AM

Hi all,

I'm having a strange problem on some of our Debian servers. It all started about three weeks ago when we moved our virtual environment (VMware ESX 3) from a SAN to a NAS (NetApp). At first I thought the move itself was to blame, but since the other 9 servers are working perfectly I ruled that out.

For over a year all 12 Debian 5 servers have been working great without any failures worth mentioning. All servers are (were) up to date with the latest patches.

About three weeks ago I started having kernel panics with the following message on three of our servers:

Code:

Code: Bad EIP value
EIP [00000000] 0x0 SS:ESP 0068:f6d7da18

Kernel panic - not syncing: Fatal exception in interrupt

and other times it looks like just a dump of hexadecimal data.

The only thing that sets those 3 servers apart from the others is that they have several mounted shares connecting to the NAS using CIFS. So I was thinking that it might have to do with an update of some kind related to Samba.

I recovered an image from a month ago, before the troubles began, copied over the data and MySQL databases and configured the 'old environment with recent data' exactly the same, with MySQL master-master replication, document synchronization and load balancing. I did this last night (no other way since it's a production environment). So far neither of the two 'restored' servers has had a kernel panic. The one that has not been restored has one at random about every hour and a half. These are the package versions that differ between the (for now) working servers and the failing one:

Code:

                FAILING             WORKING
samba-common    2:3.2.5-4lenny12    2:3.2.5-4lenny9
smbclient       2:3.2.5-4lenny12    2:3.2.5-4lenny9
smbfs           2:3.2.5-4lenny12    2:3.2.5-4lenny9
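For anyone who wants to repeat this kind of comparison, here is a minimal sketch of how it could be automated. It assumes each server's package list was dumped with something like `dpkg-query -W -f='${Package} ${Version}\n'`; the function names are made up for illustration, and the `example-unchanged` entry is a hypothetical package added only to show that identical versions are filtered out.

```python
def parse_listing(text):
    """Parse lines of the form 'package version' into a dict."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def diff_versions(failing, working):
    """Return {package: (failing_version, working_version)} where they differ."""
    diffs = {}
    for pkg in sorted(set(failing) | set(working)):
        a = failing.get(pkg)  # None if the package is missing on that host
        b = working.get(pkg)
        if a != b:
            diffs[pkg] = (a, b)
    return diffs


# Versions taken from the table above; 'example-unchanged' is hypothetical.
FAILING = """\
samba-common 2:3.2.5-4lenny12
smbclient 2:3.2.5-4lenny12
smbfs 2:3.2.5-4lenny12
example-unchanged 1.0-1
"""

WORKING = """\
samba-common 2:3.2.5-4lenny9
smbclient 2:3.2.5-4lenny9
smbfs 2:3.2.5-4lenny9
example-unchanged 1.0-1
"""

if __name__ == "__main__":
    result = diff_versions(parse_listing(FAILING), parse_listing(WORKING))
    for pkg, (bad, good) in result.items():
        print(f"{pkg:15} failing={bad}  working={good}")
```

In practice the two strings would be replaced by the contents of the dump files from each server.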

Does anybody know if there is a reported bug in one of the recent versions of the above-mentioned software packages?

Could someone help me out with this troubleshooting? I don't see anything in the logs regarding the kernel panic. When it happens the server just freezes, showing the kernel panic, and all I can do is restart. At this moment it only happens on my Nagios server (which also serves as a mail relay server).

While typing this message that one server crashed again. I'm attaching screenshots of two panics, one from yesterday and the other one from two minutes ago.

I've set up a test server with Debian Squeeze to see if those versions cause the same problem. To test it I have all the network mounts configured on that server too and am executing a script to 'browse' through the folders in a similar way to what my Nagios scripts do. The Debian Squeeze server has the following versions of the same software:

Code:

smbclient  2:3.4.8~dfsg-1
samba-common  2:3.4.8~dfsg-1
smbfs 2:4.5-2
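The 'browse' test mentioned above could look roughly like the sketch below. This is not the actual script, just an assumption of what such a test might do: walk each mount point given on the command line and stat every file, so that the CIFS client has to talk to the NAS for each entry. The function names are invented for the example.

```python
import os
import sys


def _report(exc):
    """Print I/O errors instead of aborting the walk."""
    print(f"error: {exc}", file=sys.stderr)


def browse(root):
    """Walk root, stat every file, and return (dir_count, file_count)."""
    dir_count = 0
    file_count = 0
    for dirpath, dirnames, filenames in os.walk(root, onerror=_report):
        dir_count += len(dirnames)
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                file_count += 1
            except OSError as exc:
                _report(exc)
    return dir_count, file_count


if __name__ == "__main__":
    # Mount points are passed as arguments, e.g. the CIFS shares under /mnt.
    for mount_point in sys.argv[1:]:
        dirs, files = browse(mount_point)
        print(f"{mount_point}: {dirs} directories, {files} files")
```

Run from cron or in a loop, this would generate the same kind of steady metadata traffic that a monitoring check does.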

I'm not sure that this (Samba-related difference) is causing the kernel panic but I've almost excluded any other option I can think of.

Any help is greatly appreciated.

Kind regards,

Eric

EricTRA · 07-01-2010, 01:01 AM

Hello all,

Is there really nobody that can give me some pointers, ideas, input on this problem?

Kind regards,

Eric

halvy · 07-01-2010, 05:23 AM

You said you 'restored' 2 of the servers that were having problems (after the move..) and they were working fine now..

Why not do the same to the 3rd.. and call it even.

*Sometimes* it just takes too long to hunt down an (undocumented) bug, when there are literally thousands of programs & programmers who have had input into your system.

I am not trying to discourage you, and surely you are free to hack away until you find the culprit.

I am the same way.. and one of the hardest things (for me) is to not let the constant flow of bugs and nuances, get in the way of my daily work and project goals (and time tables).

EricTRA · 07-02-2010, 12:42 AM

Hi,

Thank you for your reply. I actually recovered all 3 'failing' servers from backed-up images and haven't done any update/upgrade on any of them to avoid having the same problem. The main disadvantage is that the installed software on those servers is no longer up to date. For my Nagios server this isn't a problem since it's only internally accessible. The other two servers however are accessible from the Internet, behind firewall and Squid servers, but still I don't feel very comfortable with the servers not being updated.

I've set up a 'test' server with Debian Squeeze and the same software and CIFS mounts. I'm running a script that reads from and writes to some test folders on the mount points to see if it fails or not. It's been running for over two weeks now without any problems. Hence my next question: is Debian Squeeze considered stable enough already to run in a production environment, or should I stick with what I have and run servers that aren't updated (which I'd rather not do)?
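For reference, a read/write test of this kind might be sketched as follows. It is only an illustration under assumptions, not the script actually used: it writes random data into each test folder, forces it out with fsync, reads it back, compares checksums and deletes the file again. The file name `cifs_probe.bin` and the 60-second pause are arbitrary choices.

```python
import hashlib
import os
import sys
import time


def write_and_verify(folder, size=1024 * 1024, name="cifs_probe.bin"):
    """Write random bytes to folder, read them back, return True if they match."""
    path = os.path.join(folder, name)
    payload = os.urandom(size)
    with open(path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push the data to the server, not just the cache
    with open(path, "rb") as fh:
        readback = fh.read()
    os.remove(path)
    return hashlib.sha256(readback).digest() == hashlib.sha256(payload).digest()


if __name__ == "__main__":
    # Test folders on the CIFS mounts are passed as arguments.
    folders = sys.argv[1:]
    while folders:
        for folder in folders:
            try:
                ok = write_and_verify(folder)
                print(f"{time.ctime()} {folder}: {'ok' if ok else 'MISMATCH'}")
            except OSError as exc:
                print(f"{time.ctime()} {folder}: error: {exc}")
        time.sleep(60)
```

Note that the read may be served from the client's page cache, so this mainly exercises the write path and the open/close/unlink traffic.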

Kind regards,

Eric

halvy · 07-02-2010, 03:01 AM

Concerning any updates.. or new programs/libs you've installed relating to this problem-- have you investigated any bug (reports) related to them?

I cannot tell you which is better.. although if your current system is ok.. in your eyes.. then I would trust that, as long as you are verifying things properly.

unSpawn · 07-02-2010, 04:21 AM

Neither google://site:bugs.debian.org+smbfs+CIFS+panic nor google://site:lkml.org+CIFS+smbfs+panic shows many leads. I'm wondering what kernel you run and which Kconfig CIFS_.* options are enabled. You could try running the latest kernel version and see if that fixes things. You could also rebuild the current kernel with CIFS_DEBUG2 enabled (or run a debug kernel?) and (patch and) use netdump to send debugging information to a remote netdump server.
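One way to list the CIFS options of the running kernel, as a rough sketch: Debian normally installs the build configuration as /boot/config-<kernel release>, so the CONFIG_CIFS* lines can be pulled from there. The function name and the choice to report unset options as "n" are just conventions for this example.

```python
import platform
import re

_SET = re.compile(r"^(CONFIG_CIFS\w*)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_CIFS\w*) is not set$")


def cifs_options(config_text):
    """Return {option: value} for every CONFIG_CIFS* line in a kernel config."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"  # explicitly disabled at build time
    return options


if __name__ == "__main__":
    path = "/boot/config-" + platform.release()
    try:
        with open(path) as fh:
            found = cifs_options(fh.read())
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
    else:
        for option, value in sorted(found.items()):
            print(f"{option}={value}")
```

If CONFIG_CIFS_DEBUG2 shows up as "n", that is the option a rebuilt kernel would need enabled.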

EricTRA · 07-02-2010, 07:29 AM

Hi guys,

Thanks for your time and input. On those 3 servers I'm running kernel 2.6.26-2-686 on Debian 5.0.4, without installing any updates or upgrades. I get the occasional CIFS VFS error but other than that, those systems are running stable.

On the test server that I mentioned in a previous post I'm running kernel 2.6.32-5-686 (squeeze/sid) with all updates and possible upgrades installed.

@unSpawn: thanks for the hint. I'll check what's enabled on Monday and post here. Next step would be to install all missing updates/upgrades on an exact copy of a server, reconfigure the kernel as you indicated and see where that takes me.

Again, thanks a lot for your time and hints.

Kind regards,

Eric

EricTRA · 08-20-2010, 03:25 AM

Hi,

I upgraded the missing packages after quite some time (more than one update/upgrade went by) and it's all been working like a charm for two weeks now. I'll keep my fingers crossed and hope for the best. Thanks for the help guys.

Kind regards,

Eric