I'm having a strange problem on some of our Debian servers. It all started about three weeks ago when we moved our virtual environment (VMware ESX 3) from a SAN to a NAS (NetApp). At first I thought it had to do with that move, but since the other 9 servers are working perfectly I ruled that out.
For over a year all 12 Debian 5 servers have been working great without any notable failures. All servers are (or were) up to date with the latest patches.
About three weeks ago I started having kernel panics with the following message on three of our servers:
Code: Bad EIP value
EIP  0x0 SS:ESP 0068:f6d7da18
Kernel panic - not syncing: Fatal exception in interrupt
Other times it just looks like a dump of hexadecimal data.
The only thing that sets those 3 servers apart from the other 9 is that they have several shares mounted from the NAS over CIFS. So I was thinking that it might have to do with a recent update related to SMB.
I recovered an image from a month ago, from before the troubles began, copied over the data and MySQL databases, and configured this 'old environment with recent data' exactly like the original, with MySQL master-master replication, document synchronization and load balancing. I did this last night (there was no other way since it's a production environment). So far neither of the two 'restored' servers has had a kernel panic. The one that has not been restored has one at random intervals, about every hour and a half. The following package versions differ between the currently working server(s) and the failing one:
samba-common 2:3.2.5-4lenny12 2:3.2.5-4lenny9
smbclient 2:3.2.5-4lenny12 2:3.2.5-4lenny9
smbfs 2:3.2.5-4lenny12 2:3.2.5-4lenny9
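In case anyone wants to compare their own machines the same way, here is a minimal sketch of how such a list of differing versions can be produced. It assumes the package list of each server has been saved as text with dpkg-query -W -f='${Package} ${Version}\n'; the function names are just illustrative, and the sample input reuses the version strings above without saying which server they belong to.

```python
def parse_dpkg_list(text):
    """Parse lines of 'package version' into a {package: version} dict."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            versions[parts[0]] = parts[1]
    return versions


def diff_versions(list_a, list_b):
    """Return {package: (version_a, version_b)} for packages that differ.

    A package missing on one side is reported with None for that side.
    """
    a, b = parse_dpkg_list(list_a), parse_dpkg_list(list_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Sample data only; real input would be the saved dpkg-query output.
    host_a = "smbclient 2:3.2.5-4lenny12\nsmbfs 2:3.2.5-4lenny12\n"
    host_b = "smbclient 2:3.2.5-4lenny9\nsmbfs 2:3.2.5-4lenny9\n"
    for name, (va, vb) in diff_versions(host_a, host_b).items():
        print(name, va, vb)
```

Packages with identical versions on both sides are left out, so only the candidates for the regression remain.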
Does anybody know if there is a reported bug in one of the recent versions of the above-mentioned packages?
Could someone help me out with troubleshooting this? I don't see anything in the logs regarding the kernel panic. When it happens the server just freezes, showing the kernel panic on the console, and all I can do is restart. At this moment it only happens on my Nagios server (which also serves as a mail relay server).
While typing this message that one server crashed again. I'm attaching screenshots of two panics, one from yesterday and the other one from two minutes ago.
I've set up a test server with Debian Squeeze to see if those versions cause the same problem. To test it I have all the network mounts configured on that server too and am running a script that 'browses' through the folders in a similar way to my Nagios scripts. The Debian Squeeze server has the following versions of the same software:
I'm not sure that this (Samba-related difference) is causing the kernel panic, but I've almost excluded every other option I can think of.
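For reference, the 'browse' test on the Squeeze server does roughly what the sketch below does. This is not the actual script, only a minimal Python 3 illustration of the idea: walk each CIFS mount, stat every file, and count what was seen. The function name and the mount paths passed on the command line are placeholders.

```python
import os
import sys


def browse(root):
    """Walk root and stat every file; return (dirs, files, errors)."""
    errors = []  # collects exceptions from unreadable directories and files
    dirs = files = 0
    for dirpath, dirnames, filenames in os.walk(root, onerror=errors.append):
        dirs += 1
        for name in filenames:
            try:
                # stat forces a metadata request to the server for each file
                os.stat(os.path.join(dirpath, name))
                files += 1
            except OSError as exc:
                errors.append(exc)
    return dirs, files, len(errors)


if __name__ == "__main__":
    # Mount points are given as arguments; the paths are up to the caller.
    for mount in sys.argv[1:]:
        print(mount, browse(mount))
```

Running it in a loop against all mounts keeps steady CIFS traffic going, which is the point of the test: to see whether that load alone is enough to trigger the panic.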
Any help is greatly appreciated.