Red Hat ES 3 server crashed???

ginda · 04-20-2007, 04:08 AM

Our Red Hat ES 3 file & print server has crashed on each of the past two days. You can ping it, and when I scan the ports it shows ssh and others as open, but I cannot ssh into it. I have looked in the logs but can't find any obvious issues.
The only weird thing I've noticed is that the messages log shows no activity from

april 18th 07 - 17:56 TO april 19th 07 - 08:21
AND
april 19th 07 - 08:21 TO april 19th 07 - 09:21

The above shows a one-hour gap before we physically rebooted the server at about 9:20.

The same issue happened today:

april 19th 07 - 19:24 TO april 20th 07 - 08:21
AND
april 20th 07 - 08:21 TO april 20th 07 - 09:21

Again the above shows a one-hour gap before we rebooted the server. Please see these examples from the messages log:

Apr 19 19:24:34 dud2wsfs01 smbd[7517]: [2007/04/19 19:24:34, 0] lib/util_sock.c:write_socket(455)
Apr 19 19:24:34 dud2wsfs01 smbd[7517]: write_socket: Error writing 4 bytes to socket 22: ERRNO = Connection reset by peer
Apr 19 19:24:34 dud2wsfs01 smbd[7517]: [2007/04/19 19:24:34, 0] lib/util_sock.c:send_smb(647)
Apr 19 19:24:34 dud2wsfs01 smbd[7517]: Error writing 4 bytes to client. -1. (Connection reset by peer)
Apr 20 08:21:06 dud2wsfs01 syslogd 1.4.1: restart.
Apr 20 08:21:06 dud2wsfs01 syslog: syslogd startup succeeded
Apr 20 08:21:06 dud2wsfs01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr 20 08:21:06 dud2wsfs01 kernel: Linux version 2.4.21-4.ELsmp (bhcompile@daffy.perf.redhat.com) (gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-20)) #1 SMP Fri Oct 3 17:52:56 EDT 2003
Apr 20 08:21:06 dud2wsfs01 kernel: BIOS-provided physical RAM map:
Apr 20 08:21:06 dud2wsfs01 kernel: BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
Apr 20 08:21:06 dud2wsfs01 kernel: BIOS-e820: 0000000000100000 - 000000003ffc0000 (usable)

Apr 20 08:21:14 dud2wsfs01 kernel: RAMDISK: Compressed image found at block 0
Apr 20 08:21:14 dud2wsfs01 kernel: Freeing initrd memory: 325k freed
Apr 20 08:21:15 dud2wsfs01 kernel: VFS: Mounted root (ext2 filesystem).
Apr 20 09:21:14 dud2wsfs01 ntpdate[3266]: step time server 172.16.200.240 offset 3599.962831 sec
Apr 20 09:21:15 dud2wsfs01 kernel: SCSI subsystem driver Revision: 1.00
Apr 20 09:21:15 dud2wsfs01 ntpd: succeeded
Apr 20 09:21:15 dud2wsfs01 kernel: scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
Apr 20 09:21:15 dud2wsfs01 kernel: <Adaptec 3960D Ultra160 SCSI adapter>
Apr 20 09:21:15 dud2wsfs01 ntpd[3270]: ntpd 4.1.2@1.892 Tue Feb 24 06:32:25 EST 2004 (1)
Apr 20 09:21:15 dud2wsfs01 ntpd: ntpd startup succeeded
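
For anyone wanting to find quiet periods like these without reading the log by eye, here is a minimal Python 3 sketch. It assumes the classic syslog timestamp format shown above (no year, so the year is passed in), and the 30-minute threshold and the `find_gaps` name are my own choices, not anything from the server. The sample lines are copied from the excerpt above.

```python
from datetime import datetime, timedelta

def find_gaps(lines, year=2007, min_gap=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs separated by at least min_gap."""
    gaps = []
    prev = None
    for line in lines:
        try:
            # Classic syslog lines start with e.g. "Apr 19 19:24:34" (15 characters).
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip blank lines or lines without a leading timestamp
        if prev is not None and stamp - prev >= min_gap:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps

sample = [
    "Apr 19 19:24:34 dud2wsfs01 smbd[7517]: Error writing 4 bytes to client. -1. (Connection reset by peer)",
    "Apr 20 08:21:06 dud2wsfs01 syslogd 1.4.1: restart.",
    "Apr 20 08:21:15 dud2wsfs01 kernel: VFS: Mounted root (ext2 filesystem).",
    "Apr 20 09:21:14 dud2wsfs01 ntpdate[3266]: step time server 172.16.200.240 offset 3599.962831 sec",
]

gaps = find_gaps(sample)
for before, after in gaps:
    print(before, "->", after, "silent for", after - before)
```

On a real system you would pass the lines of /var/log/messages instead of `sample`. Note that a log spanning a year boundary would need extra handling, since the timestamps carry no year.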

If you don't see anything unusual, could you recommend a monitoring tool to help identify possible causes?

Thanks in advance.

MensaWater · 04-20-2007, 10:45 AM

If it happened at 8:21 each day, that sounds like a timed event, which should immediately lead you to investigate scheduling tools. The basic one is cron. Have a look at /var/log/cron to see what may have kicked off at 8:21 (or more likely 8:20) or sometime before that (e.g. 8:00).

You can look in /var/spool/cron to see what the files there (if any) may be kicking off.

Also you can look in the files in /etc/cron* to see what they may be kicking off.

Another utility some people use is anacron - you'd have to check the man page for that for what files it uses as I don't use it.

Finally there are of course commercial scheduling tools (and probably other open source ones) like Tivoli Workload Scheduler that could be scheduling things. I don't know whether that one has a Linux agent, but it (previously called Maestro) is fairly popular in large UNIX shops.
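
As a rough aid for the cron check, here is a small Python 3 sketch that lists crontab entries whose minute and hour fields would fire at a given time. It is only an approximation: it ignores the day-of-month, month and day-of-week fields, and the function names and the sample crontab text are made up for illustration (the two /usr/local/bin scripts are hypothetical).

```python
def field_matches(field, value, low, high):
    """True if one crontab time field ('*', '*/10', '0-30', '1,21', ...) matches value."""
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = low, high
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = end = int(rng)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False

def jobs_at(crontab_text, hour, minute):
    """Return crontab lines whose minute and hour fields match the given time."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        fields = line.split(None, 5)
        # Skip blanks, comments, VAR=value settings and lines too short to be a job.
        if not line or line.startswith("#") or len(fields) < 6 or "=" in fields[0]:
            continue
        try:
            if field_matches(fields[0], minute, 0, 59) and field_matches(fields[1], hour, 0, 23):
                hits.append(line)
        except ValueError:
            continue  # not a plain numeric schedule, e.g. '@reboot'
    return hits

# Hypothetical crontab contents, for illustration only.
sample = """SHELL=/bin/bash
01 * * * * root run-parts /etc/cron.hourly
20 8 * * * root /usr/local/bin/nightly-report.sh
*/10 * * * * root /usr/local/bin/check-queue.sh
02 4 * * * root run-parts /etc/cron.daily
"""

hits = jobs_at(sample, hour=8, minute=20)
for job in hits:
    print(job)
```

To use it for real, read /etc/crontab or the files under /var/spool/cron into a string and pass that in place of `sample`.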

ginda · 04-24-2007, 04:32 AM

It's happened again; the server has frozen up. I have identified that the 8:21 to 9:21 gap was an ntpdate time update running at boot.

I have looked through the messages logs but can't find any obvious cause of the system hang. Can someone advise what I could do to identify what is causing the hang?

Thanks

MensaWater · 04-24-2007, 09:24 AM

If everything is up but you can't log in, one culprit might be NFS (or possibly Samba) mounts. When you log in there is an attempt to check quotas on all drives (even if you haven't set any quotas). If you have mounted an NFS share (or maybe Samba) but the server that is the source of the mount is down, the filesystem is inaccessible but your mount table (/etc/mtab on Linux) still indicates it is mounted. You would therefore see long delays while it tries to check quotas. This might eventually let you in after a timeout.

Check to see if you have any mounts to this server from others via NFS or Samba. If so see if something is happening to those servers (e.g. daily reboot).
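
A quick way to list such mounts is to filter the mount table by filesystem type. The Python 3 sketch below assumes the usual /proc/mounts layout (device, mount point, type, options, ...); the set of types and the sample text are my own illustration, not taken from this server.

```python
# Filesystem types treated as network mounts (an assumption; extend as needed).
NETWORK_FS = {"nfs", "nfs4", "smbfs", "cifs"}

def network_mounts(mounts_text):
    """Return (source, mount point, type) for each network filesystem listed."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in NETWORK_FS:
            found.append((fields[0], fields[1], fields[2]))
    return found

# Hypothetical mount table contents, for illustration only.
sample = """/dev/sda1 / ext3 rw 0 0
none /proc proc rw 0 0
fileserver:/export/home /mnt/home nfs rw,hard,intr 0 0
//winbox/share /mnt/share smbfs rw 0 0
"""

mounts = network_mounts(sample)
for source, mount_point, fs_type in mounts:
    print(fs_type, source, "on", mount_point)
```

On the server itself you would read /proc/mounts (or /etc/mtab) and pass its contents in place of `sample`. An empty result would suggest the box is not a client of any remote share.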

ginda · 04-25-2007, 04:53 AM

Hi

This server is used by about 17 users for Samba shares. I am currently keeping a couple of ssh sessions open to the server, monitoring CPU, RAM and logged-in users, and tailing logs and processes, to hopefully catch whatever is causing the problem when the system crashes.
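
Since an ssh session dies with the machine, one complementary approach is to append a timestamped sample to a file on disk every few seconds and force it to disk, so the last line written shows roughly when the hang began and what the load and memory looked like. This is only a sketch in Python 3 under my own assumptions: the function names, the output format and the 10-second interval are arbitrary, and it reads the standard /proc/meminfo "Name: value kB" lines.

```python
import os
import time

def parse_meminfo(text):
    """Parse 'Name:  12345 kB' lines from /proc/meminfo into {name: number}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def format_sample(stamp, load, meminfo):
    """Build one log line from a timestamp, load averages and parsed meminfo."""
    return "%s load=%.2f/%.2f/%.2f MemFree=%skB SwapFree=%skB" % (
        stamp, load[0], load[1], load[2],
        meminfo.get("MemFree", "?"), meminfo.get("SwapFree", "?"))

def run(path, interval=10, count=None):
    """Append a sample to path every interval seconds (forever if count is None)."""
    written = 0
    with open(path, "a") as out:
        while count is None or written < count:
            with open("/proc/meminfo") as f:
                mem = parse_meminfo(f.read())
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            out.write(format_sample(stamp, os.getloadavg(), mem) + "\n")
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before a possible hang
            written += 1
            time.sleep(interval)

# Small demonstration with made-up values; run() is not started here.
demo_mem = parse_meminfo("MemTotal:  1030644 kB\nMemFree:  12345 kB\nSwapFree: 2048 kB\n")
demo_line = format_sample("2007-04-25 04:53:00", (0.5, 0.25, 0.1), demo_mem)
print(demo_line)
```

Calling `run("/var/tmp/heartbeat.log")` (a hypothetical path) would start the sampling loop. After the next forced reboot, the final line in that file gives a tighter bound on the time of the hang than the gap in the messages log.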

MensaWater · 04-25-2007, 07:50 AM

Samba shares.

Meaning it is a Samba server on which the filesystems exist natively and which others mount on their systems, or meaning it is a Samba client that has filesystems mounted from other systems? If the latter, then what I said about quotas may be the issue.

djjoshuad · 04-25-2007, 12:47 PM

It's fairly safe to say that when a *nix system becomes completely unresponsive, the cause is usually a bad piece of critical hardware. If it truly is unresponsive, I'd look into the CPU and memory, and maybe try one of the many memory testing programs that will analyze that for you (memtest86 is a good one). It could be something as peripheral as a NIC or USB controller that could be replaced or simply turned off, but usually if those things are locking the system, you know about it before it ever gets a chance to come all the way up.

Before you go down any real investigative roads, I advise getting as much information about your problem as you can. Are you 100% sure that the system is completely unresponsive? Have you tried connecting to the console and checking it out from there? Are there any other symptoms that might give you some clues as to the general health of the server?

ginda · 04-26-2007, 09:55 AM

Hi

It is a Samba and CUPS server. The server has crashed a few times now and I have always tried to ssh into it with no luck. To be sure, I have also pinged and port scanned the server, and it showed it was up with the correct ports open.

I have then travelled to the physical location of the server to see if I can log in on the main console, but that has also crashed.