RHEL3 server going into hung state

bkak · 01-18-2007, 03:12 AM

Problem definition:
We have been facing a server hang problem for the past 3 months. We have analyzed all the services we run and the server logs in /var/log/, but couldn’t find a solution. We manually reboot the server to recover it from the hung state.

Action taken:
We have analyzed the system logs and application logs on all our servers, but we haven’t found any consistent pattern of messages in the system logs. We capture the output of the top command every 15 minutes, and it showed sufficient free memory before the server went into the hung state.

System configuration:
Red Hat Enterprise Linux ES release 3 (Taroon)
Kernel: 2.4.21-40.EL
Postgres: 7.3.8-2
Redhat Cluster Manager: 1.2.28
RAM: 2GB
Server: HP ML 370 G3, DL 760 G2

Please let me know the scenarios in which a server gets into a hung state and what we need to check to rectify the server hang problem.

Thank you in advance

Lenard · 01-18-2007, 06:52 AM

Update the systems; for example, see https://rhn.redhat.com/errata/RHSA-2006-0710.html

aarontoth · 01-18-2007, 09:10 AM

I think another good idea is to set up a crash script. Make it run every 10 seconds, or whatever interval you think is appropriate. Have it report all the system status, i.e. df, top, netstat, connections, ps, etc., and have the system send out the alerts via mail. This should help a bit more than just looking at the logs.
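A minimal sketch of such a status script, assuming a POSIX-style shell. The log path, the function names, and the exact set of commands are illustrative choices, not something from the original poster's setup. Each snapshot is appended to a local file and flushed to disk, since mail may never leave a machine that is about to hang.

```shell
#!/bin/sh
# Sketch of a periodic status snapshot. SNAP_LOG and the function names
# are hypothetical; adjust the command list to what matters on your server.
SNAP_LOG="${SNAP_LOG:-/tmp/hang-snapshot.log}"

# Print a labeled section; run the command only if it is installed.
snap() {
    label="$1"; shift
    echo "=== $label ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not available)"
    fi
}

# Write one full snapshot to stdout.
take_snapshot() {
    echo "##### snapshot at $(date) #####"
    snap "disk usage"  df -h
    snap "processes"   ps aux
    snap "top"         top -b -n 1
    snap "connections" netstat -an
    snap "memory"      free -m
}

# Append a snapshot to the log and flush it to disk so the last
# entry has a chance of surviving a hard reset.
record_snapshot() {
    take_snapshot >> "$SNAP_LOG"
    sync
}

record_snapshot
```

To run it every 10 seconds, call record_snapshot from a loop with sleep 10 between iterations; to get mail as well, the output of take_snapshot can be piped to something like mail -s "status" with your own address. After a hang, the tail of the log shows the last state the script managed to record.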

AA

nwilkens · 01-25-2007, 06:41 PM

Set up the diskdump-utils or netdump package to capture the system crash (if that's what is happening). This will help you narrow down the problem.

Also, as suggested earlier, a system update may help.

nifran · 01-31-2007, 02:41 PM

Install the sysstat package so that you'll collect data on performance.

By default, the installation collects memory usage, CPU usage, disk I/O, swap usage, and a number of other statistics every 10 minutes. You can lower this to a 1-minute interval if needed in /etc/cron.d/sysstat.
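As a rough illustration of the interval change, the sketch below previews the cron file with the collector moved to a 1-minute schedule. It assumes the stock entry has the form `*/10 * * * * root /usr/lib/sa/sa1 1 1`; check your own file first, since the layout may differ. The function name is made up for this example.

```shell
# Sketch: print a cron file with the sa1 collector moved from a 10-minute
# to a 1-minute schedule. Assumes the stock entry looks like
#   */10 * * * * root /usr/lib/sa/sa1 1 1
# sysstat_every_minute is a hypothetical helper name.
sysstat_every_minute() {
    # Only lines that start with "*/10 " and invoke sa1 are rewritten;
    # everything else (comments, the daily sa2 summary) passes through.
    sed 's|^\*/10 \(.*sa1.*\)$|*/1 \1|' "$1"
}

# Preview the result on stdout without changing anything on disk:
# sysstat_every_minute /etc/cron.d/sysstat
```

The function only prints the modified text, so you can compare it with the original before editing the real file by hand.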

After the server crashes, you can run:
sar -r # memory information
sar # CPU information (like in top)
sar -q # load average and run queue sizes
sar -n DEV # network interface statistics
sar -b # I/O rates

Those should give you a very good picture of what your server was doing when it hung, as well as any trend leading up to it.
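If the hang happened on an earlier day, the same reports can be read from that day's data file and narrowed to the minutes before the hang. The sketch below assumes the data files live in /var/log/sa (the usual location on Red Hat systems) and that your sar version accepts -f, -s and -e; check man sar to be sure. SA_DIR, DRY_RUN and the function name are inventions for this example.

```shell
# Sketch: print the sar reports above for a time window from one day's file.
# SA_DIR, DRY_RUN and sar_window are hypothetical names for illustration.
SA_DIR="${SA_DIR:-/var/log/sa}"

# usage: sar_window DAY START END   e.g. sar_window 17 02:30:00 03:15:00
sar_window() {
    day="$1"; start="$2"; end="$3"
    file="$SA_DIR/sa$day"
    for opts in "-u" "-r" "-q" "-b" "-n DEV"; do
        echo "### sar $opts -f $file -s $start -e $end"
        # With DRY_RUN set, only the command lines are printed.
        # $opts is left unquoted on purpose so "-n DEV" splits into two words.
        [ -n "$DRY_RUN" ] || sar $opts -f "$file" -s "$start" -e "$end"
    done
}
```

For example, sar_window 17 02:30:00 03:15:00 would show CPU, memory, run queue, I/O and network figures recorded on the 17th between 02:30 and 03:15. The last sample before the gap in the data is usually the closest view of the system before it hung.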

Other than that, we've experienced a lot of the same problems with some of our machines. It turned out that the running kernel wasn't certified for the processors we were running on, and updating the kernel fixed our issues. Take a look at the release notes for the newer kernels to see if they have added support for your server or processors.
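To compare against the release notes, it helps to have the running kernel and the processor models in front of you. A small sketch, assuming a Linux system with /proc/cpuinfo; the function name is made up:

```shell
# Sketch: show the running kernel, the installed kernel packages (if rpm
# is present) and the CPU models. kernel_and_cpu is a hypothetical name.
kernel_and_cpu() {
    echo "running kernel: $(uname -r)"
    if command -v rpm >/dev/null 2>&1; then
        rpm -q kernel
    fi
    if [ -r /proc/cpuinfo ]; then
        # One line per distinct CPU model, prefixed with how many there are.
        grep 'model name' /proc/cpuinfo | sort | uniq -c
    fi
}

kernel_and_cpu
```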