LinuxQuestions.org
Old 06-13-2018, 09:00 AM   #1
dema0024
LQ Newbie
 
Registered: Jun 2018
Posts: 5

Rep: Reputation: Disabled
Server Freeze


I am working on two RHEL 4.8 servers that run redundant software. Every few days one of the servers (whichever is active at the time, it seems) freezes. By freeze I mean completely: I can't log in from the KVM, I can't log in over SSH, and the only fix is to hard reboot the server. There are no application logs and nothing in /var/log/messages from when the freeze begins until the reboot; you can see a gap of about 30 minutes in /var/log/messages, for example. Apart from that gap, nothing in the application log or /var/log/messages before or after the freeze suggests any issue.

Usually, if something happens to one server, the other should automatically take over, as it has a floating IP. In the software logs of the server that takes over you can see slow and/or unresponsive pings to the other server, so it takes over. Unable to communicate with the other server, it goes from standby to standalone, and does not go to primary until the other system is rebooted. Unfortunately the web servers don't realise this and still attempt to connect users to the frozen server, so no one can log on until the frozen application server is restarted. This is obviously a big problem.

With no logs to work with it is very hard to diagnose the cause of this freeze. Looking online I see many suggestions of memory issues, but what are the chances that both servers have bad memory? My next step is to upgrade the firmware on both servers (both HP ProLiant, I believe G8), but if that doesn't work, what should I try next? Is there any way to get any useful information from the system as to why it froze? If it matters, the application runs an Oracle database.
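To pin down exactly when each freeze started, the gap in /var/log/messages can be located mechanically. The following is only a rough sketch: the function name find_log_gaps is made up for illustration, it assumes the classic "Mon DD HH:MM:SS host ..." syslog prefix, and it only compares lines that fall on the same calendar day.

```shell
# find_log_gaps [MIN_GAP]: read syslog-style text on stdin and report every
# pair of consecutive lines that are more than MIN_GAP seconds apart
# (default 600). Hypothetical helper, not part of any package.
find_log_gaps() {
    awk -v min_gap="${1:-600}" '
    {
        # Field 3 is HH:MM:SS; convert it to seconds since midnight
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        day = $1 " " $2
        # Only compare lines from the same day, to keep the sketch simple
        if (NR > 1 && day == prev_day && now - prev > min_gap)
            printf "gap of %d s: %s %s -> %s %s\n", now - prev, prev_day, prev_time, day, $3
        prev = now; prev_day = day; prev_time = $3
    }'
}

# Typical use (as root):
#   find_log_gaps 600 < /var/log/messages
```

The last line logged before each reported gap is the closest thing to a "time of death" the box will give you, and it can be lined up against the sar samples and the standby server's logs.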
 
Old 06-13-2018, 01:51 PM   #2
MensaWater
LQ Guru
 
Registered: May 2005
Location: Atlanta Georgia USA
Distribution: Redhat (RHEL), CentOS, Fedora, CoreOS, Debian, FreeBSD, HP-UX, Solaris, SCO
Posts: 7,406
Blog Entries: 15

Rep: Reputation: 1424
At a guess, some runaway process ate up all of a resource, and at that point there was nothing left to let you log in or even run a shutdown. The usual culprit is something like the number of open processes or open files. If you have sar collections running you can look at past sar daily files with:
sar -f /var/log/sa/saDD
where DD is a day of the month (01, 11, 21, 31 or any number in between).

Examining the output might show you where something was steadily growing. I once did the above for just one subsection:
sar -f /var/log/sa/saDD -q
and saw that the number of open processes (plist-sz column) had grown steadily every 10 minutes until it reached over 32000 (presumably the limit was 32 K, i.e. 32768, but sar never actually shows that number, probably because once the limit was hit sar itself could no longer run).

This kind of unrestrained growth can occur if you set ulimit values to "unlimited".
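To check every saved day rather than one file at a time, something along these lines could work. It is only a sketch: peak_plist is a made-up helper name, and the loop in the comments assumes the daily files live under /var/log/sa/ (the usual sysstat location on Red Hat systems; adjust if yours differs).

```shell
# peak_plist: read "sar -q" text on stdin and print the largest plist-sz seen.
# Hypothetical helper for illustration.
peak_plist() {
    awk '
    /plist-sz/ {
        # Header line: remember which field holds plist-sz (its position
        # depends on whether the timestamp carries an AM/PM field)
        for (i = 1; i <= NF; i++) if ($i == "plist-sz") col = i
        next
    }
    # Keep numeric samples only; skip Average, LINUX RESTART and blank lines
    col && $1 != "Average:" && $col ~ /^[0-9]+$/ {
        if ($col + 0 > max) max = $col + 0
    }
    END { if (col) print max + 0 }'
}

# Typical use, one line of output per saved day:
#   for f in /var/log/sa/sa[0-9][0-9]; do
#       printf '%s ' "$f"; sar -q -f "$f" | peak_plist
#   done
```

A day whose peak is far above the others is the one worth reading in full.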
 
Old 06-13-2018, 02:46 PM   #3
dema0024
LQ Newbie
 
Registered: Jun 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
Thanks for the quick reply. I am not sure whether the system has sar on it; I can take a look. If it doesn't, would the rpm be on the installation disc, or would I have to find it online? As for ulimit, I am not a developer, but I do know we set ulimit -c unlimited, which only affects the size of core files. As far as I know we don't set any other ulimits.
 
Old 06-14-2018, 10:22 AM   #4
dema0024
LQ Newbie
 
Registered: Jun 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
So it turns out we also set ulimit -s for unlimited stack size. I think part of why we don't have useful logs of what was going on before the freeze is that our application was set to one of its lowest logging levels, so we are increasing the logging level in hopes of catching what causes it next time.
 
Old 06-14-2018, 10:37 AM   #5
MensaWater
LQ Guru
 
Registered: May 2005
Location: Atlanta Georgia USA
Distribution: Redhat (RHEL), CentOS, Fedora, CoreOS, Debian, FreeBSD, HP-UX, Solaris, SCO
Posts: 7,406
Blog Entries: 15

Rep: Reputation: 1424
I don't have RHEL4. On a RHEL5 system I see sar is part of the sysstat package. If you run "rpm -ql sysstat" you can see if it is on your RHEL4 as well.

If you run "ulimit -a" you'll see all the limits for the user you're logged in as.

In RHEL5 and later the limits are set in /etc/security/limits.conf (in RHEL6 there is also /etc/security/limits.d/<files> for some specific limits, most notably nproc, which override the limits.conf values). I *think* /etc/security/limits.conf was there in RHEL4 as well.

Setting limits is not a "developer" task. It is a "system administrator" task.

Note that both soft limits and hard limits can be set by the admin. If they differ, the user's profile can use ulimit to raise a value above the soft limit, up to the hard limit. Otherwise the soft limit applies by default.
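To see the soft and hard values side by side, a small sketch like the one below may help. The function names show_limits and warn_unlimited are made up for illustration. Run it as the account that starts the application (ideally from the same startup script), since limits are per process and inherited from the shell that launches it.

```shell
# show_limits: print "<flag> <soft> <hard>" for core size (c), stack size (s)
# and open files (n) as seen by the current shell. Hypothetical helper.
show_limits() {
    for flag in c s n; do
        soft=$(ulimit -S -"$flag")   # soft limit: what processes get by default
        hard=$(ulimit -H -"$flag")   # hard limit: ceiling the soft limit may be raised to
        echo "$flag $soft $hard"
    done
}

# warn_unlimited: read show_limits output on stdin and name every soft limit
# that is currently "unlimited". Hypothetical helper.
warn_unlimited() {
    awk '$2 == "unlimited" { print "soft limit -" $1 " is unlimited" }'
}

# Typical use:
#   show_limits | warn_unlimited
```

If this flags -s after your startup script has run, that matches the "ulimit -s unlimited" setting and is a candidate for a finite value.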

You should really think about moving on to something newer (at least RHEL6). RHEL4 went EOL long ago. RHEL5 was EOL'd more than a year ago and was over 10 years old at that point. RHEL7 has been available for a couple of years and RHEL6 will be EOL'd within the next year or so.
 
Old 06-14-2018, 10:43 AM   #6
MensaWater
LQ Guru
 
Registered: May 2005
Location: Atlanta Georgia USA
Distribution: Redhat (RHEL), CentOS, Fedora, CoreOS, Debian, FreeBSD, HP-UX, Solaris, SCO
Posts: 7,406
Blog Entries: 15

Rep: Reputation: 1424
Quote:
Originally Posted by dema0024
So it turns out we also set ulimit -s for unlimited stack size. I think part of why we don't have useful logs of what was going on before the freeze is that our application was set to one of its lowest logging levels, so we are increasing the logging level in hopes of catching what causes it next time.
If you're hitting a limit, the application may not get a chance to log the issue because of that very limit, as you originally surmised.

You posted this while I was posting a follow up. See what I wrote there.

Ideally you should NOT have users setting "unlimited" for most things, because it causes issues like those you described. In my first post I suggested reviewing sar. It won't tell you when a limit is exceeded (or even reached), but by looking at its time samples you can see where things were growing rapidly and assume that they did in fact hit the limit.
 
Old 06-14-2018, 10:44 AM   #7
dema0024
LQ Newbie
 
Registered: Jun 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
So I ran sar on the file of the day of the freeze with -q:

07:20:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
07:30:02 AM         0       224      2.41      1.49      0.98
07:40:02 AM         0       226      0.55      0.40      0.59
07:50:02 AM         0       227      0.39      0.24      0.39
08:00:01 AM         0       231      0.14      0.21      0.30
08:10:01 AM         1       227      0.60      0.47      0.35
08:20:01 AM         1       228      0.58      0.62      0.48
08:30:01 AM         1       228      0.10      0.51      0.51
Average:            1       231      1.68      1.70      1.68

08:47:59 AM       LINUX RESTART

09:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:10:01 AM         0       245      3.49      3.03      1.96
09:20:02 AM         3       247      3.64      3.55      2.76
09:30:02 AM         0       245      3.18      3.30      3.02
09:40:02 AM         1       249      2.05      2.70      2.93
09:50:01 AM         1       249      2.34      2.56      2.75
10:00:01 AM         3       249      1.52      2.15      2.48
10:10:01 AM         1       257      2.06      1.70      2.02

and without -q:


08:00:02 PM       CPU     %user     %nice   %system   %iowait     %idle
08:10:02 PM       all      9.05      0.01      4.83      0.46     85.65
08:20:01 PM       all      8.70      0.01      4.66      0.39     86.25
08:30:01 PM       all      8.70      0.01      4.70      0.39     86.20
08:40:01 PM       all      8.73      0.01      4.70      0.41     86.15
08:50:02 PM       all      8.44      0.01      4.61      0.36     86.59
09:00:01 PM       all      8.43      0.01      4.70      0.60     86.26

I don't see any issues. We actually do use newer software now, but these are old 32-bit systems running RHEL 4.8 that we still need to support. Our new servers run either CentOS 6.5 or CentOS 7, depending on the software.
 
Old 06-14-2018, 12:05 PM   #8
MensaWater
LQ Guru
 
Registered: May 2005
Location: Atlanta Georgia USA
Distribution: Redhat (RHEL), CentOS, Fedora, CoreOS, Debian, FreeBSD, HP-UX, Solaris, SCO
Posts: 7,406
Blog Entries: 15

Rep: Reputation: 1424
Running "sar -A" for the day in question will show all the reports. If you run "man sar" you can see what each of the different reports represents. I was just using "-q" as an example. You might have filled the file table, the process table, or memory structures. Looking through sar output for things that show continual growth (or reduction, e.g. for free memory) is a good indicator of what resource may have been exhausted.
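Rather than eyeballing every report, one rough way to spot "continual growth or reduction" is to count how often a column rises versus falls across the day. This is a sketch only: sar_trend is a made-up name, and it expects the text of a single sar report (for example the output of "sar -q -f <file>") on stdin.

```shell
# sar_trend NAME: read one sar report on stdin, locate the column whose header
# is NAME, and summarise how its samples moved. Hypothetical helper.
sar_trend() {
    awk -v name="$1" '
    {
        # A header line is any line that contains NAME as a whole field
        for (i = 1; i <= NF; i++) if ($i == name) { col = i; next }
    }
    # Data line: not a summary row, and the chosen field is a plain number
    col && $1 != "Average:" && $col ~ /^[0-9]+(\.[0-9]+)?$/ {
        v = $col + 0
        if (n == 0) first = v
        else if (v > last) rises++
        else if (v < last) falls++
        last = v; n++
    }
    END {
        if (n) printf "%s: first=%s last=%s rises=%d falls=%d samples=%d\n", name, first, last, rises, falls, n
    }'
}

# Typical use:
#   sar -q -f <daily file> | sar_trend plist-sz
```

A resource that is leaking tends to show many rises, few or no falls, and a last value well above the first; a flat column shows roughly balanced counts.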
 
Old 06-14-2018, 12:51 PM   #9
dema0024
LQ Newbie
 
Registered: Jun 2018
Posts: 5

Original Poster
Rep: Reputation: Disabled
I have attached my sar -A log (in 3 parts). I looked through it and did not see anything ramping up or down much; most things look fairly constant. Nothing immediately stuck out to me as bad, but then again I don't understand what half of the entries are for. It also confuses me that some things seem to log all day, while others just end at 8:30 (I believe the reboot happened at 8:46, after the freeze). Why did some things not get logged after the freeze and reboot?

Does anyone have any idea if anything in these logs looks suspicious?
Attached Files
File Type: log sarPart1.log (241.6 KB, 2 views)
File Type: log sarPart2.log (237.0 KB, 1 views)
File Type: log sarPart3.log (236.3 KB, 2 views)
 
  


All times are GMT -5. The time now is 02:52 PM.
