[SOLVED] Server Freeze

dema0024 · 06-13-2018, 09:00 AM

I am working on two RHEL 4.8 servers that are running redundant software. Every few days one of the servers (whichever is active at the time, it seems) will freeze. When I say freeze I mean completely: I can't log in from the KVM, can't log in over SSH, and the only way to fix it is to hard reboot the server. There are no logs for the application, nor any entries in /var/log/messages, from when the freeze begins until the server is rebooted. You can see a gap of about 30 minutes in /var/log/messages, for example. Other than this, nothing in the application log or /var/log/messages before or after the freeze suggests any issue.

Usually if something happens to one server the other should automatically take over, as it has a floating IP. In the software logs of the one that takes over you can see it has slow and/or unresponsive pings to the other server, so it takes over. Unable to communicate with the other server, it goes from standby to standalone, and does not go to primary until the other system is rebooted. Unfortunately the web servers don't realise this and still attempt to connect users to the frozen server, so no one can log on until the frozen application server is restarted. This is obviously a big problem.

Unfortunately, with no logs to work with it is very hard to diagnose the cause of this freeze. Looking online I see many suggestions of memory issues, but what are the chances that both servers have bad memory? My next step is to upgrade the firmware on both servers (both HP ProLiant, I believe they are G8 servers), but if that doesn't work what should I try next? Is there any way to get any useful information from the system as to why it froze? If it matters, the application runs an Oracle database.

MensaWater · 06-13-2018, 01:51 PM

At a guess, some runaway process ate up all of a resource, and at that point there were no resources left to let you log in or even run a shutdown. The usual cause of this is something like open processes or files. If you have sar collections running you can look at past sar daily files with:
sar -f /var/log/sa/saDD
Where DD is a day of the month (01, 11, 21, 31 or any number between).

Examining the output might show you where something was steadily growing. I once did the above just for one subsection:
sar -f /var/log/sa/saDD -q
I saw that the number of open processes (plist-sz column) had grown steadily every 10 minutes until it reached over 32000 (presumably the limit was 32K, but sar never actually shows that number, probably because once the limit was hit sar couldn't run anymore).

This kind of unrestrained growth can occur if you set ulimit values to "unlimited".
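
As a rough sketch of the kind of check described above, the helper below summarises how plist-sz moved over a day of "sar -q" output. The function name plist_growth is made up for illustration, and it assumes the 12-hour column layout quoted later in this thread (time, AM/PM, runq-sz, plist-sz, load averages); a sar build that prints 24-hour times would need the column numbers shifted by one.

```shell
# Summarise plist-sz from `sar -q` output read on stdin.
# Assumed columns: time, AM/PM, runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15.
plist_growth() {
    awk '
        # Data rows only: a timestamp followed by numeric runq-sz and plist-sz.
        # Header, "Average:" and "LINUX RESTART" rows do not match.
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {
            if (n == 0) first = $4
            last = $4
            n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "samples=%d first=%d last=%d change=%d\n", n, first, last, last - first
        }
    '
}
```

It would be fed from the daily file, for example by piping "sar -q -f FILE" into plist_growth, where FILE is the sar file for the day in question. A change in the thousands over a day would point at a process leak; a change of a handful would not.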

dema0024 · 06-13-2018, 02:46 PM

Thanks for the quick reply. I am not sure if the system has sar on it, but I can take a look. If it doesn't, would the rpm be on the installation disc or would I have to find it online? As for ulimit, I am not a developer, but I do know we set ulimit -c unlimited, which is just the size of core files. As far as I know we don't set any other ulimits.

dema0024 · 06-14-2018, 10:22 AM

So it turns out we also set ulimit -s for unlimited stack size. I think part of why we don't have useful logs of what was going on before the freeze is that our application was set to one of its lowest logging levels, so we are increasing the logging level in hopes of catching what causes it next time.

MensaWater · 06-14-2018, 10:37 AM

I don't have RHEL4. On a RHEL5 system I see sar is part of the sysstat package. If you run "rpm -ql sysstat" you can see if it is on your RHEL4 as well.

If you run "ulimit -a" you'll see all the limits for the user you're logged in as.

In RHEL5 and later the limits are set in /etc/security/limits.conf (in RHEL6 there is also /etc/security/limits.d/<files> for some specific limits, most notably nproc, which will override limits.conf values). I *think* /etc/security/limits.conf was there in RHEL4.

Setting limits is not a "developer" task. It is a "system administrator" task.

Note that soft limits and hard limits can be set by the admin. If they are different, the user's profile can use ulimit to raise a value beyond the soft limit, up to the hard limit. Otherwise the soft limit applies by default.
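
To make the soft/hard distinction concrete, here is a small sketch using the shell's ulimit builtin with its -S (soft) and -H (hard) switches. The function names show_limit and core_soft_after_lowering are invented for this example; the option letters c, s and n (core size, stack size, open files) are the standard ones.

```shell
# Print the soft and hard value of one resource limit for the current shell.
# $1 is the ulimit option letter without the dash: c (core file size),
# s (stack size) or n (open files).
show_limit() {
    printf '%s soft=%s hard=%s\n' "$1" "$(ulimit -S -"$1")" "$(ulimit -H -"$1")"
}

# A user may always lower a soft limit, and may raise it again up to the
# hard limit. This runs in a subshell so the calling shell keeps its limits.
core_soft_after_lowering() {
    ( ulimit -S -c 0 && ulimit -S -c )
}
```

Running show_limit c and show_limit s as the application user would show whether the "unlimited" core and stack settings mentioned earlier are actually in effect for that account.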

You should really think about moving on to something newer (at least RHEL6). RHEL4 went EOL long ago. RHEL5 was EOL'd more than a year ago and was over 10 years old at that point. RHEL7 has been available for a couple of years and RHEL6 will be EOL'd within the next year or so.

MensaWater · 06-14-2018, 10:43 AM

Quote:

Originally Posted by dema0024

So it turns out we also set ulimit -s for unlimited stack size. I think part of why we don't have useful logs of what was going on before the freeze is that our application was set to one of its lowest logging levels, so we are increasing the logging level in hopes of catching what causes it next time.

If you're hitting a limit it may not have a chance to log the issue because of that limit as you originally surmised.

You posted this while I was posting a follow up. See what I wrote there.

Ideally you should NOT have users setting "unlimited" for most things, because it causes issues like those you described. In my first post I suggested reviewing sar. That won't tell you when a value exceeds (or even reaches) the limit, but by looking at its time samples you can see where things were growing rapidly and assume that they did in fact hit the limit.

dema0024 · 06-14-2018, 10:44 AM

So I ran sar on the file of the day of the freeze with -q:

07:20:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
07:30:02 AM 0 224 2.41 1.49 0.98
07:40:02 AM 0 226 0.55 0.40 0.59
07:50:02 AM 0 227 0.39 0.24 0.39
08:00:01 AM 0 231 0.14 0.21 0.30
08:10:01 AM 1 227 0.60 0.47 0.35
08:20:01 AM 1 228 0.58 0.62 0.48
08:30:01 AM 1 228 0.10 0.51 0.51
Average: 1 231 1.68 1.70 1.68

08:47:59 AM LINUX RESTART

09:00:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
09:10:01 AM 0 245 3.49 3.03 1.96
09:20:02 AM 3 247 3.64 3.55 2.76
09:30:02 AM 0 245 3.18 3.30 3.02
09:40:02 AM 1 249 2.05 2.70 2.93
09:50:01 AM 1 249 2.34 2.56 2.75
10:00:01 AM 3 249 1.52 2.15 2.48
10:10:01 AM 1 257 2.06 1.70 2.02

and without -q:

08:00:02 PM CPU %user %nice %system %iowait %idle
08:10:02 PM all 9.05 0.01 4.83 0.46 85.65
08:20:01 PM all 8.70 0.01 4.66 0.39 86.25
08:30:01 PM all 8.70 0.01 4.70 0.39 86.20
08:40:01 PM all 8.73 0.01 4.70 0.41 86.15
08:50:02 PM all 8.44 0.01 4.61 0.36 86.59
09:00:01 PM all 8.43 0.01 4.70 0.60 86.26

I don't see any issues. We actually do use newer software now, but these are old 32-bit systems running RHEL 4.8 that we still need to support. Our new servers run either CentOS 6.5 or CentOS 7, depending on the software.
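
One way to bracket the freeze window from sar output like the above is to look for consecutive timestamps that are further apart than the sampling interval. The sketch below does that; the name sar_gaps and the 900-second default are assumptions for illustration, and it only understands 12-hour timestamps within a single day.

```shell
# Read sar output (12-hour timestamps) on stdin and print every pair of
# consecutive timestamped lines more than $1 seconds apart (default 900,
# i.e. one 10-minute sample plus some slack).
sar_gaps() {
    awk -v max="${1:-900}" '
        function secs(t, ampm,    p, h) {
            split(t, p, ":")
            h = p[1] % 12                 # 12:xx AM is hour 0
            if (ampm == "PM") h += 12     # 12:xx PM is hour 12
            return h * 3600 + p[2] * 60 + p[3]
        }
        $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ && ($2 == "AM" || $2 == "PM") {
            now = secs($1, $2)
            if (seen && now - prev > max)
                printf "gap of %d s between %s and %s\n", now - prev, prevt, $1 " " $2
            prev = now; prevt = $1 " " $2; seen = 1
        }
    '
}
```

On the -q output above, the only interval it would flag is the one between the 08:30:01 AM sample and the 08:47:59 AM restart line, which is consistent with the freeze starting shortly after the 08:30 sample was taken.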

MensaWater · 06-14-2018, 12:05 PM

Running "sar -A" for the day in question will show all the reports. If you run "man sar" you can see what each of the different reports represents. I was just using "-q" as an example. You might have filled the file table, the process table, or memory structures. Looking through sar output for things that show continual growth (or reduction, e.g. for free memory) is a good indicator of what resource may have been exhausted.
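
For the file table specifically, the kernel exposes the live numbers in /proc/sys/fs/file-nr as three values: allocated handles, allocated-but-unused handles, and the maximum. The helper below is a minimal sketch that turns that line into a percentage; the name file_table_usage is made up for this example.

```shell
# Summarise kernel file-table usage from the three numbers found in
# /proc/sys/fs/file-nr: allocated, allocated-but-unused, maximum.
file_table_usage() {
    awk 'NF >= 3 && $3 > 0 {
        # %d truncates, so the percentage is rounded down.
        printf "allocated=%d max=%d used=%d%%\n", $1, $3, ($1 * 100) / $3
    }'
}
```

It reads the numbers on stdin, so it can be run as file_table_usage < /proc/sys/fs/file-nr, for instance from cron alongside the sar collection, to see whether the table creeps toward its maximum before a freeze.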

dema0024 · 06-14-2018, 12:51 PM

I have attached my sar -A log (in 3 parts). I looked through it and did not see anything ramping up or down much, most things look fairly constant. I didn't see anything that immediately stuck out to me as bad, but then again I don't understand what half of the entries are for. It also confuses me that some things seem to log all day, while others just end at 8:30 (the reboot happened at 8:46 I believe after the freeze), so why did some things not get logged after the reboot and freeze?

Does anyone have any idea if anything in these logs looks suspicious?

dema0024 · 07-19-2018, 03:34 PM

The freezes stopped breaking log files after the firmware update, but the system was still being held up. The customer mentioned that the last couple of times the freeze occurred were when he was cleaning up some files. It turns out the servers were freezing because the customer was attempting to manually delete Oracle FRA backups to free disk space. The problem was that he was doing this on both servers, including the primary database while it was running. The disk space issue arose because archive log mode was on in Oracle, which takes up a tremendous amount of space, so we turned it off and cleaned up the FRA folder. This makes the Oracle backups less up to date, but in reality we never use them anyway (it is faster to export the DB from the secondary and import it than to restore a backup). I believe the customer's own actions and the disk getting full between them account for all the unexpected fail-overs, but just to be sure I am going to observe it for a week. If the issue does not come back I will return and mark this thread as solved.

MensaWater · 07-20-2018, 07:21 AM

Thanks for updating.

So is this an Oracle RAC cluster you're talking about? You hadn't mentioned failover or FRA previously.

If it is RAC are you using OCFS filesystems? I'm assuming you're not using RAW devices since you indicate the customer was able to delete files.

You might want to think about setting up archive logs to backup independently and more often and reduce (in Oracle) the number of logs kept rather than shutting it off completely. DB recovery is easier with archive logs in the event of a crash.

dema0024 · 07-20-2018, 08:27 AM

I mentioned redundant software, and that it was supposed to fail over (and was not doing so correctly), in my original post. Our software uses Oracle as its underlying database. If Oracle or any other process fails, the other server is supposed to take over. In the case of the freezing, the failover was only partially happening (both servers thought they were standalone, so the web server was not failing over and clients could not log in). Now I know that the times it froze were likely due to someone attempting to delete FRA data on the primary server. My guess is that this locks up the server because Oracle is trying to access those files while they are being deleted. I also think it might freeze when the disk is full, as the database has nowhere to write anything. I believe these are the only causes of these unexpected freezes/failovers, but I am not sure, so I will wait several days to be sure the issue does not reoccur. I now believe this was mostly user error.
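
Since a full disk is one of the suspected triggers, a simple early warning could be built on "df -P", whose POSIX output has the capacity in column 5 and the mount point in column 6. This is only a sketch: the function name full_filesystems and the 90% default threshold are assumptions, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin and print every filesystem whose use is at
# or above the threshold percentage given as $1 (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%$/, "", pct)        # "96%" -> "96"
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}
```

Piping "df -P" into full_filesystems 85 from cron and mailing any output would give some notice before the archive logs fill the disk, rather than finding out from a freeze.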

dema0024 · 07-25-2018, 01:24 PM

The issue has reoccurred including the "freeze" of one of the servers (log did not record anything for the 20 minutes the system was frozen before a manual fail-over was performed). Checking crontab we noticed that the web servers, which don't really need to be backed up, both had our system backup script running on them, and that the database servers had both our quick and full backups scheduled, in addition to oracle backups, and that oracle and full backups were scheduled for the same time. We have turned off all backups on the web servers and the full backups on the DB servers (our quick backups are sufficient to rebuild the servers). So we will see if that resolves the issue. Will wait another week to see if it is resolved.

MadeInGermany · 07-26-2018, 11:00 AM

The nproc limit (in /etc/security/limits.conf) protects against a "fork bomb".

Code:

* soft nproc 16384

The nproc limit is per user.
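
To see how close each user is to such a limit, the per-user task count can be tallied. The sketch below counts lines per user name; the name count_per_user is invented for this example. It is meant to be fed from ps, and as far as I know the kernel counts threads toward nproc as well, so the thread-level listing is the more relevant input.

```shell
# Count how many input lines (one per process or thread) belong to each
# user name read on stdin, busiest user first. Output: "<user> <count>".
count_per_user() {
    sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}
```

A typical invocation would be ps -eLo user= | count_per_user, comparing the top entry against the nproc value in limits.conf.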

dema0024 · 08-03-2018, 12:15 PM

There have not been any more freezes since I turned off full backups (which are unnecessary and were scheduled at the same time as oracle backups) and it has been over a week so I will mark this as resolved. For reference here was the crontab on the servers when they were freezing:

0 1 * * * su - oracle /home/oracle/AmhsDbBackupDriver.sh
0 5 * * * /home/ubimex/bin/AmhsQuickBackupRoot.sh > /home/ubimex/logs/AmhsQuickBackup.log 2>&1
*/5 * * * * /root/AmhsForcePrinters.sh
0 1 * * * /home/ubimex/bin/AmhsSystemBackup.sh > /home/ubimex/logs/AmhsSystemBackup.log 2>&1
*/5 * * * * /home/ubimex/freememory.sh >>/home/ubimex/logs/freemem.log 2>&1
*/30 * * * * top -b -n1 >> /home/ubimex/logs/top.log 2>&1

You can see that at 1am every day there was a full backup of our software (which includes Oracle) and an Oracle backup scheduled at the same time.

And this is the new crontab:

0 1 * * * su - oracle /home/oracle/AmhsDbBackupDriver.sh
0 5 * * * /home/ubimex/bin/AmhsQuickBackupRoot.sh > /home/ubimex/logs/AmhsQuickBackup.log 2>&1
*/5 * * * * /root/AmhsForcePrinters.sh
*/5 * * * * /home/ubimex/freememory.sh >>/home/ubimex/logs/freemem.log 2>&1
*/30 * * * * top -b -n1 >> /home/ubimex/logs/top.log 2>&1

Since making this change and having the customer not try deleting fra files on an active primary server there have been no more freezes, so I suspect these were the only things causing the freezes.
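
An overlap like the 1am one in the old crontab can be spotted mechanically by grouping jobs on their five schedule fields. The following is a minimal sketch, with the function name cron_collisions made up for illustration. It only catches textually identical schedules, so it would not notice that a */5 job also fires at 01:00.

```shell
# Read crontab lines on stdin and print every schedule (the first five
# fields) shared by more than one job, followed by those jobs.
cron_collisions() {
    awk '
        /^[ \t]*#/ { next }           # skip comments
        NF < 6     { next }           # skip blanks and VAR=value lines
        {
            sched = $1 " " $2 " " $3 " " $4 " " $5
            cmd = $0
            # Strip the five schedule fields to leave the command.
            for (i = 1; i <= 5; i++) sub(/^[ \t]*[^ \t]+[ \t]+/, "", cmd)
            count[sched]++
            jobs[sched] = jobs[sched] "    " cmd "\n"
            if (count[sched] == 1) order[++n] = sched
        }
        END {
            for (i = 1; i <= n; i++)
                if (count[order[i]] > 1)
                    printf "%s (%d jobs)\n%s", order[i], count[order[i]], jobs[order[i]]
        }
    '
}
```

Run as crontab -l | cron_collisions, the old crontab would report "0 1 * * *" with both the Oracle backup and the full system backup listed under it, plus the two harmless */5 jobs; the new crontab would report only the */5 pair.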