Server down for approx 14 hours, can't find anything useful in logs

retrovertigo · 07-28-2008, 01:19 PM

I'm getting weird behavior on one of our servers, and I haven't been able to track down the root cause. This problem happens on Sundays, but does not happen every Sunday. The server will become unresponsive for 12 hours or more, and there will be a gap in all of the server's logs that corresponds with the downtime. Looking at /var/log/syslog, it appears that the machine is rebooting itself several times.

Code:

root@bp-webportal:/var/log# egrep 'reboot|restart' syslog.0
Jul 27 06:25:58 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 06:47:03 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 13:23:14 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 13:23:20 bp-webportal /usr/sbin/cron[3857]: (CRON) INFO (Running @reboot jobs)
Jul 27 13:33:56 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 13:34:01 bp-webportal /usr/sbin/cron[3949]: (CRON) INFO (Running @reboot jobs)
Jul 27 14:01:32 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 14:01:38 bp-webportal /usr/sbin/cron[3955]: (CRON) INFO (Running @reboot jobs)
Jul 27 14:05:13 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 14:05:18 bp-webportal /usr/sbin/cron[3945]: (CRON) INFO (Running @reboot jobs)
Jul 27 14:18:15 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 14:19:06 bp-webportal /usr/sbin/cron[3987]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:34:35 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:34:41 bp-webportal /usr/sbin/cron[3881]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:40:16 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:40:21 bp-webportal /usr/sbin/cron[5541]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:43:02 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:43:07 bp-webportal /usr/sbin/cron[3953]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:45:15 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:45:20 bp-webportal /usr/sbin/cron[3976]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:47:47 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:47:52 bp-webportal /usr/sbin/cron[3957]: (CRON) INFO (Running @reboot jobs)

Notice the 14-hour gap there. This same gap exists in all the log files I checked. Something appears to be overloading the system and causing it to reboot, but I can't tell what. The timing of these outages seems to be erratic as well. One week the outage happened at 2:30 am, another time at approx 4:00 pm, and this time at around 2:20 pm. I've tried watching the server during the middle of the night to see if I could catch some script running wild with system resources, but I've never caught anything. Is there anything else I can be doing in order to investigate this issue?

The server is running Ubuntu Feisty, and serves up a webapp running on Apache2 w/ PostgreSQL 8.2.
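
The gap hunt can also be scripted rather than eyeballed. Below is a minimal Python sketch, assuming a Python 3 interpreter is available somewhere the log can be copied to, an English/C locale for month names, and the classic "Mon DD HH:MM:SS" stamp seen in the listing above. The function names and the one-hour threshold are made up for illustration, and the year has to be supplied because these stamps do not carry one.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year=2008):
    """Parse the leading 'Jul 27 06:25:58' stamp of a classic syslog line."""
    # The stamp has no year, so the caller's assumed year is prepended.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, threshold=timedelta(hours=1), year=2008):
    """Return (last_seen, next_seen) pairs that are further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        try:
            current = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines that do not start with a parsable stamp
        # A log that crosses New Year would yield a negative difference here
        # and is simply not reported by this sketch.
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

It is probably best run over the full syslog.0 rather than the egrep output, since filtering removes the ordinary entries between matches and makes quiet periods look like outages. Something like `with open("/var/log/syslog.0", errors="replace") as f: print(find_gaps(f))` would list each suspicious interval.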

unSpawn · 07-28-2008, 08:21 PM

Interesting problem, especially with reboots in such short succession... When did this start to happen? Can you trace back what changed at that time? Can you check hardware conditions? No overheating? No filesystem problems? Does the server run some HW/SW watchdog? Is the hardware identical to another server that runs OK? What automated processes or cronjobs does the server run? Any apps that respond to logged strings? What monitoring do you run right now? Any chance of letting Atop, Dstat or Collectl run? Are the kernel and userland software identical to those on another server that runs OK? What services does this server provide exactly? What other processes does it run? Can you make the machine run syslog from init and make it log to a remote syslog server or to something like a serial console? Do the logs show any anomalies other than this?

retrovertigo · 07-30-2008, 09:13 AM

It began happening a couple of months ago. The last time it happened we feared power issues and moved the server to a separate UPS on a dedicated power line, away from where it was plugged in before. The problem went away for a couple of weekends, so we thought this had taken care of it; however, the problem has now cropped up again.

There have been no recent hardware changes, and servers with similar hardware have not experienced any downtime. Overheating was initially a concern as our server closet is on the small side, but we addressed that by powering off several unneeded internal servers on the weekend when we're not in business.

The weekly cron jobs do run every Sunday morning at 6:47am. There are three of them:

1) man-db (manpage DB update)
2) popularity-contest (http://popcon.ubuntu.com/)
3) sysklogd (syslog rotation)

None of these would appear to be the problem as one of the outages occurred at approx. 2am on a Sunday morning, several hours before the weekly cronjobs were set to run. The only self-created cronjob is a small Perl script I wrote to handle the daily, weekly, and monthly local backups of the PGSQL database.

Unfortunately no other servers have identical software configurations. The server is a web portal for our company, and serves up a Perl-based webapp. The only additional services running are apache2 and postgresql.

I do not have experience using those monitoring tools. Can you recommend one of them? I'll study it more in-depth and leave it running this Sunday.

jantman · 07-30-2008, 10:20 AM

Is there anything else running on an automated interval, like backups? How much disk space is used (is it possible that there's something like a log rotate that's sending it over the edge?)
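
A quick way to rule the disk-space theory in or out is to log usage just before the rotation and backup jobs fire. This is only a rough Python sketch, assuming Python 3 on the box; the helper names and the 90% limit are arbitrary choices for illustration, and the figure it prints can differ slightly from what `df` reports because of reserved blocks.

```python
import shutil


def usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a zero size
    return 100.0 * usage.used / usage.total


def nearly_full(paths, limit=90.0):
    """Return (path, percent) for every path at or above `limit` percent."""
    flagged = []
    for path in paths:
        percent = usage_percent(path)
        if percent >= limit:
            flagged.append((path, round(percent, 1)))
    return flagged
```

Calling `nearly_full(["/", "/var"])` from a cron job shortly before the weekly jobs would show whether a filesystem is close to the edge at that point.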

If you're worried about environmentals or power, I'd plug it into a UPS that supports monitoring (like an APC SmartUPS) and have that monitored remotely every few minutes. For environmentals, a real environmental sensor, or one of the Dallas 1-wire sensors, will be able to give you a handle on temperature.

Most importantly, I can't begin to stress the importance of a good host/network monitoring system, such as Nagios or Zenoss, in tracking down the root cause of the problem. You'll be able to monitor nearly every aspect of your system. In this case, the most important seem to be hardware (SMART, as well as lmsensors/onboard temps), system load and uptime, RAID health (if applicable), and *wink* logged-in users.
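
Until a proper monitoring system is in place, even a crude sampler for load and uptime can narrow down when the box actually goes away. The following is a stopgap Python sketch, not a substitute for Nagios or Zenoss; it assumes Python 3 and a Linux /proc filesystem, and the function names are hypothetical. A reboot shows up as the uptime counter dropping between two consecutive samples.

```python
import os
import time


def read_uptime(path="/proc/uptime"):
    """Seconds since boot, taken from the first field of /proc/uptime."""
    with open(path) as handle:
        return float(handle.read().split()[0])


def take_sample():
    """Return (unix_time, uptime_seconds, one_minute_load) for right now."""
    return (time.time(), read_uptime(), os.getloadavg()[0])


def find_reboots(samples):
    """Given (time, uptime) pairs in order, return the (before, after) times
    between which the uptime counter went backwards."""
    reboots = []
    for (prev_time, prev_up), (cur_time, cur_up) in zip(samples, samples[1:]):
        if cur_up < prev_up:
            reboots.append((prev_time, cur_time))
    return reboots
```

Appending the result of `take_sample()` to a file every minute from cron, ideally on a different machine or a remote mount, would leave a record of the load just before each drop.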

What type of hardware is this running? If it's Proliant or Sun, the management log should be able to give you helpful information.

unSpawn · 07-31-2008, 06:54 AM

I think jantman has some excellent remarks there and you should go with those.

In short: 'atop' is like 'top', except I find it easier to save output, "step" through saved records, and grep for terms. 'dstat' is like 'sar -A'. I have no experience with collectl, but it's somewhat similar to Dstat. I'd use Atop plus Dstat, but it depends on how you want to look for info. You could, for instance, make Dstat save to CSV, graph it looking for spikes, and then "zoom in" on a certain period using 'atop -r savedlog -b start:time -e end:time'.
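
If graphing is too much hassle, the spike hunt over a saved CSV can be roughed out in a few lines. This is a minimal Python sketch under stated assumptions: the CSV has been reduced to a single header row followed by data rows (any preamble removed by hand), the column names "time" and "load1" are hypothetical, and "spike" is arbitrarily defined as more than three times the median of the column.

```python
import csv
import statistics


def load_rows(lines):
    """Read CSV text (a file object or a list of lines) into a list of dicts."""
    return list(csv.DictReader(lines))


def find_spikes(rows, column, factor=3.0):
    """Return the rows whose `column` value exceeds `factor` times the median."""
    values = [float(row[column]) for row in rows]
    if not values:
        return []
    threshold = factor * statistics.median(values)
    return [row for row, value in zip(rows, values) if value > threshold]
```

The timestamps of the rows it returns are the periods worth replaying in atop.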

The "problem" is there's no indication of anything, so saving SAR info is just something of a hedge. Since you've got stability problems I'd first look at replacing the webserver (if possible) with a stable machine and *then* troubleshoot it.