LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 07-28-2008, 01:19 PM   #1
retrovertigo
Member
 
Registered: Jul 2007
Distribution: Arch Linux
Posts: 36

Rep: Reputation: 15
Server down for approx 14 hours, can't find anything useful in logs


I'm getting weird behavior on one of our servers, and I haven't been able to track down the root cause. The problem happens on Sundays, but not every Sunday. The server becomes unresponsive for 12 hours or more, and every log on the server has a gap that corresponds with the downtime. Looking at /var/log/syslog, it appears that the machine is rebooting itself several times.

Code:
root@bp-webportal:/var/log# egrep 'reboot|restart' syslog.0
Jul 27 06:25:58 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 06:47:03 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 13:23:14 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 13:23:20 bp-webportal /usr/sbin/cron[3857]: (CRON) INFO (Running @reboot jobs)
Jul 27 13:33:56 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 13:34:01 bp-webportal /usr/sbin/cron[3949]: (CRON) INFO (Running @reboot jobs)
Jul 27 14:01:32 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 14:01:38 bp-webportal /usr/sbin/cron[3955]: (CRON) INFO (Running @reboot jobs)
Jul 27 14:05:13 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 14:05:18 bp-webportal /usr/sbin/cron[3945]: (CRON) INFO (Running @reboot jobs)
Jul 27 14:18:15 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 27 14:19:06 bp-webportal /usr/sbin/cron[3987]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:34:35 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:34:41 bp-webportal /usr/sbin/cron[3881]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:40:16 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:40:21 bp-webportal /usr/sbin/cron[5541]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:43:02 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:43:07 bp-webportal /usr/sbin/cron[3953]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:45:15 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:45:20 bp-webportal /usr/sbin/cron[3976]: (CRON) INFO (Running @reboot jobs)
Jul 28 04:47:47 bp-webportal syslogd 1.4.1#20ubuntu4: restart.
Jul 28 04:47:52 bp-webportal /usr/sbin/cron[3957]: (CRON) INFO (Running @reboot jobs)
Notice the 14-hour gap between 14:19 on Jul 27 and 04:34 on Jul 28. The same gap exists in every log file I checked. Something appears to be overloading the system and causing it to reboot, but I can't tell what. The timing of these outages seems to be erratic as well: one week the outage happened at 2:30 am, another time at approx. 4:00 pm, and this time at around 2:20 pm. I've tried watching the server in the middle of the night to see if I could catch some script running wild with system resources, but I've never caught anything. Is there anything else I can do to investigate this issue?
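
Gaps like the one above can also be located mechanically instead of by eye. The following is only a sketch: the function name find_log_gaps and the one-hour default threshold are made up for illustration, and it assumes classic "Mon DD HH:MM:SS" syslog timestamps that all fall within one calendar year (no leap-day handling).

```shell
#!/bin/sh
# find_log_gaps FILE [MIN_GAP_SECONDS]
# Print every pair of consecutive log lines whose timestamps are more than
# MIN_GAP_SECONDS apart (default 3600).
# Assumptions: "Mon DD HH:MM:SS" timestamps, one calendar year, no leap day.
find_log_gaps() {
    awk -v min_gap="${2:-3600}" '
    BEGIN {
        # Days that precede each month in a non-leap year.
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
        split("0 31 59 90 120 151 181 212 243 273 304 334", before, " ")
        for (i = 1; i <= n; i++) offset[name[i]] = before[i]
    }
    ($1 in offset) {
        # Convert the timestamp to seconds since the start of the year.
        split($3, t, ":")
        now = ((offset[$1] + $2) * 24 + t[1]) * 3600 + t[2] * 60 + t[3]
        stamp = $1 " " $2 " " $3
        if (seen && now - prev > min_gap)
            printf "%d s gap: %s -> %s\n", now - prev, prev_stamp, stamp
        prev = now
        prev_stamp = stamp
        seen = 1
    }' "$1"
}
```

Run as `find_log_gaps /var/log/syslog.0`, this should report the jump from Jul 27 14:19:06 to Jul 28 04:34:35 as a gap of 51329 seconds, assuming no other lines were logged in between. Running it over several rotated logs would show whether earlier outages left the same signature.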


The server is running Ubuntu Feisty, and serves up a webapp running on Apache2 w/ PostgreSQL 8.2.

Last edited by retrovertigo; 07-28-2008 at 01:38 PM.
 
Old 07-28-2008, 08:21 PM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Interesting problem, especially with reboots in such short succession... When did this start to happen? Can you trace back what changed at that time? Can you check hardware conditions? No overheating? No filesystem problems? Does the server run some HW/SW watchdog? Is the hardware identical to another server that runs OK? What automated processes or cronjobs does the server run? Any apps that respond to logged strings? What monitoring do you run right now? Any chance of letting Atop, Dstat or Collectl run? Are the kernel and userland software identical to another server that runs OK? What services does this server provide exactly? What other processes does it run? Can you make the machine run syslog from init and make it log to a remote syslog server or to something like a serial console? Do the logs show any anomalies other than this?
 
Old 07-30-2008, 09:13 AM   #3
retrovertigo
Member
 
Registered: Jul 2007
Distribution: Arch Linux
Posts: 36

Original Poster
Rep: Reputation: 15
It began happening a couple of months ago. The last time it happened we feared power issues and moved the server to a separate UPS on a dedicated power line, away from where it was plugged in before. The problem went away for a couple of weekends, so we thought this had taken care of it, but it has now cropped up again.

There have been no recent hardware changes, and servers with similar hardware have not experienced any downtime. Overheating was initially a concern as our server closet is on the small side, but we addressed that by powering off several unneeded internal servers on the weekend when we're not in business.

The weekly cron jobs do run every Sunday morning at 6:47am. There are three of them:

1) man-db (manpage DB update)
2) popularity-contest (http://popcon.ubuntu.com/)
3) sysklogd (syslog rotation)

None of these would appear to be the problem as one of the outages occurred at approx. 2am on a Sunday morning, several hours before the weekly cronjobs were set to run. The only self-created cronjob is a small Perl script I wrote to handle the daily, weekly, and monthly local backups of the PGSQL database.

Unfortunately no other servers have identical software configurations. The server is a web portal for our company, and serves up a Perl-based webapp. The only additional services running are apache2 and postgresql.

I do not have experience using those monitoring tools. Can you recommend one of them? I'll study it in more depth and leave it running this Sunday.

Last edited by retrovertigo; 07-30-2008 at 09:15 AM.
 
Old 07-30-2008, 10:20 AM   #4
jantman
Member
 
Registered: Nov 2005
Location: New Jersey, USA
Distribution: SuSE
Posts: 492

Rep: Reputation: 31
Is there anything else running on an automated interval, like backups? How much disk space is used? Is it possible that something like a log rotate is sending it over the edge?

If you're worried about environmentals or power, I'd plug it into a UPS that supports monitoring (like an APC SmartUPS) and have that monitored remotely every few minutes. For environmentals, a real environmental sensor, or one of the Dallas 1-wire sensors, will give you a handle on temperature.

Most importantly, I can't stress enough the importance of a good host/network monitoring system, such as Nagios or Zenoss, in tracking down the root cause of the problem. You'll be able to monitor nearly every aspect of your system. In this case the most important seem to be hardware (SMART, as well as lm-sensors/onboard temps), system load and uptime, RAID health (if applicable), and *wink* logged-in users.
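
Until a full monitoring system is in place, the checks listed above can be approximated with a small script run from cron every few minutes. This is only a sketch, not a substitute for Nagios or Zenoss: the function names are made up, /dev/sda is an assumed device name, smartctl normally needs root, and any tool that is not installed is simply skipped.

```shell
#!/bin/sh
# Run a command only if it is installed, and label its output.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "--- $*"
        "$@" || echo "($1 exited with status $?)"
    fi
}

# health_snapshot LOGFILE [DISK]
# Append one timestamped snapshot of load, logged-in users, disk usage,
# temperatures and SMART health to LOGFILE.
# Assumption: DISK defaults to /dev/sda, which may not match your system.
health_snapshot() {
    log=$1
    disk=${2:-/dev/sda}
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        run_if_present uptime                 # load average and uptime
        run_if_present who                    # logged-in users
        run_if_present df -h                  # disk space per filesystem
        run_if_present sensors                # lm-sensors temperatures
        run_if_present smartctl -H "$disk"    # SMART overall health
    } >> "$log" 2>&1
    # Flush to disk so the last snapshot has a chance to survive a reboot.
    sync
}
```

The last snapshot written before an outage then shows the load, temperatures and disk usage shortly before the machine went down, which none of the existing logs capture.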

What type of hardware is this running? If it's Proliant or Sun, the management log should be able to give you helpful information.
 
Old 07-31-2008, 06:54 AM   #5
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
I think jantman has some excellent remarks there and you should go with those.

In short: 'atop' is like 'top', except that I find it easier to save its output, "step" through saved records and grep for terms. 'dstat' is like 'sar -A'. I have no experience with collectl, but it's somewhat similar to Dstat. I'd use Atop plus Dstat, but it depends on how you want to look for info. You could for instance make Dstat save to CSV, graph it to look for spikes, and then "zoom in" on a certain period using 'atop -r savedlog -b start:time -e end:time'.
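
The Atop plus Dstat workflow described above could be set up roughly as follows. Treat it as a sketch: the file names, the 60-second interval and the two function names are assumptions, and the options are worth checking against the man pages of the versions packaged for your release.

```shell
#!/bin/sh
# Start both recorders in the background, one sample per minute.
# File names and the interval are assumptions; adjust to taste.
start_recorders() {
    # atop -w writes raw samples to a file that can be replayed later.
    atop -w /var/log/atop_weekend.raw 60 &
    # dstat --output also writes the counters it prints as CSV for graphing.
    dstat -tclmdngy --output /var/log/dstat_weekend.csv 60 >/dev/null &
}

# replay_window RAWFILE HH:MM HH:MM
# Replay one time window from a saved atop file, e.g. around a CSV spike.
replay_window() {
    if [ "$#" -ne 3 ]; then
        echo "usage: replay_window RAWFILE HH:MM HH:MM" >&2
        return 2
    fi
    atop -r "$1" -b "$2" -e "$3"
}
```

After an outage, something like `replay_window /var/log/atop_weekend.raw 14:00 14:30` would step through the samples recorded in the half hour around the failure.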

The "problem" is there's no indication of anything, so saving SAR info is just something of a hedge. Since you've got stability problems I'd first look at replacing the webserver (if possible) with a stable machine and *then* troubleshoot it.
 
  


All times are GMT -5. The time now is 08:34 PM.
