LinuxQuestions.org
Old 01-28-2011, 03:31 AM   #1
sneakyimp
Senior Member
 
Registered: Dec 2004
Posts: 1,056

Rep: Reputation: 78
server crashes nightly -- locating the culprit process


I'm trying to determine why a server is crashing every night. Looking at the output of top right around crash time (which, unfortunately, is never consistent), I see load averages from 3 to 6. What's puzzling me is the presence of some items in the process list which appear to be chewing resources.

In particular, the 'gzip' process (with no visible args) appears to be devouring enormous CPU resources. There's also a gtar process that's taking a very long time.
Code:
top - 02:00:31 up 15:42,  1 user,  load average: 2.99, 2.26, 2.48
Tasks: 127 total,   3 running, 124 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.0%us,  1.8%sy,  2.5%ni, 23.3%id, 62.2%wa,  0.2%hi,  0.0%si,  0.0%st
Mem:   4046580k total,  4021396k used,    25184k free,    18272k buffers
Swap:  2104504k total,      120k used,  2104384k free,  3085548k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
28963 mysql     15   0  635m 352m 4452 S  6.3  8.9  21:17.16 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/server.opreum.com.pid --skip-external-locking
 8902 root      34  19  4156  728  352 R  5.0  0.0  11:41.62 gzip
  502 root      10  -5     0    0    0 D  0.3  0.0   0:38.76 [kjournald]
  258 root      10  -5     0    0    0 S  0.3  0.0   0:20.30 [kswapd0]
23733 root      34  18 51436 9952 1636 S  0.0  0.2   0:15.82 /scripts/cpbackup
 4571 nobody    18   0  417m  13m 6416 S  0.3  0.4   0:11.59 /usr/local/apache/bin/httpd -k start -DSSL
 4574 nobody    18   0  417m  14m 6416 S  0.0  0.4   0:11.56 /usr/local/apache/bin/httpd -k start -DSSL
 4578 nobody    22   0  417m  14m 6420 S  0.7  0.4   0:11.56 /usr/local/apache/bin/httpd -k start -DSSL
 4576 nobody    18   0  353m  14m 6420 S  0.3  0.4   0:11.53 /usr/local/apache/bin/httpd -k start -DSSL
 4579 nobody    23   0  417m  14m 6416 S  0.0  0.4   0:11.51 /usr/local/apache/bin/httpd -k start -DSSL
 8901 root      34  19 21024 1020  840 R  0.0  0.0   0:09.38 /bin/gtar pczf siteuser.tar.gz siteuser
 2613 root      15   0     0    0    0 D  0.0  0.0   0:04.16 [pdflush]
 4394 root      18   0  194m 133m 6424 S  0.0  3.4   0:03.40 /usr/sbin/clamd
 3237 named     21   0  161m 4672 1952 S  0.0  0.1   0:02.71 /usr/sbin/named -u named
What is up with the gzip?

The process /bin/gtar pczf siteuser.tar.gz siteuser appears to be running as root, but I don't see it in the root-level crontab. Perhaps it's forked by some backup script? How do I locate the source of this process?


In this snapshot, notice the transient php scripts such as "/usr/bin/php /home/siteuser/public_html/client/vehicle_detail.php":
Code:
root@server [/home/siteuser]# top
top - 02:19:54 up 16:02,  1 user,  load average: 3.06, 2.86, 2.84
Mem:   4046580k total,  4017556k used,    29024k free,    15672k buffers
Swap:  2104504k total,      104k used,  2104400k free,  3143504k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8902 root      35  19  4156  728  352 R 39.8  0.0  25:18.24 gzip
14449 siteuser   17   0  132m  12m 7012 D  3.0  0.3   0:00.03 /usr/bin/php /home/siteuser/public_html/client/vehicle_detail.php
14450 siteuser   17   0     0    0    0 Z  3.0  0.0   0:00.03 [php] <defunct>
14451 siteuser   17   0     0    0    0 Z  3.0  0.0   0:00.03 [php] <defunct>
  502 root      10  -5     0    0    0 D  1.0  0.0   0:44.75 [kjournald]
 4574 nobody    18   0  419m  16m 6416 S  1.0  0.4   0:14.65 /usr/local/apache/bin/httpd -k start -DSSL
28963 mysql     15   0  636m 353m 4452 S  1.0  9.0  22:00.43 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/server.opreum.com.pid --skip-external-locking
I've always been under the impression that PHP runs as a module within the apache process. Is there some configuration that would cause PHP to run as its own process?

Any help would be much appreciated.
 
Old 01-28-2011, 04:47 AM   #2
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 17,213

Rep: Reputation: 2539
Is it running into issues zipping up your logs? You can check by file date/time.

A few minutes with man find might allow you to see what is going on, e.g.
date (set date/time to 5 minutes after the crash)
find -atime -ctime [clever options]
date (set time to normal) :-D.

This should narrow your search.
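A variant of the same idea that avoids touching the clock on a live box: GNU find can filter by minutes directly, so you can run it right after a crash instead of resetting the date (the paths here are just examples; point it wherever your backups land):

```shell
# Files modified in the last 10 minutes (-mmin counts minutes,
# unlike -atime/-ctime which count whole days)
find /backup /home -mmin -10 -type f 2>/dev/null

# Files whose inode changed (created, chmod'ed, renamed) in the
# last 10 minutes
find /backup /home -cmin -10 -type f 2>/dev/null
```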
 
Old 01-28-2011, 04:00 PM   #3
sneakyimp
Senior Member
 
Registered: Dec 2004
Posts: 1,056

Original Poster
Rep: Reputation: 78
Thanks for your response.

I do believe that the backup scripts are bringing this machine to its knees on a nightly basis -- numerous backup scripts zipping enormous amounts of data to the same hard drive that runs the OS and contains the database, etc. The problem is that I can't find the source of these backups. I suspect they are cron jobs but haven't been able to locate them all or match up specific cron jobs with specific greedy processes.

Thanks for the man tip. I'm somewhat familiar with the find command, but I'm not sure exactly what you're proposing. As I mentioned, I need to find the ultimate source of these processes rather than their output files. Also, this is a *production* server which no doubt has some date-dependent functions, so I'm reluctant to change the date on it. The only thing I can think to do is go eyeball all the cron jobs and look inside them for tar or gzip commands.
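That eyeballing can be done in bulk with grep (these are the usual crontab locations on a RHEL/cPanel-style box; adjust for your system):

```shell
# Search the system crontab, the cron.* drop-in directories, and
# every per-user crontab for archive commands in one pass
grep -rn 'gzip\|gtar\|tar ' /etc/crontab /etc/cron.* /var/spool/cron 2>/dev/null
```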
 
Old 01-28-2011, 09:53 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,314

Rep: Reputation: 4172
Quote:
Originally Posted by sneakyimp View Post
-- numerous backup scripts zipping enormous amounts of data to the same hard drive that is running the OS and which contains the database, etc.
Seriously bad configuration.
As it's a server, do you have auditd running? It should be able to tell you. You can turn PPID on in top and see if it shows anything useful; ps can export the same data.
If nothing else works, set up a wrapper around the command(s) of interest and spit out a message identifying the caller.
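A rough sketch of both ideas, assuming the culprit is the gzip seen in top (the wrapper part is commented out and its paths are illustrative -- adapt before using on a real system):

```shell
# 1. For each running gzip, print its parent PID and the parent's
#    full command line -- this usually names the backup script.
for pid in $(pgrep -x gzip); do
    ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    echo "gzip pid=$pid parent=$ppid: $(ps -o args= -p "$ppid")"
done

# 2. Wrapper approach: move the real binary aside and log every
#    caller before exec'ing it. Sketch only -- do not paste blindly.
# mv /bin/gzip /bin/gzip.real
# cat > /bin/gzip <<'EOF'
# #!/bin/sh
# echo "$(date) gzip called by PPID=$PPID: $(ps -o args= -p $PPID)" \
#     >> /var/log/gzip-callers.log
# exec /bin/gzip.real "$@"
# EOF
# chmod 755 /bin/gzip
```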
 
Old 02-03-2011, 07:58 PM   #5
sneakyimp
Senior Member
 
Registered: Dec 2004
Posts: 1,056

Original Poster
Rep: Reputation: 78
Thanks for the responses here. It may have been a bad idea, but we sent the cron job issue back to the tech support crew at the hosting company as it was they who screwed it up in the first place. Looks like we'll be moving the server elsewhere eventually, so we're limping by in the meantime.

Syg00:
Although I don't know what it does, auditd does appear to be running:
Code:
root@server [~]# ps -aux | grep auditd
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root       560  0.0  0.0      0     0 ?        S<   Jan28   0:00 [kauditd]
root      3132  0.0  0.0  92888   928 ?        S<sl Jan28   0:10 auditd
root     31817  0.0  0.0  61160   716 pts/0    D+   18:55   0:00 grep auditd
Any tips on how to get info from it?

PPID? Do you mean process IDs? I have turned on the command-line column, which didn't look too useful for tracing the origin.

What do you mean by 'set up a wrapper around the commands of interest' ?
 
Old 02-03-2011, 08:40 PM   #6
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,196

Rep: Reputation: 1044
A loaded server becomes slow, but it does not crash. Not Linux, anyway.

If the archive processes crash the server, it could be because of a lack of disk space. Are you low on free space?

You have to start somewhere. Does the machine crash at exactly the same time? You could start a top command in batch mode, running every second, and pipe the output into a file. Once the server crashes, the top command will have stopped as well, and you can do a post-mortem on the file to see whether any processes were using excessive resources.
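That batch-mode logging might look like this (the log path is just an example):

```shell
# Append a full top snapshot every second until killed; after a
# crash, the tail of the file is a post-mortem of the last seconds.
nohup top -b -d 1 >> /var/log/top-watch.log 2>&1 &

# After the reboot, inspect the final snapshots:
tail -n 200 /var/log/top-watch.log
```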

jlinkels
 
Old 02-04-2011, 03:42 AM   #7
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 17,213

Rep: Reputation: 2539
Sorry for being unclear. I was suggesting setting the time to just after the crash, and then running find -atime and find -ctime just after crash time so you could see the very recently accessed and created files. But as you say, if it's online, setting the time is unwise.
 
Old 02-04-2011, 09:56 AM   #8
sneakyimp
Senior Member
 
Registered: Dec 2004
Posts: 1,056

Original Poster
Rep: Reputation: 78
Quote:
Originally Posted by jlinkels View Post
A loaded server becomes slow, but it does not crash. Not Linux, anyway.

If the archive processes crash the server, it could be because of a lack of disk space. Are you low on free space?

You have to start somewhere. Does the machine crash at exactly the same time? You could start a top command in batch mode, running every second, and pipe the output into a file. Once the server crashes, the top command will have stopped as well, and you can do a post-mortem on the file to see whether any processes were using excessive resources.

jlinkels
A well-configured Linux server doesn't crash, but one configured by a hack may crash under adverse circumstances.

The hard drive is about 50% full. That should be enough to keep trudging along. I've instructed them to get another hard drive which, amazingly, is going to require a different machine due to lack of hard drive space in the chassis. Amazing.

The machine does not always crash at exactly the same time but closely enough (early AM) that we suspect the backup processes. I learned that there were a number of different (paranoid) backups that were trying to GZIP dozens of GB of images from one place on the hard drive to another for some inexplicable reason. It was overwhelming both the CPU and the hard drive for hours at a time. I have tried my best to put an end to that backup nonsense and the server has now been up about a week with no crashes AFAIK.

If the problems continue, the top output in batch mode sounds pretty good. I wish there were some way to locate the origin point of a given process, though (e.g., launched from a cron job, an apache process, etc.). That would make life so much easier.
 
Old 02-04-2011, 09:58 AM   #9
sneakyimp
Senior Member
 
Registered: Dec 2004
Posts: 1,056

Original Poster
Rep: Reputation: 78
Quote:
Originally Posted by business_kid View Post
Sorry for being unclear. I was suggesting setting the time to just after the crash, and then running find -atime and find -ctime just after crash time so you could see the very recently accessed and created files. But as you say, if it's online, setting the time is unwise.
I appreciate your suggestion, but yes the server is a production server. The ill-advised backup procedures were not initially a problem because the site had few files. As image files have been uploaded, it has grown too large so the backup processes have become onerous.

Also, knowing what the output files are is not nearly as helpful as knowing their provenance! I want to know who spawns the processes that are chewing up resources.
 
Old 02-04-2011, 10:17 AM   #10
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,196

Rep: Reputation: 1044
Quote:
Originally Posted by sneakyimp View Post
If the problems continue, the top output in batch mode sounds pretty good. I wish there was some way to locate the origin point of a given process though (e.g, launched from a cron job, an apache process, etc). That would make life so much easier.
You can add the 'b' (PPID) field in top to display the parent process. You can also write a simple script that calls ps aux every second and pipes that output into a file. You could even grep for the gzip processes, get the PID, and inspect /proc/&lt;pid&gt;/status (its PPid line names the parent).

I also remember now there is a command called pstree which shows the complete process tree. Maybe that sheds some light.
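For example, assuming the gzip from the earlier snapshots is still running, its whole ancestry can be walked like this:

```shell
# The full process tree with PIDs; the chain above gzip shows who
# launched it (crond -> cpbackup -> gtar -> gzip, for instance)
pstree -p

# Or walk one suspect PID's lineage by hand via ps:
pid=$(pgrep -x gzip | head -1)
while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
    ps -o pid=,ppid=,args= -p "$pid"
    pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
done
```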

jlinkels
 
Old 02-04-2011, 11:21 AM   #11
sneakyimp
Senior Member
 
Registered: Dec 2004
Posts: 1,056

Original Poster
Rep: Reputation: 78
Thanks for the tips, jlinkels. I'll be tinkering with those commands when I get a chance.

BTW, did you ever get the function buttons or the bluetooth running on your wife's eeePC?
 