01-28-2011, 03:31 AM | #1 | sneakyimp | Senior Member | Registered: Dec 2004 | Posts: 1,056
server crashes nightly -- locating the culprit process
I'm trying to determine why a server is crashing every night. Looking at the output of top right around crash time (which, unfortunately, is never consistent), I see load averages from 3 to 6. What's puzzling me is the presence of some items in the process list that appear to be chewing up resources.
In particular, a gzip process (with no visible args) appears to be devouring enormous CPU resources. There's also a gtar process that's taking a very long time.
Code:
top - 02:00:31 up 15:42, 1 user, load average: 2.99, 2.26, 2.48
Tasks: 127 total, 3 running, 124 sleeping, 0 stopped, 0 zombie
Cpu(s): 10.0%us, 1.8%sy, 2.5%ni, 23.3%id, 62.2%wa, 0.2%hi, 0.0%si, 0.0%st
Mem: 4046580k total, 4021396k used, 25184k free, 18272k buffers
Swap: 2104504k total, 120k used, 2104384k free, 3085548k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28963 mysql 15 0 635m 352m 4452 S 6.3 8.9 21:17.16 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/server.opreum.com.pid --skip-external-locking
8902 root 34 19 4156 728 352 R 5.0 0.0 11:41.62 gzip
502 root 10 -5 0 0 0 D 0.3 0.0 0:38.76 [kjournald]
258 root 10 -5 0 0 0 S 0.3 0.0 0:20.30 [kswapd0]
23733 root 34 18 51436 9952 1636 S 0.0 0.2 0:15.82 /scripts/cpbackup
4571 nobody 18 0 417m 13m 6416 S 0.3 0.4 0:11.59 /usr/local/apache/bin/httpd -k start -DSSL
4574 nobody 18 0 417m 14m 6416 S 0.0 0.4 0:11.56 /usr/local/apache/bin/httpd -k start -DSSL
4578 nobody 22 0 417m 14m 6420 S 0.7 0.4 0:11.56 /usr/local/apache/bin/httpd -k start -DSSL
4576 nobody 18 0 353m 14m 6420 S 0.3 0.4 0:11.53 /usr/local/apache/bin/httpd -k start -DSSL
4579 nobody 23 0 417m 14m 6416 S 0.0 0.4 0:11.51 /usr/local/apache/bin/httpd -k start -DSSL
8901 root 34 19 21024 1020 840 R 0.0 0.0 0:09.38 /bin/gtar pczf siteuser.tar.gz siteuser
2613 root 15 0 0 0 0 D 0.0 0.0 0:04.16 [pdflush]
4394 root 18 0 194m 133m 6424 S 0.0 3.4 0:03.40 /usr/sbin/clamd
3237 named 21 0 161m 4672 1952 S 0.0 0.1 0:02.71 /usr/sbin/named -u named
What is up with the gzip?
The process /bin/gtar pczf siteuser.tar.gz siteuser appears to be running as root, but I don't see it in the root-level crontab. Perhaps it's forked by some backup script? How do I locate the source of this process?
In this snapshot, notice the transient php scripts such as "/usr/bin/php /home/siteuser/public_html/client/vehicle_detail.php":
Code:
root@server [/home/siteuser]# top
top - 02:19:54 up 16:02, 1 user, load average: 3.06, 2.86, 2.84
Mem: 4046580k total, 4017556k used, 29024k free, 15672k buffers
Swap: 2104504k total, 104k used, 2104400k free, 3143504k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8902 root 35 19 4156 728 352 R 39.8 0.0 25:18.24 gzip
14449 siteuser 17 0 132m 12m 7012 D 3.0 0.3 0:00.03 /usr/bin/php /home/siteuser/public_html/client/vehicle_detail.php
14450 siteuser 17 0 0 0 0 Z 3.0 0.0 0:00.03 [php] <defunct>
14451 siteuser 17 0 0 0 0 Z 3.0 0.0 0:00.03 [php] <defunct>
502 root 10 -5 0 0 0 D 1.0 0.0 0:44.75 [kjournald]
4574 nobody 18 0 419m 16m 6416 S 1.0 0.4 0:14.65 /usr/local/apache/bin/httpd -k start -DSSL
28963 mysql 15 0 636m 353m 4452 S 1.0 9.0 22:00.43 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=mysql --pid-file=/var/lib/mysql/server.opreum.com.pid --skip-external-locking
I've always been under the impression that PHP runs as a module within the apache process. Is there some configuration that would cause PHP to run as its own process?
Any help would be much appreciated.
01-28-2011, 04:47 AM | #2 | business_kid | LQ Guru | Registered: Jan 2006 | Location: Ireland | Distribution: Slackware, Slarm64 & Android | Posts: 17,213
Is it running into issues zipping up your logs? You can check by file date/time.
A few minutes with man find might show you what is going on, e.g.:
date (set the date/time to 5 minutes after the crash)
find -atime/-ctime [clever options]
date (set the time back to normal) :-D
This should narrow your search.
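If you'd rather not touch the clock on a live box, find also has minute-granularity tests; a rough sketch (the path and the 10-minute window are just examples):
Code:
# List files accessed or changed within the last 10 minutes,
# without altering the system time (-amin/-cmin take minutes,
# -atime/-ctime take days).
find / -xdev \( -amin -10 -o -cmin -10 \) -ls 2>/dev/null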
01-28-2011, 04:00 PM | #3 | sneakyimp | Senior Member | Registered: Dec 2004 | Posts: 1,056 | Original Poster
Thanks for your response.
I do believe that the backup scripts are bringing this machine to its knees on a nightly basis -- numerous backup scripts zipping enormous amounts of data to the same hard drive that is running the OS and which contains the database, etc. The problem is that I can't find the source of these backups. I suspect they are cron jobs but haven't been able to locate them all or match up specific cron jobs with specific greedy processes.
Thanks for the man tip. I'm somewhat familiar with the find command, but I'm not sure exactly what you're proposing. As I mentioned, I need to find the ultimate source of these processes rather than their output files. Also, this is a *production* server which no doubt has some date-dependent functions, so I'm reluctant to change the date on it. The only thing I can think to do is go eyeball all the cron jobs and look inside them for tar or gzip commands.
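For the eyeballing, perhaps something like this (the cron paths are my guesses for the usual locations on this box):
Code:
# Search the likely crontab locations for tar/gzip invocations.
# The pattern is loose (it will also hit words containing 'tar'),
# so eyeball the results.
grep -rnE 'gzip|gtar|tar ' /etc/crontab /etc/cron.d /etc/cron.hourly \
    /etc/cron.daily /var/spool/cron 2>/dev/null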
01-28-2011, 09:53 PM | #4 | syg00 | LQ Veteran | Registered: Aug 2003 | Location: Australia | Distribution: Lots ... | Posts: 21,314
Quote:
Originally Posted by sneakyimp
-- numerous backup scripts zipping enormous amounts of data to the same hard drive that is running the OS and which contains the database, etc.
Seriously bad configuration.
As it's a server, do you have auditd running? It should be able to tell you. You can also turn PPID on in top and see if it shows anything useful; ps can export the same data.
If nothing else works, set up a wrapper around the command(s) of interest and have it spit out a message identifying the caller.
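A rough sketch of both suggestions (the audit key and log path are placeholders I made up):
Code:
# auditd: watch executions of gzip, then query who launched it
# (the ppid and auid fields appear in the audit records).
auditctl -w /bin/gzip -p x -k nightly-gzip
ausearch -k nightly-gzip

# Wrapper: move the real binary aside and log every caller.
mv /bin/gtar /bin/gtar.real
cat > /bin/gtar <<'EOF'
#!/bin/sh
# Record who called us, then hand off to the real gtar.
echo "$(date) gtar called by PID $PPID: $(ps -o args= -p $PPID)" \
    >> /var/log/gtar-callers.log
exec /bin/gtar.real "$@"
EOF
chmod 755 /bin/gtar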
02-03-2011, 07:58 PM | #5 | sneakyimp | Senior Member | Registered: Dec 2004 | Posts: 1,056 | Original Poster
Thanks for the responses here. It may have been a bad idea, but we sent the cron job issue back to the tech support crew at the hosting company as it was they who screwed it up in the first place. Looks like we'll be moving the server elsewhere eventually, so we're limping by in the meantime.
Syg00:
Although I don't know what it does, auditd does appear to be running:
Code:
root@server [~]# ps -aux | grep auditd
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.7/FAQ
root 560 0.0 0.0 0 0 ? S< Jan28 0:00 [kauditd]
root 3132 0.0 0.0 92888 928 ? S<sl Jan28 0:10 auditd
root 31817 0.0 0.0 61160 716 pts/0 D+ 18:55 0:00 grep auditd
Any tips on how to get info from it?
PPID? Do you mean process IDs? I turned on the command-line column, which didn't look too useful for tracing the origin.
What do you mean by 'set up a wrapper around the commands of interest'?
02-03-2011, 08:40 PM | #6 | jlinkels | LQ Guru | Registered: Oct 2003 | Location: Bonaire, Leeuwarden | Distribution: Debian Jessie/Stretch/Sid, Linux Mint DE | Posts: 5,196
A loaded server becomes slow, but it does not crash. Not Linux.
If the archive processes crash the server, it could be because of a lack of disk space. Are you low on free space?
You have to start somewhere. Does the machine crash at exactly the same time? You could start a top command in batch mode, running every second, and pipe the output into a file. Once the server crashes, the top command stops as well, and the tail of the file is a post-mortem dump showing whether any processes were using excessive resources.
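A minimal sketch (the log path and iteration count are arbitrary):
Code:
# Record top once per second for up to 24 hours; nohup keeps it running
# if the shell dies, and the tail of the log survives the crash.
nohup top -b -d 1 -n 86400 >> /var/tmp/top-watch.log 2>&1 &

# After the crash, read the last entries:
tail -n 5000 /var/tmp/top-watch.log | less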
jlinkels
02-04-2011, 03:42 AM | #7 | business_kid | LQ Guru | Registered: Jan 2006 | Location: Ireland | Distribution: Slackware, Slarm64 & Android | Posts: 17,213
Sorry for being unclear. I was suggesting setting the time to just after the crash, then running find -atime & find -ctime so you could see the most recently accessed and created files. But as you say, if it's online, setting the time is unwise.
02-04-2011, 09:56 AM | #8 | sneakyimp | Senior Member | Registered: Dec 2004 | Posts: 1,056 | Original Poster
Quote:
Originally Posted by jlinkels
A loaded server becomes slow, but it does not crash. Not Linux.
If the archive processes crash the server, it could be because of a lack of disk space. Are you low on free space?
You have to start somewhere. Does the machine crash at exactly the same time? You could start a top command in batch mode, running every second, and pipe the output into a file. Once the server crashes, the top command stops as well, and the tail of the file is a post-mortem dump showing whether any processes were using excessive resources.
jlinkels
A well-configured Linux server doesn't crash, but one configured by a hack may crash under adverse circumstances.
The hard drive is about 50% full, which should be enough to keep trudging along. I've instructed them to add another hard drive which, amazingly, is going to require a different machine because there's no room for another drive in the chassis.
The machine does not always crash at exactly the same time, but closely enough (early AM) that we suspect the backup processes. I learned that there were a number of different (paranoid) backup jobs trying to gzip dozens of GB of images from one place on the hard drive to another for some inexplicable reason, overwhelming both the CPU and the hard drive for hours at a time. I've done my best to put an end to that backup nonsense, and the server has now been up about a week with no crashes, AFAIK.
If the problems continue, the top output in batch mode sounds pretty good. I wish there were some way to locate the origin point of a given process, though (e.g., launched from a cron job, an apache process, etc.). That would make life so much easier.
02-04-2011, 09:58 AM | #9 | sneakyimp | Senior Member | Registered: Dec 2004 | Posts: 1,056 | Original Poster
Quote:
Originally Posted by business_kid
Sorry for being unclear. I was suggesting setting the time to just after the crash, then running find -atime & find -ctime so you could see the most recently accessed and created files. But as you say, if it's online, setting the time is unwise.
I appreciate your suggestion, but yes, the server is a production server. The ill-advised backup procedures were not initially a problem because the site had few files; as image files have been uploaded, the site has grown so large that the backup processes have become onerous.
Also, knowing what the output files are is not nearly as helpful as knowing their provenance! I want to know who spawns the processes that are chewing up resources.
02-04-2011, 10:17 AM | #10 | jlinkels | LQ Guru | Registered: Oct 2003 | Location: Bonaire, Leeuwarden | Distribution: Debian Jessie/Stretch/Sid, Linux Mint DE | Posts: 5,196
Quote:
Originally Posted by sneakyimp
If the problems continue, the top output in batch mode sounds pretty good. I wish there were some way to locate the origin point of a given process, though (e.g., launched from a cron job, an apache process, etc.). That would make life so much easier.
You can add the 'b' (PPID) column in top to display the parent process. You can also write a simple script that calls ps aux every second and pipes the output into a file. You could even grep for the gzip processes, take the PID, and cat /proc/<pid>/status to read its PPid: line.
I also remember now there is a command called pstree which shows the complete process tree. Maybe that sheds some light.
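A sketch of both ideas (the log path is arbitrary; PID 8902 is the gzip from your earlier top output):
Code:
# Log the parent PIDs of the suspect processes once per second.
while sleep 1; do
    ps -eo pid,ppid,user,args | awk '/[g]zip|[g]tar/'
done >> /var/tmp/ps-watch.log

# Or walk one ancestor chain by hand; repeat with each PPid value
# until you reach crond, httpd, or another recognizable parent:
grep PPid /proc/8902/status

# Whole-system view with PIDs:
pstree -p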
jlinkels
02-04-2011, 11:21 AM | #11 | sneakyimp | Senior Member | Registered: Dec 2004 | Posts: 1,056 | Original Poster
Thanks for the tips, jlinkels. I'll be tinkering with those commands when I get a chance.
BTW, did you ever get the function buttons or the bluetooth running on your wife's eeePC?