Server overload apache/php/mysql

jbenezech · 06-27-2012, 11:24 PM

Hi all,
I'm posting this in the apache forum although the topic is broader I think.
I've been asked to look at a site generating frequent 500 internal errors. The site is an apache/joomla/mysql install run on a dedicated linux server.
Looking at the server, I see extremely heavy load averages: around 30 in normal hours and up to 90 during busy hours.
I figure there is some sort of configuration problem but haven't been able to find where (apache, mysql, php). Note that mod_suphp is enabled and php scripts are run as site owner. Php5 used.
I'd appreciate your inputs.

Here are some infos:

Apache:

Apache/1.3.42 (Unix) mod_gzip/1.3.26.1a mod_log_bytes/1.2 mod_bwlimited/1.4 mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 configured

Top output (2208 http connections at the time):

top - 10:30:55 up 28 days, 16:48, 1 user, load average: 44.47, 38.04, 36.38
Tasks: 285 total, 4 running, 275 sleeping, 0 stopped, 6 zombie
Cpu(s): 45.3%us, 9.9%sy, 0.0%ni, 43.6%id, 1.1%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 12462248k total, 5214072k used, 7248176k free, 409068k buffers
Swap: 2104504k total, 144k used, 2104360k free, 3732868k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4597 myuser 17 0 30568 13m 6840 R 95.9 0.1 0:00.49 php
4485 myuser 16 0 100m 21m 7296 R 84.1 0.2 0:02.05 php
23401 mysql 15 0 2432m 366m 4296 S 68.5 3.0 614:01.70 mysqld
4567 myuser 16 0 105m 24m 7560 R 54.8 0.2 0:01.90 php
4569 myuser 16 0 108m 28m 7672 S 50.9 0.2 0:01.97 php
4589 myuser 17 0 96252 13m 6932 R 47.0 0.1 0:00.33 php
4595 myuser 17 0 0 0 0 Z 37.2 0.0 0:00.20 php <defunct>

mysqladmin proc stat

Uptime: 71024 Threads: 22 Questions: 17147915 Slow queries: 7466 Opens: 3723 Flush tables: 1 Open tables: 1950 Queries per second avg: 241.438

zer0signal · 06-28-2012, 01:19 AM

what are the zombie processes i see top is showing...

jbenezech · 06-28-2012, 02:37 AM

I think it's only finishing processes (php, httpd, exim). No process is staying in zombie state, although some php are staying in that state for a few seconds.

unSpawn · 06-28-2012, 03:04 AM

500 return codes are the "bad" ones: bad configuration (internal error), bad code, bad time (server busy), etc, etc, so it does not suffice to post the server string and top information.
* BTW if you want to run 'top' then make the output more meaningful by temporarily renaming your ~/.toprc and saving this as ~/.toprc:

Code:

RCfile for "top with windows"
Id:a, Mode_altscr=0, Mode_irixps=1, Delay_time=1.000, Curwin=0
Def     fieldscur=AEHIOQTWKNMbcdfgjplrsuvyzX
        winflags=30201, sortindx=10, maxtasks=0
        summclr=1, msgsclr=1, headclr=3, taskclr=1
Job     fieldscur=ABcefgjlrstuvyzMKNHIWOPQDX
        winflags=62777, sortindx=0, maxtasks=0
        summclr=6, msgsclr=6, headclr=7, taskclr=6
Mem     fieldscur=ANOPQRSTUVbcdefgjlmyzWHIKX
        winflags=62777, sortindx=13, maxtasks=0
        summclr=5, msgsclr=5, headclr=4, taskclr=5
Usr     fieldscur=ABDECGfhijlopqrstuvyzMKNWX
        winflags=62777, sortindx=4, maxtasks=0
        summclr=3, msgsclr=3, headclr=2, taskclr=3

and then run this one shot invocation: 'top -u myuser -n 1 -H 2>&1 | tee /path/to/top.log'.
- Take stock of the exact application versions that make up the web stack (web server, database and interpreter) and get a grip on what runs on top of that so that means Joomla, its configuration and including all plug-ins and homebrewn scripts. If versions don't match with the latest stable the vendor supplies you know what to do.
- Look for clues in system and daemon log files and include recently rotated log files. Logwatch can generate a report for you, making it easier to look for leads.
- Watching performance, start with mysqltop(-like) process list watching like 'watch -d 'mysql -e"show processlist;"';' or mtop or mytop, and wtop/logrep or any apachetop(-like) tool. As for the ^.*top$ tools watch out as there's different implementations around, some seemingly more recent than others.

jbenezech · 06-28-2012, 03:52 AM

Thanks for the tips. I know query optimization is definitely needed given joomla framework but there must be something else that brings down this server which should be able to handle the load (8 processors, 12GB mem). I've seen some problems in logs (premature end of script header for instance) but which only happen when the server is overloaded. The 500 errors also only happen when there is too much load.
I'll start with mtop and see what happens.

unSpawn · 06-28-2012, 04:46 AM

Quote:

Originally Posted by jbenezech

I've seen some problems in logs

I choose dealing with cold, hard unambiguous data over accounts of things. Could it be that in your case more pairs of eyeballs could see more? As in attach any reporting and whatever tool output you have?

syg00 · 06-28-2012, 05:45 AM

In addition to all that, when you see problems, run these and post the output

Code:

top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}' > topsave.txt
grep -i blocked /proc/stat

The first will list tasks in uninterruptible sleep, and the second the number blocked on I/O (no, they ain't necessarily the same).

zer0signal · 06-28-2012, 12:53 PM

I may be out of my league here, but do you have Sysstat running? You could use it too seem whats going on in a 24hr/10day range..

man sar

jbenezech · 07-04-2012, 05:00 AM

Sorry I could not post more details earlier. I was trying some tools you suggested and tried to see what was relevant or not in the logs.

First of all, regarding config:
Apache:
Apache/1.3.42 (Unix) mod_gzip/1.3.26.1a mod_log_bytes/1.2 mod_bwlimited/1.4 mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2.
8.31 OpenSSL/0.9.8e-fips-rhel5 configured -- resuming normal operations
[Wed Jul 4 05:32:07 2012] [notice] suEXEC mechanism enabled (wrapper: /usr/local/apache/bin/suexec)

Joomla: 1.5
PHP: 5
MYSQL: Server version: 5.0.91-community-log MySQL Community Edition

I had found some errors in the sys message log about full EXT3 directory. Cleaning the /tmp directory and adding a crontab to keep it clean solved that problem.

I changed some apache config, mainly put KeepAlive On again with a KeepAliveTimeout at 5 as I had noticed thousands of sockets in TIME_WAIT state.
MaxRequestsPerChild 250
MaxClients 1000
KeepAlive On
Timeout 300
KeepAliveTimeout 5
MaxKeepAliveRequests 250

Using mysqltop, I saw a number of queries being locked by constant table updates. These where caused by joomla updating the main table jos_content to increment page hit counter. I suppressed this counter.

After these changes, the server is still struggling with 30% load average but we did not notice 500 internals. We'll it's back today.

Here are some outputs at the time of the problem:

> netstat | grep http | wc -l
558

> netstat -alpn | grep :80 | awk '{print $5}' |awk -F: '{print $(NF-1)}' |sort | uniq -u | wc -l
69

> mysqladmin -i1 extended | grep Threads_connected
16

>ps -el | grep D
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
0 D 0 12436 12001 1 78 0 - 957 - pts/1 00:00:00 rm
5 D 504 22121 5073 0 86 10 - 5923 - ? 00:00:08 pure-ftpd

See also attached mysqlprocstat.txt, top.log as per recommandations previously, topsave.txt as per previous request

Finally, a strace of a processs showed a line with a very long execution time:
1114 read(3, "", 4096) = 0 <5.870374>

I could atach the trace file but it's quite big and I'm not sure which lines would be relevant. Please let me know.

syg00 · 07-04-2012, 05:22 AM

Well I expected to see (lots) more status "D" tasks - blows my theory away.
Pity you didn't get that blocked count I asked for as well ...

TenTenths · 07-04-2012, 06:08 AM

Your website must be generating some traffic if you're hitting 225 MySQL queries a second. My vBulletin site doing 1.5M pages a month only averages 85 Q/s

Have you considered that your site may need a dedicated database server instead of trying to run the web server and the database server on the same box?

jbenezech · 07-04-2012, 05:46 PM

Blocked count was 2, sorry didn't post. Thought the list was enough.
Regarding the queries per second on mysql, I find this number very suspicious. Monitoring mysql with mtop with a refresh of 2s, I see max 6-7 queries running. I don't know how to reset this number to start fresh stats. I think the site might have been victim of DOS attack 2 months ago and maybe then thousands of queries were sent and the average doesn't reflect current reality. Another theory (I don't know how mysql calculates this number) is maybe that every sub-query is counted as a query. Default joomla queries are usually very poorly optimized and made of inner/outer joins, sub-queries...

jbenezech · 07-05-2012, 12:05 AM

I managed to strace httpd process when getting a 500. The sudo php child process is being terminated by SIGKILL:

5685 poll([{fd=5, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) <0.000010>
5685 write(5, "\212\2\0\0\3SELECT * FROM jos_content I"..., 654) = 654 <0.053426>
5685 read(5, <unfinished ...>
5685 +++ killed by SIGKILL +++
32407 <... read resumed> "", 4096) = 0 <26.886731>
32407 --- SIGCHLD (Child exited) @ 0 (0) ---
32407 time(NULL) = 1341464085 <0.000005>
32407 write(15, "[Thu Jul 5 11:54:45 2012] [erro"..., 135) = 135 <0.000009>

Am I hitting some kind of server limit ?

ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 212992
max locked memory (kbytes, -l) 2048
max memory size (kbytes, -m) unlimited
open files (-n) 50000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 28670
cpu time (seconds, -t) unlimited
max user processes (-u) 14335
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

jbenezech · 07-05-2012, 07:22 PM

I did some more strace and it seems that the 500 are due to this SIGKILL. This KILL doesn't seem to be linked to any particular action (mysql, file system...):

5685 read(5, <unfinished ...>
5685 +++ killed by SIGKILL +++
32407 <... read resumed> "", 4096) = 0 <26.886731>
32407 --- SIGCHLD (Child exited) @ 0 (0) ---

19454 open("/opt/xslt//lib/libnss_files.so.2", O_RDONLY <unfinished ...>
19454 +++ killed by SIGKILL +++
32030 <... read resumed> "", 4096) = 0 <0.558116>
32030 --- SIGCHLD (Child exited) @ 0 (0) ---

16359 mkdir("/tmp/eaccelerator/8/d", 0777) = -1 EEXIST (File exists) <0.000010>
16359 umask(022) = 0 <0.000007>
16359 +++ killed by SIGKILL +++
32030 <... read resumed> "", 4096) = 0 <0.241756>
32030 --- SIGCHLD (Child exited) @ 0 (0) ---

jbenezech · 07-10-2012, 03:41 AM

Still no clue what's going on but I think I found what was killing the processes.
The host runs a perl script which checks every 5 seconds the load average and kills some processes, starting with the php processes if load average is too high. Processes are getting killed with SIGKILL which is why I guess the end user gets a 500 error.
Back to the root of the problem, this is happening because the server is over loaded on CPU but I still can't figure out why.

I don't think the problem comes from memory, there are always a few GB available.
Not sure it comes from IO either, iowait is always around 1%, await from iostat averages to 4.70 (but with picks to 300).

The top command shows processes using 100% CPU, mainly mysql and php.
I've run quite a few tools and the only thing I can see not being quite right is the sar output showing an average of 35 in run queue size.
Also the number of context switching seems high around 2000 /s.

So it looks like some processes get access to 100% CPU while the other ones wait. Next one gets 100% and so on. How come this is not better balanced ?