Server overload apache/php/mysql
Hi all,
I'm posting this in the apache forum although the topic is broader I think. I've been asked to look at a site generating frequent 500 internal errors. The site is an apache/joomla/mysql install run on a dedicated linux server. Looking at the server, I see extremely heavy load averages: around 30 in normal hours and up to 90 during busy hours. I figure there is some sort of configuration problem but haven't been able to find where (apache, mysql, php). Note that mod_suphp is enabled and php scripts are run as site owner. Php5 used. I'd appreciate your inputs. Here are some infos: Apache: Apache/1.3.42 (Unix) mod_gzip/1.3.26.1a mod_log_bytes/1.2 mod_bwlimited/1.4 mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 configured Top output (2208 http connections at the time): top - 10:30:55 up 28 days, 16:48, 1 user, load average: 44.47, 38.04, 36.38 Tasks: 285 total, 4 running, 275 sleeping, 0 stopped, 6 zombie Cpu(s): 45.3%us, 9.9%sy, 0.0%ni, 43.6%id, 1.1%wa, 0.0%hi, 0.2%si, 0.0%st Mem: 12462248k total, 5214072k used, 7248176k free, 409068k buffers Swap: 2104504k total, 144k used, 2104360k free, 3732868k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4597 myuser 17 0 30568 13m 6840 R 95.9 0.1 0:00.49 php 4485 myuser 16 0 100m 21m 7296 R 84.1 0.2 0:02.05 php 23401 mysql 15 0 2432m 366m 4296 S 68.5 3.0 614:01.70 mysqld 4567 myuser 16 0 105m 24m 7560 R 54.8 0.2 0:01.90 php 4569 myuser 16 0 108m 28m 7672 S 50.9 0.2 0:01.97 php 4589 myuser 17 0 96252 13m 6932 R 47.0 0.1 0:00.33 php 4595 myuser 17 0 0 0 0 Z 37.2 0.0 0:00.20 php <defunct> mysqladmin proc stat Uptime: 71024 Threads: 22 Questions: 17147915 Slow queries: 7466 Opens: 3723 Flush tables: 1 Open tables: 1950 Queries per second avg: 241.438 |
what are the zombie processes i see top is showing...
|
I think it's only finishing processes (php, httpd, exim). No process is staying in zombie state, although some php are staying in that state for a few seconds.
|
500 return codes are the "bad" ones: bad configuration (internal error), bad code, bad time (server busy), etc, etc, so it does not suffice to post the server string and top information.
* BTW if you want to run 'top' then make the output more meaningful by temporarily renaming your ~/.toprc and saving this as ~/.toprc: Code:
RCfile for "top with windows" - Take stock of the exact application versions that make up the web stack (web server, database and interpreter) and get a grip on what runs on top of that so that means Joomla, its configuration and including all plug-ins and homebrewn scripts. If versions don't match with the latest stable the vendor supplies you know what to do. - Look for clues in system and daemon log files and include recently rotated log files. Logwatch can generate a report for you, making it easier to look for leads. - Watching performance, start with mysqltop(-like) process list watching like 'watch -d 'mysql -e"show processlist;"';' or mtop or mytop, and wtop/logrep or any apachetop(-like) tool. As for the ^.*top$ tools watch out as there's different implementations around, some seemingly more recent than others. |
Thanks for the tips. I know query optimization is definitely needed given joomla framework but there must be something else that brings down this server which should be able to handle the load (8 processors, 12GB mem). I've seen some problems in logs (premature end of script header for instance) but which only happen when the server is overloaded. The 500 errors also only happen when there is too much load.
I'll start with mtop and see what happens. |
Quote:
|
In addition to all that, when you see problems, run these and post the output
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}' > topsave.txt |
I may be out of my league here, but do you have Sysstat running? You could use it too seem whats going on in a 24hr/10day range..
man sar |
3 Attachment(s)
Sorry I could not post more details earlier. I was trying some tools you suggested and tried to see what was relevant or not in the logs.
First of all, regarding config: Apache: Apache/1.3.42 (Unix) mod_gzip/1.3.26.1a mod_log_bytes/1.2 mod_bwlimited/1.4 mod_auth_passthrough/1.8 FrontPage/5.0.2.2635 mod_ssl/2. 8.31 OpenSSL/0.9.8e-fips-rhel5 configured -- resuming normal operations [Wed Jul 4 05:32:07 2012] [notice] suEXEC mechanism enabled (wrapper: /usr/local/apache/bin/suexec) Joomla: 1.5 PHP: 5 MYSQL: Server version: 5.0.91-community-log MySQL Community Edition I had found some errors in the sys message log about full EXT3 directory. Cleaning the /tmp directory and adding a crontab to keep it clean solved that problem. I changed some apache config, mainly put KeepAlive On again with a KeepAliveTimeout at 5 as I had noticed thousands of sockets in TIME_WAIT state. MaxRequestsPerChild 250 MaxClients 1000 KeepAlive On Timeout 300 KeepAliveTimeout 5 MaxKeepAliveRequests 250 Using mysqltop, I saw a number of queries being locked by constant table updates. These where caused by joomla updating the main table jos_content to increment page hit counter. I suppressed this counter. After these changes, the server is still struggling with 30% load average but we did not notice 500 internals. We'll it's back today. Here are some outputs at the time of the problem: > netstat | grep http | wc -l 558 > netstat -alpn | grep :80 | awk '{print $5}' |awk -F: '{print $(NF-1)}' |sort | uniq -u | wc -l 69 > mysqladmin -i1 extended | grep Threads_connected 16 >ps -el | grep D F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD 0 D 0 12436 12001 1 78 0 - 957 - pts/1 00:00:00 rm 5 D 504 22121 5073 0 86 10 - 5923 - ? 00:00:08 pure-ftpd See also attached mysqlprocstat.txt, top.log as per recommandations previously, topsave.txt as per previous request Finally, a strace of a processs showed a line with a very long execution time: 1114 read(3, "", 4096) = 0 <5.870374> I could atach the trace file but it's quite big and I'm not sure which lines would be relevant. Please let me know. |
Well I expected to see (lots) more status "D" tasks - blows my theory away.
Pity you didn't get that blocked count I asked for as well ... |
Your website must be generating some traffic if you're hitting 225 MySQL queries a second. My vBulletin site doing 1.5M pages a month only averages 85 Q/s
Have you considered that your site may need a dedicated database server instead of trying to run the web server and the database server on the same box? |
Blocked count was 2, sorry didn't post. Thought the list was enough.
Regarding the queries per second on mysql, I find this number very suspicious. Monitoring mysql with mtop with a refresh of 2s, I see max 6-7 queries running. I don't know how to reset this number to start fresh stats. I think the site might have been victim of DOS attack 2 months ago and maybe then thousands of queries were sent and the average doesn't reflect current reality. Another theory (I don't know how mysql calculates this number) is maybe that every sub-query is counted as a query. Default joomla queries are usually very poorly optimized and made of inner/outer joins, sub-queries... |
I managed to strace httpd process when getting a 500. The sudo php child process is being terminated by SIGKILL:
5685 poll([{fd=5, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) <0.000010> 5685 write(5, "\212\2\0\0\3SELECT * FROM jos_content I"..., 654) = 654 <0.053426> 5685 read(5, <unfinished ...> 5685 +++ killed by SIGKILL +++ 32407 <... read resumed> "", 4096) = 0 <26.886731> 32407 --- SIGCHLD (Child exited) @ 0 (0) --- 32407 time(NULL) = 1341464085 <0.000005> 32407 write(15, "[Thu Jul 5 11:54:45 2012] [erro"..., 135) = 135 <0.000009> Am I hitting some kind of server limit ? ulimit -a: core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 212992 max locked memory (kbytes, -l) 2048 max memory size (kbytes, -m) unlimited open files (-n) 50000 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 28670 cpu time (seconds, -t) unlimited max user processes (-u) 14335 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited |
Sigkill
I did some more strace and it seems that the 500 are due to this SIGKILL. This KILL doesn't seem to be linked to any particular action (mysql, file system...):
5685 read(5, <unfinished ...> 5685 +++ killed by SIGKILL +++ 32407 <... read resumed> "", 4096) = 0 <26.886731> 32407 --- SIGCHLD (Child exited) @ 0 (0) --- 19454 open("/opt/xslt//lib/libnss_files.so.2", O_RDONLY <unfinished ...> 19454 +++ killed by SIGKILL +++ 32030 <... read resumed> "", 4096) = 0 <0.558116> 32030 --- SIGCHLD (Child exited) @ 0 (0) --- 16359 mkdir("/tmp/eaccelerator/8/d", 0777) = -1 EEXIST (File exists) <0.000010> 16359 umask(022) = 0 <0.000007> 16359 +++ killed by SIGKILL +++ 32030 <... read resumed> "", 4096) = 0 <0.241756> 32030 --- SIGCHLD (Child exited) @ 0 (0) --- |
Still no clue what's going on but I think I found what was killing the processes.
The host runs a perl script which checks every 5 seconds the load average and kills some processes, starting with the php processes if load average is too high. Processes are getting killed with SIGKILL which is why I guess the end user gets a 500 error. Back to the root of the problem, this is happening because the server is over loaded on CPU but I still can't figure out why. I don't think the problem comes from memory, there are always a few GB available. Not sure it comes from IO either, iowait is always around 1%, await from iostat averages to 4.70 (but with picks to 300). The top command shows processes using 100% CPU, mainly mysql and php. I've run quite a few tools and the only thing I can see not being quite right is the sar output showing an average of 35 in run queue size. Also the number of context switching seems high around 2000 /s. So it looks like some processes get access to 100% CPU while the other ones wait. Next one gets 100% and so on. How come this is not better balanced ? |
All times are GMT -5. The time now is 09:05 AM. |