Wait IO suddenly extremely high (exploit ?), crashing the server

gcat · 04-07-2012, 03:33 PM

Hello,

My server is constantly crashing or slowing down, and as my hosting company does nothing, I come here for some help ! This is my first time on this forum, and I am quite a newbie at Linux, so I will try to give all the relevant information but please tell me whatever you would need to better understand the problem...

My system :
Cent OS 6
Linux 2.6.32-71.29.1.el6.x86_64
Plesk Parallels Panel 10.4.4

The problem is :
Since last week, the server suddenly bore a huge load on "Wait-IO", slowing down the pages, crashing the system, even melting the processor several times, as we can see on these two images of my CPU load : http://img27.imageshack.us/img27/3989/imagesga.png (the load during the whole week, where we can see the Wait IO load starting) and http://img19.imageshack.us/img19/5351/image1tw.png (12 hours, more recently).

Is it caused by a virus forcing IO commands to increase the load ?
As I was told that could be the reason, the first thing I did was to change my Plesk password, as well as the one to access the server with Shell. I then launched a scan via the 'watchdog' security module of Plesk (rootkit hunter v. 1.3.4) which found several problems. After launching a few times, the number of infections decreased, and now remain the following :
Performing trojan specific checks
Checking for enabled xinetd services [ Warning ]
Performing system boot checks
Checking for local host name [ Found ]
Checking for system startup files [ Found ]
Performing group and account checks
Checking for passwd file [ Found ]
Performing system configuration file checks
Checking for SSH configuration file [ Found ]
Checking if SSH root access is allowed [ Not set ]
Checking if SSH protocol v1 is allowed [ Not allowed ]
Checking for running syslog daemon [ Found ]
Checking for syslog configuration file [ Found ]
Performing filesystem checks
Checking /dev for suspicious file types [ None found ]
Checking for hidden files and directories [ Warning ]
Checking version of Apache [ Warning ]

=> How can I delete the infection ? Can't I install an antivirus deleting or repairing the corrupted files ? Is it really related ... ?

I also ran iotop and often get "jbd2" or "mysqld" as the processes that provoke the biggest load (such as the following) :

TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
542 be/3 root 0.00 B/s 37.73 K/s 0.00 % 99.99 % [jbd2/dm-1-8]
19885 be/4 apache 0.00 B/s 3.77 K/s 0.00 % 99.99 % httpd
502 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.86 % [kdmflush]
1675 be/4 root 3.66 K/s 10.98 K/s 0.00 % 0.00 % sw-collectd -C /etc/sw-collectd/collectd.conf

Please tell me if you need any log to help ...

Thank you in advance for your help, I hope you have a few ideas to help me resolve this problem !

Noway2 · 04-07-2012, 04:35 PM

Off the top of my head, my first guess would be that you're running out of memory. You should look at the process utilization and see if your memory is full and you are utilizing swap space. This could account for a lot of time spent in an IO wait state.
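
A minimal sketch of one way to check swap use, assuming a Linux box that exposes /proc/meminfo. The function name swap_used_kb is made up for this example; it only subtracts SwapFree from SwapTotal.

```shell
#!/bin/sh
# Print swap in use (kB), given /proc/meminfo-style text on stdin.
# swap_used_kb is a hypothetical helper name, not a standard tool.
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }'
}

# On a Linux box, feed it the live values.
if [ -r /proc/meminfo ]; then
    echo "swap used: $(swap_used_kb < /proc/meminfo) kB"
fi
```

A single reading says little on its own. What matters for IO wait is whether pages are actively moving in and out of swap, which the si/so columns of "vmstat 5" show over time.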

The big process in the list is JBD, which is part of your journaling file system: http://en.wikipedia.org/wiki/Journaling_block_device, though this could be reflective of other activity. Following that you have Apache (httpd) and your statistics collection process from Plesk consuming your resources. If you're also seeing MySQL as a heavy user, I would look at what you're using MySQL for, perhaps something that is logging a lot of information to it. Or are you running some form of web service that would be writing a lot of information to a database?

Lastly, what do your logs indicate? Does dmesg show any type of error? What about the other logs, such as syslog or messages, http logs, and daemon log? Do they show any signs of trouble? Is there any sort of indexing going on in this system, perhaps?

I would recommend looking for mundane causes of trouble before you start down the path of intruder detection and only begin a compromise investigation after you have uncovered evidence and signs of an intrusion. As far as the warnings for RKHunter, I would advise you to read the manual and the configuration files as most of the time the warnings are false positives and once proven as such are easily dealt with. Before you remove the warnings, be sure to verify that the issue is legitimate for your distribution.

gcat · 04-08-2012, 02:59 AM

Thank you Noway2 for your help,

- Lack of Memory :
It doesn't seem to be running out of memory, as we can see on the following image, the "CPU usage" is full red because of "wait-IO" while the memory usage is only 47%.
http://img42.imageshack.us/img42/885...parallelsp.png
(by the way I think I am faulty in my my.cnf parameters as I use 3 CPU cores and very few of my memory.)
My my.cnf settings are :
max_connections = 201
thread_cache = 8
query_cache_size = 64M
log_slow_queries = ON
table_cache = 350 (I have 186 tables in my own db)
key_buffer_size = 4M
tmp_table_size = 32M
sort_buffer = 4M
join_buffer_size = 512K

- Swap space :
How could I check it ?
According to "top" my usage is now :

Quote:

Swap: 1959920k total, 173588k used, 1786332k free, 407512k cached

- database usage :
I don't have lots of write queries on my database. Most of those queries (maybe 99%) are only 'select' (this is a typical web magazine with information about music).
It indeed seems that my mysql logs are huge.
My mysqld.log weighs 75 Mb (since January 14th), mostly with errors related to tables that have crashed (when the server crashed because of the huge "wait IO" it corrupted several tables).

- logs
Here the logs you asked for, but I don't know how to find any error in them ...
* dmesg :
http://freetexthost.com/kigzeru6lm
* messages :
http://freetexthost.com/2utkayogag
* syslog, daemon, httpd :
Where can I find them ?
I tried on /var/log/httpd/ which contains 10 'access_log-date', and 'error_log-date' weighing 10Mb, mostly with lines as

Quote:

::1 - - [01/Apr/2012:04:23:36 +0200] "OPTIONS * HTTP/1.0" 200 - "-" "Apache (internal dummy connection)"

for access_log, or

Quote:

Warning: Directive 'safe_mode' is deprecated in PHP 5.3 and greater in Unknown on line 0

for error_log

One important fact to notice is that these "wait IO" suddenly disappear for a few hours, slowly increase and remain huge for some time (a day or two...)

I hope this information will help in detecting the problem, whatever it is ...
Please tell me if I should look into other files

Thank you in advance for your answer !

Noway2 · 04-08-2012, 08:54 AM

The dmesg log doesn't indicate a hardware failure, so unless it is something that would only appear on the host machine we can rule that out. You're also not running out of memory, so it isn't processes being swapped in and out. It is interesting that in your initial post HTTPD and MySQL are both engaging in heavy write access, yet in your follow-up post you indicate that you should have mostly read queries, not write queries. Coupled with the large number of warning messages that are being written, presumably whenever someone accesses your system, I am wondering if this may be the cause of the usage or at least exacerbating the problem. Let's start by seeing if we can eliminate or at least reduce the amount of logging caused by the spurious messages, which may free up the disk and thereby reduce the IO wait.

First, have a look at this page: http://syslog.tv/2010/03/27/apache-i...my-connection/ It talks about the dummy connections, which apparently are a means for Apache to wake up the child handlers. There are a couple of ways to suppress or eliminate this message. The article shows one (using a rewrite rule) and I think it links to other options from the Apache wiki.

Second, this is a PHP based warning. See this link: http://forum.parallels.com/showthread.php?t=113374 It says in order to disable this message to remove safe_mode = "On/Off" from your php.ini.

Third, it seems like those log files are huge. I am thinking that you may need to look at logrotate to see if it is set to rotate those log files. I am near 100% certain that it handles at least some Apache log files, but if they are growing for days and weeks, you may need to either tweak its configuration files or add an item for those log files. Remember you will need to tell the process to "hang up" or restart after closing down its log file. In and of itself, a large log shouldn't make much of a difference as you're only appending, but it still seems odd.

gcat · 04-09-2012, 11:22 AM

Thanks for your analysis and reply !

The change in .htaccess from the page http://syslog.tv/2010/03/27/apache-i...my-connection/ led to 500 internal errors on all the pages, so I stopped it (will try later), but the change in php.ini (I put a ";" before safe_mode = ON to ignore it) seems to be ok.
I also checked some of the logrotate that were configured to rotate every 5 Mb so it seemed ok. ( I have not checked everything yet). I restarted the server just after.
But, is it really related to this high wait IO problem ?

All the day the server kept an incredible high but fluctuating load due to wait-IO.

iotop gave me the following processes running while the wait-IO was at its peak
(I copied the busiest processes running at different times and put them in the link below) :

http://freetexthost.com/xojeef6go5
Does it give you any hint about the source of the problem ?

Thanks again !

Noway2 · 04-09-2012, 09:32 PM

I will need to look at your iotop output more carefully than I have time to do right at the moment. Specifically, I noticed that data (% data) is in multiple columns. I would be curious to know if that is percentage read and percentage write. I would also be curious if eliminating the PHP warnings has had any effect in reducing the percentage that HTTPD is contributing to your IO wait. From your last log output, it looks like MySQL is the big user overall; let's make sure this is read activity and possibly try to associate it with legitimate connections.

I think you are faced with four possible scenarios:
1 - Your virtualized server doesn't have enough horsepower for the job.
2 - You have a process that is errantly consuming a lot of resources.
3 - You have been compromised and/or are being subjected to a type of DoS (denial of service) attack.
4 - There is a problem elsewhere on your virtual host (another client) and you are stuck in an IO wait because of them.

In each of the cases, the first question to be addressed is exactly what is causing the resource drain. It looks like you have a lot of activity updating your journaling file system, coupled with a lot of write activity that doesn't have an obvious cause. Your data shows a large number of log entries being written, so it seems logical to try and remove these from the equation to see if something remains. Clearly something is amiss, as you are getting a lot of write activity but no discernible explanation. I think your priority should be to try and see what is causing this.

As far as the 500 errors with .htaccess go, you also need to configure the AllowOverride directive in your vhost declaration to get it to accept a .htaccess. Offhand I am not sure whether this is the cause of the error, but your log file may tell you.

If you wish to engage in a parallel path, I would recommend looking at the output of "lsof -pwn" and "netstat -pane" for starters and examine your connections and what files/resources are in use. Similarly, look for any modified web files and also hidden files, and especially any files with setuid. To perform a full-out compromise investigation would be more intrusive to your system than the evidence currently warrants, so let's go with this approach.

As far as option 4, I think once we have gathered enough evidence, your provider will have a much harder time ignoring you, and once you have ruled out all the possibilities on your end, you will have grounds to go back to them.

gcat · 04-10-2012, 08:30 AM

Hello,

Thanks again for your help.

- IOTOP :
As for iotop, the percentages are respectively SWAPIN and IO> (with IO> reaching 99.9% when the Wait IO is at its maximum), but I don't know which are for write or for read.
The whole line concerning the 'mysql' process in iotop is actually the following one :

Quote:

TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1842 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/~ --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock

This line popped up at 99.9% while the server reached its peak of Wait IO, but I don't see any error in it ...

- .htaccess
That's strange ! I already have a .htaccess which works fine, with some lines very close :

Quote:

RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot
RewriteRule ^.*$ - [F]

But when I replace it with

Quote:

RewriteCond %{HTTP_USER_AGENT} ^.*internal dummy connection.*$ [NC]
RewriteRule ^/$ /blank.html [L]

then all my pages turn to 500 internal errors !
* Is there any mistake in these two lines ?

- Huge logs :
Via Plesk, I had the confirmation that the "access_log" and "error_log" files are rotating properly (although I don't know where these files are on the server !) but the mysqld.log file kept growing since I started on this server on December 14th, and the file weighed 78 Mb (full of "table XXX has crashed") ! I tried the following link to set up a logrotate http://www.question-defense.com/2009...out-a-password but that didn't seem to work. All the points 1 to 5 are ok, but when I type in "/usr/bin/mysqladmin flush-logs" in the command line, there is no change in my log files in /var/log/*.log !
I then renamed the logs and created new ones (and shared their ownerships with mysql). Now my server seems ok (wait IO is approximately 10% of my CPU, although it's still high compared to one month ago). I hope it will stay like this, but I'm afraid it will happen again ...
* Is there another way to configure a rotation of the logs ?
* By the way, I tried to copy the entire log to my desktop to read it easily, but is there any easy way to export a file from my server ? (apart from the scp command that I don't really understand !)

One problem is that I have lots of tables crashing when the server dies because of these Wait IO peaks.
* Is there a way to automatically detect this and repair these tables without doing a cron ?

- lsof
I tried lsof -pwn and netstat (but at a time when the wait IO was low !) and it gave me this : (I don't know its meaning ...)

Quote:

lsof: illegal process ID: wn
lsof 4.82
latest revision: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/
latest FAQ: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/FAQ
latest man page: ftp://lsof.itap.purdue.edu/pub/tools/unix/lsof/lsof_man
usage: [-?abhlnNoOPRtUvVX] [+|-c c] [+|-d s] [+D D] [+|-f[gG]]
[-F [f]] [-g [s]] [-i [i]] [+|-L [l]] [+m [m]] [+|-M] [-o [o]] [-p s]
[+|-r [t]] [-s [p:s]] [-S [t]] [-T [t]] [-u s] [+|-w] [-x [fl]] [--] [names]

And the netstat gave me : http://freetexthost.com/obsbyymtcp
(lots of TIME_WAIT as below)

Quote:

tcp 0 0 ::ffff:server_ip:80 ::ffff:212.126.28.133:58179 TIME_WAIT 0 0 -

* How to interpret these ? I don't see the files or resources in use...
I don't know 'setuid'. How can I use it ?

- MySQL
* Is there a way to check the illegitimate connections, or DOS attacks ? With iptables maybe ?

As for the option 1, I don't think I lack horsepower, as usually my CPU is only at 50% (reaching at best 80%), as well as the memory which is between 10% to 50%.

gcat · 04-10-2012, 10:18 AM

The huge wait IO load happened again, for 30 minutes ! (see http://img718.imageshack.us/img718/4606/8216525137.png)

I stumbled upon this thread on another forum, which seems extremely similar (and recent, only a few days ago, so the same as me) ! http://www.webhostingtalk.com/showthread.php?t=1142717

I checked netstat at this moment, and much like the problem explained on this page, I found lots of crawlers ... and the baiduspider too !

Quote:

tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-184.g:51133 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http 66.154.119.75.static.:56942 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-237.g:63017 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-4.goo:43073 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-184.g:50676 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-81.go:55615 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http 66.154.119.75.static.:57003 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-145.g:41998 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-119.g:51817 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-199.g:43176 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http host116-245-dynamic.1:58278 ESTABLISHED
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-195.g:35036 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-231.g:36577 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-163.g:51104 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http baiduspider-180-76-5-:20947 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-170.g:56828 TIME_WAIT
tcp 0 0 s15954151.onlinehome-s:http crawl-66-249-72-232.g:39344 TIME_WAIT

Should I try to block this ? How can I do this (with .htaccess?)

At this moment, iotop gave this :

Quote:

TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
16777 be/4 mysql 408.46 K/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16558 be/4 mysql 172.12 K/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
1688 be/4 root 10.28 K/s 7.71 K/s 0.00 % 99.99 % sw-collectd -C /etc/sw-collectd/collectd.conf
17108 be/4 apache 251.75 K/s 0.00 B/s 0.00 % 99.99 % httpd
16645 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 99.99 % httpd
17113 be/4 apache 5.14 K/s 0.00 B/s 0.00 % 99.99 % httpd
16547 be/4 mysql 220.93 K/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16517 be/4 mysql 43.67 K/s 0.00 B/s 2.23 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16246 be/4 mysql 503.51 K/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16259 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16474 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16673 be/4 apache 138.72 K/s 5.14 K/s 0.00 % 99.99 % httpd
16560 be/4 mysql 482.95 K/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16697 be/4 mysql 30.83 K/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
15549 be/4 popuser 331.39 K/s 0.00 B/s 0.00 % 99.99 % imapd Maildir
16511 be/4 mysql 110.46 K/s 0.00 B/s 0.00 % 99.99 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
17109 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 99.99 % httpd
17071 be/4 psaadm 292.86 K/s 0.00 B/s 0.00 % 91.54 % sw-engine-cgi -c /usr/local/psa/admin/conf/php.ini -d auto_prepend_file=auth.php3 -u psaadm
16622 be/4 mysql 41.10 K/s 0.00 B/s 0.00 % 30.78 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock
16648 be/4 mysql 28.26 K/s 0.00 B/s 99.99 % 28.89 % mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock

As for the .htaccess, I wrote the same thing as on this forum
RewriteCond %{HTTP_USER_AGENT} ^.*internal\ dummy\ connection.*$ [NC]
RewriteRule .* - [F,L]
with \ to escape the blanks and that seems to work (at least I don't have 500 internal errors) but I still have some

Quote:

::1 - - [10/Apr/2012:17:18:16 +0200] "OPTIONS * HTTP/1.0" 200 - "-" "Apache (internal dummy connection)"

Anyway, I am glad to have progressed a little !

Noway2 · 04-11-2012, 06:52 AM

Based upon the data you are collecting, I am starting to get a picture of what is happening in your system. Let me try to address your questions and see if I can try to explain what I think is happening.
1 - I think your data is conclusively showing IO wait due to MySQL as being the resource hog in your system. I think this is seen directly in your log files and is circumstantially supported by the large number of TIME_WAIT entries in your netstat which is reflective of a large number of connections having been recently established.
2 - I must apologize, I had an error in the parameters of the lsof command. It should have been lsof -Pwn, note the capital.
3 - Off hand I am not sure what is wrong with the .htaccess. I would need to do some more searching to see if there is an alternate syntax or something. Apache's rewrite rule is not something I have had to use very often.
4 - You are getting a lot of search engine crawler traffic. Seeing as your problem seems to be related to the number of connections, you may consider using a robots.txt file to tell them to go away, at least while you're trying to diagnose the problem.
5 - How to interpret the netstat? Basically it is an indication of a terminated connection. See the Wikipedia page on TCP. The part of interest here is:

Quote:

TIME-WAIT : Represents waiting for enough time to pass to be sure the remote peer received the acknowledgment of its connection termination request. According to RFC 793 a connection can stay in TIME-WAIT for a maximum of four minutes, known as two MSL (maximum segment lifetime).

6 - Looking at your netstat output, I agree you have a LARGE number of connections occurring and this, possibly coupled with logging, is my guess as to what is bogging your system down. The question is, is the traffic legitimate or is it a form of DoS? You will need to take a closer look at the connections to determine that. What I mean is, are the connections and access activity such that someone would be legitimately browsing your content, or are they rapid connect-disconnect bursts (possibly a GET flood)?

7 - Looking at your netstat output, I don't see anything out of the ordinary that would indicate a malicious process. I did notice that you're running FTP, which isn't the safest of tools out of the box. Specifically, please look at this line and determine if this was your (legitimate) traffic:

Code:

tcp 0 0 server_ip:21 61.215.66.66:57143 ESTABLISHED 0 531327 11773/proftpd: nano

You also have a lot of socket stream connections from Master, which is Postfix. The number seemed a little large, but I would need to compare against another system to be sure.

8 - I would also recommend looking at this line:

Code:

tcp 0 52 server_ip:22 61.215.66.66:59968 ESTABLISHED 0 28318 2027/0

It looks like you may be logging into SSH as root. If so, I would strongly discourage this practice. Instead set up a non privileged user with key based authentication and then su to root as needed.

9 - the Logrotate page you linked to looks ok from what I can tell. It does take you through the steps to create a logrotate rule. One thing about log rotate is that it will only touch a file once in a 24 hour period, so it is possible that when you tried to do something that it politely refused because it had already touched the file or directory. I would also look to make sure you don't have an existing MySQL rule, you don't want a duplicate.
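
For the mysqld.log rotation in point 9, a minimal sketch of what a rule could look like. It writes the rule into a scratch directory for review rather than into /etc/logrotate.d/. OUT_DIR, the weekly/rotate 4 schedule and the 640 mode are assumptions for illustration, and the postrotate step assumes mysqladmin can find credentials (for example in root's ~/.my.cnf).

```shell
#!/bin/sh
# Write a candidate logrotate rule to a scratch directory for review.
# OUT_DIR is a placeholder; the real rule would live in /etc/logrotate.d/.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

cat > "$OUT_DIR/mysqld" <<'EOF'
/var/log/mysqld.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    create 640 mysql mysql
    postrotate
        # Ask mysqld to reopen its log file (assumes stored credentials).
        /usr/bin/mysqladmin flush-logs
    endscript
}
EOF

echo "rule written to $OUT_DIR/mysqld"
# Dry run that prints what would happen and changes nothing:
#   logrotate -d "$OUT_DIR/mysqld"
```

Running logrotate with -d first is a cheap way to see whether the rule parses and whether it would touch the file at all before installing it.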

I would focus on looking at the traffic to determine if it is legitimate or hostile. I would also continue to look at ways to reduce the spurious "noise" in your files as this will reduce the IO load and help you to focus on the problem traffic. Maybe there is a way to reduce the log level in the applications compared to your current settings.
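
To see who is behind the connections, one rough approach is to count them per remote address. This is only a sketch: conns_per_remote is a made-up helper name, and it assumes netstat-style lines where the fifth column is the foreign address.

```shell
#!/bin/sh
# Count TCP connections per remote address from `netstat -tn` style text
# on stdin. conns_per_remote is a hypothetical helper name.
conns_per_remote() {
    awk '$1 ~ /^tcp/ {
             addr = $5
             sub(/:[0-9]+$/, "", addr)   # drop the trailing :port
             sub(/^::ffff:/, "", addr)   # unwrap IPv4-mapped addresses
             count[addr]++
         }
         END { for (a in count) print count[a], a }' |
    sort -k1,1nr -k2,2
}

# Typical use on the server (top 20 remote addresses):
#   netstat -tn | conns_per_remote | head -20
```

A handful of addresses holding most of the connections points at crawlers or a flood; a long flat list looks more like ordinary visitors.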

gcat · 04-11-2012, 12:22 PM

Hi noway2

Thanks again !

The wait IO load reached unprecedented levels, and crashed the server for nearly 24 hours ! I can't stand it anymore ...

I did the "lsof -Pwn" but there are so many lines, they can't be displayed in one screen.
I copied them (did it several times) to try to get most of it on these two links :
http://freetexthost.com/snri65iceb and
http://freetexthost.com/owfm3ft44n

This seems a huge amount of files read, but I don't know how to interpret that ...

For information about the size of my websites, I have approximately 1 million pages views per month, and the database weighs several hundreds of Mb (in less than 200 tables), so I think it's still a small size website.

- Netstat :
The ftp connection was indeed mine ! (what would you recommend to upload my files to the server apart from FTP then ?)
I don't understand what are the stream connections from Master related to Postfix. How can I check if this is normal ? We indeed have lots of emails accounts (30), and maybe tens of thousands of emails, for information.
And yes, I log into SSH as root, it's not ok ! I will try to do as you say.

The netstat gave me the following :

Quote:

[root@s15954151 ~]# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
12 ESTABLISHED
8 FIN_WAIT1
5 FIN_WAIT2
2 LAST_ACK
19 LISTEN
24 SYN_RECV
1238 TIME_WAIT

And here is the full list :
http://freetexthost.com/cczydl3pxm

How could I make sure the traffic is legitimate ? I looked for info about DOS attacks, but don't know how to do !
For example, how could I ban the ip related to the "crawl-66-249-72-47" line that often appears in my netstat ?

Thanks in advance !

Noway2 · 04-13-2012, 07:21 AM

Gcat, I am writing to follow up with you on this subject (it would be better to keep the discussion in the forum rather than in emails to the extent possible).

My understanding is that you have implemented some filters to remove the baiduspider and while the IO wait is still high, your server is operational. Some of the other crawlers should respond to a robots.txt file, and these should be easier to limit. If your servers are functional, I would really recommend adding these filters slowly; make one change and watch for a difference. The netstat command shows you which connections are active and which have terminated recently along with the information about who or what was connected. What we noticed was that you had a lot of connections from spiders in a time-wait on closing. Per TCP protocol, the connection can remain in this zombie state for up to 4 minutes. While your netstat output didn't include time stamps, the fact that these connections can remain for 4 minutes says that these connections were all recent.

Once we get the crawlers under control, we can evaluate where things stand and determine if we need to make further adjustments.
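
As a sketch of the robots.txt idea, the snippet below writes a temporary file into a scratch directory so it can be reviewed before copying it to the site. DOCROOT and the delay value are placeholders. Crawl-delay is a non-standard directive; as far as I know Googlebot ignores it, so treat that part as a hint only.

```shell
#!/bin/sh
# Write a temporary robots.txt into a scratch directory for review.
# DOCROOT is a placeholder; point it at the site's real document root.
DOCROOT="${DOCROOT:-$(mktemp -d)}"

cat > "$DOCROOT/robots.txt" <<'EOF'
# Temporary rules while diagnosing load; remove once things settle.
User-agent: Baiduspider
Disallow: /

# Crawl-delay is non-standard; some crawlers honour it, others ignore it.
User-agent: *
Crawl-delay: 10
EOF

cat "$DOCROOT/robots.txt"
```

Crawlers only re-read robots.txt periodically, so any effect will not be immediate, which is another reason to change one thing at a time.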

I also wanted to discuss a couple of other sub-topics that came up, like alternatives to FTP. The best recommendation would be SCP, which you had mentioned having a little bit of difficulty with. The "trick" to SCP is to recognize that it can be used to either push files to or pull files from a server and the command syntax is "from | to". The remote connection is then coupled with SSH syntax. So for example, to upload file XYZ to your server you could:

Code:

scp <path-to-file>/XYZ you@remote-host.com:/<path-to-destination>/

Note the colon ( : ) between the SSH login and the path. Similarly, to pull XYZ from your server you could use

Code:

scp you@remote-host.com:/<path-to-file>/XYZ .
or 
scp you@remote-host.com:/<path-to-file>/XYZ /path-to-destination/

where the first one uses '.' to represent the current directory.

With regards to analyzing connections like "crawl-66-249-72-47", the first thing you can do is run an nslookup query on the IP address, in this case 66.249.72.47, which is crawl-66-249-72-47.googlebot.com, Google being a legitimate search engine that is likely indexing your site. One other thing you can watch for is the rate of connections, as a human will not usually stream multiple requests within a few seconds or continuously.
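
To put a number on the rate of connections, here is a minimal sketch that counts requests per client per minute from an Apache access log. hits_per_minute is a made-up name, and it assumes the common/combined log format shown earlier in the thread (client address in field 1, timestamp in field 4).

```shell
#!/bin/sh
# Requests per client per minute, from Apache common/combined log lines
# on stdin. hits_per_minute is a hypothetical helper name.
hits_per_minute() {
    awk '{
             minute = substr($4, 2, 17)   # dd/Mon/yyyy:hh:mm
             count[minute " " $1]++
         }
         END { for (k in count) print count[k], k }' |
    sort -k1,1nr -k2
}

# Typical use (the log path is a placeholder):
#   hits_per_minute < /path/to/access_log | head -20
```

Dozens of hits per minute from one address, minute after minute, is not a person clicking through pages; a reverse lookup on that address then tells you whether it is a known crawler.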

gcat · 04-16-2012, 10:01 AM

Thanks for your help, but that didn't work well.

I added lots of rules, as Unspawn advised :

Quote:

iptables -I INPUT -s 202.108.0.0/16 -j DROP # AS4808 / CHINAnet
iptables -I INPUT -s 61.135.0.0/16 -j DROP # AS4808 / CHINAnet
iptables -I INPUT -s 220.181.0.0/16 -j DROP # AS4808 / CHINAnet
iptables -I INPUT -s 123.125.64.0/18 -j DROP # AS4808 / CHINAnet
iptables -I INPUT -s 220.181.96.0/19 -j DROP # AS4808 / CHINAnet
iptables -I INPUT -s 61.135.160.0/21 -j DROP # AS4808 / CHINAnet
iptables -I INPUT -s 220.181.32.0/19 -j DROP # AS23724 / CHINAnet
iptables -I INPUT -s 61.208.0.0/16 -j DROP # AS4713 / CHINAnet
# Listed elsewhere, not verified:
iptables -I INPUT -s 119.63.192.0/21 -j DROP
iptables -I INPUT -s 123.122.0.0/20 -j DROP

By the way, I tried to add

Quote:

-A INPUT -m tcp --dport 80 -m hashlimit --hashlimit 10/s --hashlimit-burst 15 --hashlimit-mode srcip,dstport --hashlimit-name HTTP -j LOG --log-prefix "HTTP_limiter "
-A INPUT -m tcp --dport 80 -m hashlimit --hashlimit 10/s --hashlimit-burst 15 --hashlimit-mode srcip,dstport --hashlimit-name HTTP -j ACCEPT

as Unspawn suggested by email, but I don't know where to input it !
I tried

Quote:

/bin/iptables -A INPUT -m tcp --dport 80 -m hashlimit --hashlimit 10/s --hashlimit-burst 15 --hashlimit-mode srcip,dstport --hashlimit-name HTTP -j ACCEPT

but it returns "iptables: Invalid argument. Run `dmesg' for more information." How should I use it ?

Anyway, Baidu seems to have disappeared, but that didn't solve the "high %wait IO problem".
On the contrary it went completely crazy, apparently constantly reaching 100% and preventing me from accessing the server via http or even SSH. I had no access to my websites, to Plesk, or to the server via Putty.

The only thing I could do was restart the server (that would then live for a few minutes before being choked by %wa and being inaccessible again), or launch the 'Rescue mode'.
As I thought this "%wait IO problem" came from a problem on my system, I completely reinstalled my websites : downloaded the content of the server this weekend, erased everything and reinstalled CentOS 6 + Plesk. After several days struggling to get all my files back (and it's not finished yet) ... the same problem occurs !! As soon as I recreated the MySQL database and their users, the %wait IO reaches 100% and slows down everything, so that I can't access Plesk or the server via SSH anymore.
Even worse, I lost several files in the process of transferring my website, and am still struggling to retrieve all my emails ... I am desperate ...

Now, the only possibility for me to work on the server is to restart the server and as soon as it's started, shut down mysql with 'service mysqld stop' via Putty.

So I am afraid this problem can't be solved with small changes anymore. I would rather try to block everything, and gradually let a few connections through, to see at which point the %wait IO suddenly reaches 100%.
But I am afraid administering a server is too complicated for me ... after the failure of the reinstall, the only really easy solution I see is to go back to a VPS on HostGator, where I had real support and fewer problems... :-(

But if you have any ideas for more "aggressive" solutions, or tests to run to better pinpoint the problem, I am willing to try !

Note : as for SCP, I am now using WinSCP; I understand it works much like the 'scp' command ?

Linux_Kidd · 04-16-2012, 10:33 AM

what do the sar stats say, or even top?
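
If sar is not installed, the same iowait figure can be derived from /proc/stat. Below is a minimal sketch, not a replacement for sar or top: `iowait_pct` is a made-up helper name, and it assumes the usual layout of the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# iowait_pct: share of CPU time spent in iowait between two samples of the
# aggregate "cpu" line from /proc/stat (hypothetical helper, for illustration).
# Columns after the "cpu" label: user nice system idle iowait irq softirq steal
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal; guest columns are already included in user/nice.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6          # fifth value after the label is iowait
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / dt
        }'
}

# Usage: take two samples a few seconds apart, then compare them.
#   a=$(grep "^cpu " /proc/stat); sleep 5; b=$(grep "^cpu " /proc/stat)
#   iowait_pct "$a" "$b"
```

A value that stays close to 100 while the disks are busy matches the "%wa" column that top reports.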

Noway2 · 04-16-2012, 10:41 AM

Quote:

the same problem occurs !! As soon as I recreated the MySQL database and their users, the %wait IO reaches 100% and slows down everything, so that I can't access Plesk or the server via SSH anymore.

I am wondering if there isn't something else at work here. Your statement that the system went into high I/O wait "as soon as you recreated the database and their users" stands out. If you stop Apache, or even put up a firewall to block everything except your admin traffic, does it remain in high I/O wait? If this seems to have a direct impact, we could try rate limiting connections to your web server as a temporary diagnostic. What I am getting at is that we have been assuming this is caused by connection activity to your server, but I am starting to wonder.

I also did a little Googling regarding high I/O wait, MySQL, and CentOS 6 and got a fairly large number of potentially relevant hits. I will admit that this level of database tuning is beyond the level of expertise where I could say "run this test, do xyz, etc", but it looks like there are some parameters related to the key size that you may want to look into.

Thinking along these lines, did this problem appear after a certain point in time that you can define, such as after a particular update or something that may help narrow down what we are looking for?
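
For the "block everything except your admin traffic" test, something along these lines could work as a temporary lockdown. It is only a sketch under assumptions: `lockdown_rules` is a made-up helper, 203.0.113.10 is a placeholder address to replace with your own, and 8443 is assumed to be the Plesk panel port on your install. It prints the commands instead of running them, so you can read them before applying anything.

```shell
#!/bin/sh
# lockdown_rules ADMIN_IP: print iptables commands that let only one admin
# address reach SSH and the Plesk panel, and drop all other incoming traffic.
# Hypothetical helper; nothing is applied until the output is piped to sh as root.
lockdown_rules() {
    admin=$1
    cat <<EOF
iptables -I INPUT 1 -i lo -j ACCEPT
iptables -I INPUT 2 -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -I INPUT 3 -p tcp -s $admin --dport 22 -j ACCEPT
iptables -I INPUT 4 -p tcp -s $admin --dport 8443 -j ACCEPT
iptables -I INPUT 5 -j DROP
EOF
}

# Placeholder address (documentation range); override with ADMIN_IP=your.ip
lockdown_rules "${ADMIN_IP:-203.0.113.10}"
```

To apply, pipe the output to `sh` as root from a session that comes from the admin address. The rules are inserted at the top of INPUT, so the existing rules are left in place behind the DROP; on CentOS 6, `service iptables restart` should reload the saved rule set and undo the lockdown. If the I/O wait stays high with everything else blocked, outside connections are probably not the cause.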

gcat · 04-17-2012, 10:17 AM

Ok ... now that's really weird !
After 4-5 days down, and at the point where I thought about quitting everything, my website is finally perfectly fine !! (or at least it seems for now).

So, to summarize everything (if it can be of any help for others ...) :
- as soon as I started the server, it got overwhelmed by a huge 'wait IO' load that blocked any other connection, less than 30 seconds after the server was rebooted.
- at first, the only means I had to control the server was to restart it (from my 1&1 client panel) and immediately stop MySQL with service mysqld stop
- from this state, I noticed that the wait IO did not increase (stayed at 0%) when I renamed the directory containing my main database, /var/lib/mysql/nameofmydatabase . So this meant the problem was in my tables !
- so I manually took all my tables (.frm, .MYD, .MYI) out of the folder of my database, and put them back one by one, to see at what point the 'wait IO' would increase dramatically
- oddly enough, the previous 'huge increase of wait IO' didn't clearly happen again, but I saw there was a problem with one of my tables that seemed highly corrupted. I repaired it, transferred the rest of the tables back to the folder of my database, and everything was right !!
(and at this moment you feel really stupid, as I lost nearly a week trying anything that came to my mind, reformatting the server, losing lots of data, ...)
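
For others who land here with the same symptoms: the one-by-one check above could probably be shortened by letting myisamchk test every MyISAM table while mysqld is stopped. This is a rough sketch, not what I actually ran; `check_myisam` and the `MYISAMCHK` override are names made up for the example, and it assumes myisamchk exits with a non-zero status when a table is damaged.

```shell
#!/bin/sh
# check_myisam DIR: run a read-only check on every MyISAM index file in DIR
# and print the tables that fail. Stop mysqld first so the files are not in use.
# MYISAMCHK is an optional override (hypothetical); default is the myisamchk tool.
check_myisam() {
    dir=$1
    for idx in "$dir"/*.MYI; do
        [ -e "$idx" ] || continue       # directory contains no MyISAM tables
        if ! "${MYISAMCHK:-myisamchk}" --silent --check "$idx" >/dev/null 2>&1; then
            echo "needs repair: $(basename "$idx" .MYI)"
        fi
    done
}

# Example, using the placeholder database name from the post:
#   service mysqld stop
#   check_myisam /var/lib/mysql/nameofmydatabase
# After copying the .frm/.MYD/.MYI files somewhere safe, a flagged table can be
# repaired with:  myisamchk --recover /var/lib/mysql/nameofmydatabase/<table>.MYI
```

With the server running, `mysqlcheck --check` on the database, or CHECK TABLE / REPAIR TABLE from the mysql client, should do a similar job.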

To answer you, Noway: yes, I can pinpoint the beginning of this mess precisely : as we see on the "CPU usage" image in my first post (http://img27.imageshack.us/img27/3989/imagesga.png), the wait IO started to increase for no apparent reason on approximately April 1st. However, I can't connect it to any particular event. The peaks that followed, regularly for one week and then increasingly often until the server crashed completely, were not related to any event either.
I did however notice a high wait-IO last month when a particular file was read by cron, as this caused intense work for MySQL.

Anyway, I'm tired of this nonsense, none of this makes sense to me haha, but it finally seems that my CPU load is stable, with the %wait IO below 5% !
Thanks again for all your help, and even if it did not help to solve the problem, you taught me a lot, which I will use to make my server 'cleaner' and to better manage it !