LinuxQuestions.org - Server load gets really high...

I've done some reading about how to interpret the stats that the top command gives you, and I am fairly confident that my problem is an I/O problem, since the wa value is generally in the 90%+ range when my server load goes through the roof.

I then used vmstat and ifconfig to see whether it was a disk problem and/or a network problem, but I'm not sure what counts as "high values" when I look at this data.

vmstat

Code:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1 1034092  20608   4536  94468    5    3   214    53    8    7  5  1 92  3  0

I am pretty sure the bi and bo values are the ones I need to be interested in. Granted, this output wasn't captured during the high server load, so I am going to use it as a baseline for now. But what would be considered high? If the values were twice as high as this, would that be a problem?
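A note on reading these numbers: when vmstat is run without an interval, the single report it prints gives averages since the last reboot, so it makes a poor baseline for a load spike. That would be consistent with the second vmstat output later in this post showing almost the same bi and bo values even though the load was far higher. A minimal sketch of a filter that keeps only the interval samples follows; the function name live_samples is made up for illustration, and the column positions assume the 17-column layout shown above.

```shell
# Filter for `vmstat INTERVAL COUNT` output read from stdin.
# Skips header lines and the first data row (averages since boot), then
# prints blocked processes, blocks in, blocks out and I/O wait per sample.
# Assumes the column order: r b swpd free buff cache si so bi bo in cs us sy id wa st
live_samples() {
    awk '$1 ~ /^[0-9]+$/ {
        # seen is 0 for the first data row, so that row is dropped
        if (seen++) print "b=" $2, "bi=" $9, "bo=" $10, "wa=" $16
    }'
}
```

It could be used as `vmstat 5 6 | live_samples`, which takes a report every five seconds. There is no universal threshold for bi and bo; what matters is how the sampled values during a spike compare with the values when the server is idle.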

ifconfig

Code:

eth0      Link encap:Ethernet  HWaddr 00:30:48:B8:E5:04

          inet addr:64.34.170.212  Bcast:64.34.170.255  Mask:255.255.255.192

          inet6 addr: fe80::230:48ff:feb8:e504/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:516255425 errors:2 dropped:18 overruns:0 frame:2

          TX packets:802790881 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:100

          RX bytes:2195224972 (2.0 GiB)  TX bytes:2843031510 (2.6 GiB)

          Memory:d0200000-d0220000



eth0:1    Link encap:Ethernet  HWaddr 00:30:48:B8:E5:04

          inet addr:64.34.214.184  Bcast:64.34.214.255  Mask:255.255.255.0

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          Memory:d0200000-d0220000



lo        Link encap:Local Loopback

          inet addr:127.0.0.1  Mask:255.0.0.0

          inet6 addr: ::1/128 Scope:Host

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:3467295 errors:0 dropped:0 overruns:0 frame:0

          TX packets:3467295 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:1000808432 (954.4 MiB)  TX bytes:1000808432 (954.4 MiB)

This is a little more complicated, but I think I am looking for the RX packets and TX packets values, which are currently 516255425 and 802790881 respectively. Just looking at those numbers, one would assume they are extremely high. However, my server load at the time of this output was only around 0.70, with wa at 20%.
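The packet and byte figures in ifconfig are cumulative counters, so a single reading says little about current traffic; the difference between two readings does. The sketch below shows one way to compute that difference. The function names counter_delta and rx_packets are hypothetical, and the wrap handling assumes a 32-bit counter that wrapped exactly once, which is presumably why TX bytes is lower in the later ifconfig output in this post (859854741) than in the one above (2843031510).

```shell
# Difference between two readings of a cumulative counter.
# $1: earlier reading, $2: later reading.
# Assumption: if the later reading is smaller, the counter is 32 bits wide
# and wrapped exactly once between the two readings.
counter_delta() {
    delta=$(( $2 - $1 ))
    if [ "$delta" -lt 0 ]; then
        delta=$(( delta + 4294967296 ))
    fi
    echo "$delta"
}

# Read the cumulative RX packet count for an interface from sysfs.
# $1: interface name, for example eth0.
rx_packets() {
    cat "/sys/class/net/$1/statistics/rx_packets"
}
```

For a live rate, something like `before=$(rx_packets eth0); sleep 10; after=$(rx_packets eth0); counter_delta "$before" "$after"` gives the packets received in ten seconds. Between the two ifconfig outputs in this post RX packets rose by 874733, but since the elapsed time was not recorded no rate can be derived from that.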

Well this didn't take very long.

top

Code:

top - 15:16:55 up 27 days, 13:08,  2 users,  load average: 24.93, 16.97, 9.20

Tasks: 195 total,  1 running, 189 sleeping,  0 stopped,  5 zombie

Cpu(s):  1.2%us,  0.5%sy,  0.0%ni,  0.0%id, 97.7%wa,  0.2%hi,  0.5%si,  0.0%st

Mem:  1033652k total,  1021336k used,    12316k free,    5528k buffers

Swap:  2096472k total,  1160388k used,  936084k free,    90732k cached



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

28171 nobody    15  0 25388 8136 2076 S  1.0  0.8  0:00.17 httpd

 2528 mysql    15  0  164m  16m 2668 S  0.7  1.7 200:34.34 mysqld

26191 nobody    16  0 31876 7544 1528 D  0.3  0.7  0:07.69 spamd

28166 root      34  19 23364  10m 4616 D  0.3  1.1  0:00.17 yum-updatesd-he

28260 nobody    16  0 25204 7960 2168 D  0.3  0.8  0:00.07 httpd

28265 nobody    15  0 24316 6940 2168 S  0.3  0.7  0:00.08 httpd

    1 root      15  0  2064  348  316 S  0.0  0.0  0:01.79 init

    2 root      RT  -5    0    0    0 S  0.0  0.0  0:00.73 migration/0

    3 root      34  19    0    0    0 S  0.0  0.0  0:00.62 ksoftirqd/0

    4 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/0

    5 root      RT  -5    0    0    0 S  0.0  0.0  0:04.93 migration/1

    6 root      34  19    0    0    0 S  0.0  0.0  0:04.31 ksoftirqd/1

    7 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/1

    8 root      10  -5    0    0    0 S  0.0  0.0  0:00.01 events/0

    9 root      10  -5    0    0    0 S  0.0  0.0  0:00.04 events/1

  10 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 khelper

  11 root      17  -5    0    0    0 S  0.0  0.0  0:00.00 kthread

  15 root      10  -5    0    0    0 S  0.0  0.0  0:08.16 kblockd/0

  16 root      10  -5    0    0    0 S  0.0  0.0  0:01.03 kblockd/1

  17 root      14  -5    0    0    0 S  0.0  0.0  0:00.00 kacpid

  137 root      14  -5    0    0    0 S  0.0  0.0  0:00.00 cqueue/0

  138 root      14  -5    0    0    0 S  0.0  0.0  0:00.00 cqueue/1

  141 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 khubd

  143 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 kseriod

  213 root      10  -5    0    0    0 D  0.0  0.0  5:29.13 kswapd0

  214 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 aio/0

  215 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 aio/1

  374 root      11  -5    0    0    0 S  0.0  0.0  0:00.00 kpsmoused

  403 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 ata/0

  404 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 ata/1

  405 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 ata_aux

  409 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_0

  410 root      11  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_1

  411 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_2

  412 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_3

  419 root      12  -5    0    0    0 S  0.0  0.0  0:00.00 kstriped

  432 root      10  -5    0    0    0 D  0.0  0.0  3:00.81 kjournald

  458 root      10  -5    0    0    0 S  0.0  0.0  0:00.09 kauditd

  490 root      14  -4  2252  252  248 S  0.0  0.0  0:00.05 udevd

 1236 root      19  0  7428  740  628 S  0.0  0.1  0:00.62 authProg

vmstat

Code:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 20 1159932  14208   5284  96364    5    3   214    54    0    8  5  1 91  3  0

ifconfig

Code:

eth0      Link encap:Ethernet  HWaddr 00:30:48:B8:E5:04

          inet addr:64.34.170.212  Bcast:64.34.170.255  Mask:255.255.255.192

          inet6 addr: fe80::230:48ff:feb8:e504/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:517130158 errors:2 dropped:18 overruns:0 frame:2

          TX packets:804363987 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:100

          RX bytes:2256818114 (2.1 GiB)  TX bytes:859854741 (820.0 MiB)

          Memory:d0200000-d0220000



eth0:1    Link encap:Ethernet  HWaddr 00:30:48:B8:E5:04

          inet addr:64.34.214.184  Bcast:64.34.214.255  Mask:255.255.255.0

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          Memory:d0200000-d0220000



lo        Link encap:Local Loopback

          inet addr:127.0.0.1  Mask:255.0.0.0

          inet6 addr: ::1/128 Scope:Host

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:3471387 errors:0 dropped:0 overruns:0 frame:0

          TX packets:3471387 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:1001311451 (954.9 MiB)  TX bytes:1001311451 (954.9 MiB)

I don't see a problem, can someone point it out to me?

I also just realized that yum is doing updates. Could that cause a large server load? I'd like to turn that off if so.

At the time of the second measurement the load average was 24.93, yet no application was apparently maxing out RAM or CPU. With 1GB of swap in use and a 97.7% wait state, you have to search for the bottleneck in a different way. Rebooting the machine returns the system to a "known good" state; then running 'atop', storing data continuously and over a longer period, could help to trace back peaks and narrow them down to processes more easily. (Also see 'dstat', 'collectl', 'atsar', SAR.) It would also be interesting to know more about the HW and SW (mainly services) specs, any anomalies in system or daemon logs, and whether this behaviour started at some point (SW installation? updates? configuration changes?).
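Until one of those tools is installed, continuous collection can be roughly approximated with a few lines of shell. This is only a sketch, not a replacement for atop or sar: the function name log_load and the log path in the usage note are made up for illustration, and it records nothing but the three load averages.

```shell
# Append one timestamped load sample to a log file.
# $1: log file to append to
# $2: file to read the load averages from (defaults to /proc/loadavg)
log_load() {
    src=${2:-/proc/loadavg}
    # /proc/loadavg starts with the 1-, 5- and 15-minute load averages;
    # the remaining fields are read into "rest" and ignored.
    read -r one five fifteen rest < "$src"
    printf '%s %s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$one" "$five" "$fifteen" >> "$1"
}
```

Called once a minute, for example with `while sleep 60; do log_load /var/tmp/load.log; done`, it leaves a record that shows when the peaks happen, which can then be matched against cron jobs and daemon logs.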

See those status "D" tasks? They are all counted in loadavg.
And they are probably all waiting on disk I/O. Looks like you have an under- or badly configured disk farm. Either get some more devices or manage the things that are going to exacerbate the situation. Don't run a yum update at the same time as updatedb, say ...

Well, I attempted to reboot the server, but it had a difficult time coming back up. When it finally did, it took forever for me to log in. Once I was logged in, the server load was already at 0.54, 2.21, 1.35, so something is definitely wrong here. Then the server suddenly went down for another reboot. (After a few minutes of the server not coming back up, I had gone to my data center's control panel and initiated a reboot from there, so I think that request was just delayed.) Now I am waiting for it to come back online again.

Server came back online and the server load is still high.

Code:

top - 00:46:18 up 9 min,  2 users,  load average: 2.49, 3.34, 1.59

Tasks: 145 total,  1 running, 139 sleeping,  0 stopped,  5 zombie

Cpu(s):  0.0%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st

Mem:  1033652k total,  842604k used,  191048k free,    27480k buffers

Swap:  2096472k total,        0k used,  2096472k free,  477972k cached



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

 4440 root      15  0  2196 1060  808 R  0.3  0.1  0:00.01 top

    1 root      15  0  2064  640  548 S  0.0  0.1  0:00.41 init

    2 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 migration/0

    3 root      34  19    0    0    0 S  0.0  0.0  0:00.00 ksoftirqd/0

    4 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/0

    5 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 migration/1

    6 root      34  19    0    0    0 S  0.0  0.0  0:00.00 ksoftirqd/1

    7 root      RT  -5    0    0    0 S  0.0  0.0  0:00.00 watchdog/1

    8 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 events/0

    9 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 events/1

  10 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 khelper

  11 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 kthread

  15 root      10  -5    0    0    0 S  0.0  0.0  0:00.02 kblockd/0

  16 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 kblockd/1

  17 root      14  -5    0    0    0 S  0.0  0.0  0:00.00 kacpid

  137 root      14  -5    0    0    0 S  0.0  0.0  0:00.00 cqueue/0

  138 root      15  -5    0    0    0 S  0.0  0.0  0:00.00 cqueue/1

  141 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 khubd

  143 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 kseriod

  211 root      18  0    0    0    0 S  0.0  0.0  0:00.00 pdflush

  212 root      15  0    0    0    0 S  0.0  0.0  0:00.00 pdflush

  213 root      10  -5    0    0    0 S  0.0  0.0  0:01.02 kswapd0

  214 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 aio/0

  215 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 aio/1

  373 root      11  -5    0    0    0 S  0.0  0.0  0:00.00 kpsmoused

  403 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 ata/0

  404 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 ata/1

  405 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 ata_aux

  409 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_0

  410 root      11  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_1

  411 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_2

  412 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 scsi_eh_3

  419 root      13  -5    0    0    0 S  0.0  0.0  0:00.00 kstriped

  432 root      10  -5    0    0    0 S  0.0  0.0  0:00.16 kjournald

  458 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 kauditd

  490 root      18  -4  2252  672  404 S  0.0  0.1  0:00.05 udevd

 1483 root      20  -5    0    0    0 S  0.0  0.0  0:00.00 kmpathd/0

 1484 root      20  -5    0    0    0 S  0.0  0.0  0:00.00 kmpathd/1

 1485 root      20  -5    0    0    0 S  0.0  0.0  0:00.00 kmpath_handlerd

 1583 root      11  -5    0    0    0 S  0.0  0.0  0:00.00 kjournald

 1795 root      0 -20    0    0    0 S  0.0  0.0  0:00.01 loop0

 1796 root      10  -5    0    0    0 S  0.0  0.0  0:00.00 kjournald

 2073 root      15  -4 12516  768  576 S  0.0  0.1  0:00.00 auditd

Code:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b    swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0       0 191692  27548 477972    0    0  4054    63  692  376  8  2 50 40  0

Code:

eth0      Link encap:Ethernet  HWaddr 00:30:48:B8:E5:04

          inet addr:64.34.170.212  Bcast:64.34.170.255  Mask:255.255.255.192

          inet6 addr: fe80::230:48ff:feb8:e504/64 Scope:Link

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          RX packets:7945 errors:0 dropped:3867 overruns:0 frame:0

          TX packets:7360 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:100

          RX bytes:4485177 (4.2 MiB)  TX bytes:4272684 (4.0 MiB)

          Memory:d0200000-d0220000



eth0:1    Link encap:Ethernet  HWaddr 00:30:48:B8:E5:04

          inet addr:64.34.214.184  Bcast:64.34.214.255  Mask:255.255.255.0

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

          Memory:d0200000-d0220000



lo        Link encap:Local Loopback

          inet addr:127.0.0.1  Mask:255.0.0.0

          inet6 addr: ::1/128 Scope:Host

          UP LOOPBACK RUNNING  MTU:16436  Metric:1

          RX packets:216 errors:0 dropped:0 overruns:0 frame:0

          TX packets:216 errors:0 dropped:0 overruns:0 carrier:0

          collisions:0 txqueuelen:0

          RX bytes:35596 (34.7 KiB)  TX bytes:35596 (34.7 KiB)

Maybe I don't understand, but the commands atop, dstat, collectl, atsar did not work.

I do have more than one disk on the server, a 500GB primary and 250GB secondary.

My hardware is:

Intel Core2Duo E6750 DC
1GB DDR2 667
250GB SATA HDD
500GB SATA HDD

My software is:
CentOS 5.3
cPanel 11.24.5-R38506 - WHM 11.24.2 - X 3.9
Along with those, I also host two 10-person Unreal Tournament servers on the machine (they hardly ever have any players) and a TeamSpeak 3 server (which hasn't seen any activity this month).

Try this from a terminal and post the (full) output

Code:

top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'

Code:

root@server2 [~]# top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'

top - 01:16:03 up 39 min,  2 users,  load average: 0.11, 0.15, 0.35

Tasks: 161 total,  1 running, 155 sleeping,  0 stopped,  5 zombie

Cpu(s):  3.0%us,  0.6%sy,  0.0%ni, 84.7%id, 11.5%wa,  0.0%hi,  0.2%si,  0.0%st

Mem:  1033652k total,  990812k used,    42840k free,    19832k buffers

Swap:  2096472k total,        0k used,  2096472k free,  571940k cached



  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

Total status D:

Will make a note that the server load right now is only 0.19, 0.17, 0.34

Run it sometime when the numbers - particularly the short (1-minute) one - are on the high side.

Quote:

Originally Posted by syg00 (Post 3895355)

Run it sometime when the numbers - particularly the short (1-minute) one - are on the high side.

I will do this. I'm thinking this script just counts how many processes are in the "D" state, since you mentioned that processes in the "D" state are all included in the system load. Right?

Yep - merely (circumstantial) evidence; but might help.
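One detail of the run above: the total came out blank rather than 0 because an awk variable that was never incremented prints as an empty string. A variant that reads ps output instead of top and always prints a number could look like the sketch below. The function name count_d_state is made up for illustration, and the input is assumed to have the state in the first column.

```shell
# Count processes in uninterruptible sleep (state "D") from ps output on
# stdin. Expects the process state in the first column; prints each
# matching line and then a total, which is 0 rather than blank when
# nothing matches.
count_d_state() {
    awk '$1 ~ /^D/ { print; n++ }
         END { print "Total status D: " (n + 0) }'
}
```

Typical use would be `ps -eo state,pid,user,comm | count_d_state`. Like the top version, this is a snapshot, so it only tells you something when it is run while the load is high.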

Something I have just noticed: when I log into the server through SSH/PuTTY, it takes FOREVER. The "Login as:" text pops up instantly, I enter my username, and the password prompt appears immediately. But after I enter my password, it takes a really, really long time before the login goes through, at least a minute to a minute and a half.

Usually it logs in faster than I can type.

Quote:

Originally Posted by Skillz (Post 3895321)

the commands atop, dstat, collectl, atsar did not work.

That's because you have to install them before you can use them. They should be in the default CentOS repo or else RPMForge or EPEL.

- Are the two UT servers and the TS3 server the only publicly accessible services running? If not, what other services mainly run?
- Is cPanel (and maybe related paths on the server like /phpmyadmin?) only accessible from your management IP or IP range?
- Do the system or daemon logs show any "odd" lines involving 'links', 'wget' or any network tools?
- Are there by any chance oddly named files in your /tmp, /var/tmp or Apache docroot?
- Did this load problem start right from using the server or at some point? If the latter, can you trace back what happened at that point in terms of HW changes, SW installation or updates, reconfiguration?

Quote:

Originally Posted by unSpawn (Post 3895953)

Yea, I realized that after I posted. I went Googling. Still not 100% sure on how to install them. I tried yum install atop but it didn't work.

No, the other service is an FTP server, the one that runs for cPanel. It also has a "public login" that is posted on one of my sites so people can upload specific files. I monitor it daily through logs that are emailed to me, showing who logs in and what they do. It doesn't really get much traffic.

Those things are only accessible through cpanel. You have to login to get to them.

Which logs should I look at for those messages? I ask because I often use wget to copy things to my server that are otherwise too large for me to download and then upload over FTP.
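One way to approach that question is to search a whole log directory for the tool names rather than picking individual files. The sketch below does that; the function name scan_logs is hypothetical, and which directory to pass is an assumption, since the thread does not say where this server keeps its Apache and FTP logs (/var/log is the usual place for system logs on CentOS).

```shell
# List log lines that mention wget or links as whole words.
# $1: directory to search recursively, for example /var/log.
# Output format is file:line-number:matching-line.
scan_logs() {
    grep -rnwE 'wget|links' "$1"
}
```

Legitimate use will show up too, so the interesting hits are the ones that do not match your own sessions, such as wget appearing inside a request URL in a web server access log.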

Files in my /tmp:
Bunch of files that look similar to this: sess_381b2d464edc56d83b9026b9fa50d0dc, then
.ICE-unix/
lost+found/
mysql.sock@
spamd-9952-init/

Looks like the same files in /var/tmp

Not sure where the apache doc root is?
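The document roots are declared with DocumentRoot directives in the Apache configuration, so they can be read from there. A minimal sketch follows, with a hypothetical function name doc_roots; the path to the config file is left as an argument because it is not given in this thread (running `httpd -V` prints the compiled-in HTTPD_ROOT and SERVER_CONFIG_FILE values). It assumes paths without spaces.

```shell
# Print the unique DocumentRoot paths declared in an Apache config file.
# $1: path to the config file (location varies by install).
# Commented-out directives are skipped because their first field is "#".
doc_roots() {
    awk 'tolower($1) == "documentroot" { gsub(/"/, "", $2); print $2 }' "$1" \
        | sort -u
}
```

On a hosting panel setup there is usually one DocumentRoot per virtual host, so expect a list rather than a single directory.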

No, the problem seems to happen every once in a while though it has seemed to become a bit more frequent. When I first got the server, I never noticed it. Then sometimes I'd notice the server load get really high, but then it would go away. I always assumed it was the Unreal Tournament servers (I had 5 running at one point plus a BF2 Demo server) but when I shut them down, the load didn't go away.

I am really, really thinking it might have something to do with Apache though. Not sure if it's a coincidence or not, but it seems that when the load is high and I shut down the httpd service the load goes back down. This doesn't explain why the server load is really high upon boot though.