LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 12-16-2008, 05:29 PM   #1
jdw52
LQ Newbie
 
Registered: Aug 2008
Posts: 26

Rep: Reputation: 15
Need help troubleshooting high load averages


I've got a Linux box that runs OTRS (our ticket system) and EchoVNC (our VNC server software). It has been running with no visible issues for 199 days.

Today I decided to expand the functionality of our ticket system and add email support. I noticed a weird issue in my maillog:

Code:
Dec 16 16:47:27 support sendmail[2532]: rejecting connections on daemon MTA: load average: 28
While researching this issue I found out that this is sendmail's way of not overloading a busy server. Of course, that gives me a much bigger issue to troubleshoot.

I ended up rebooting the server just to see if that would fix the issue. Unfortunately the load averages shot right up after the reboot:

Code:
[root@support linux-i386]# w
17:05:58 up 58 min,  1 user,  load average: 40.64, 35.82, 36.64
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
chchadmi pts/1    10.21.1.9        16:08    0.00s  0.14s  0.02s sshd: chchadmin [priv]
Here is what top gives me:

Code:
top - 17:03:29 up 55 min,  1 user,  load average: 55.34, 38.28, 37.79
Tasks:  99 total,   3 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.3%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1945372k total,   479352k used,  1466020k free,    66596k buffers
Swap:  8193128k total,        0k used,  8193128k free,   311316k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      15   0  2040  628  544 S  0.0  0.0   0:00.28 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 events/0
    6 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 khelper
    7 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kthread
   10 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
   11 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kacpid
  139 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 cqueue/0
  142 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 khubd
  144 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kseriod
  208 root      25   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  209 root      15   0     0    0    0 S  0.0  0.0   0:00.02 pdflush
  210 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kswapd0
  211 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 aio/0
  365 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 kpsmoused
  390 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 ata/0
  391 root      18  -5     0    0    0 S  0.0  0.0   0:00.00 ata_aux
  394 root      16  -5     0    0    0 S  0.0  0.0   0:00.01 scsi_eh_0
  395 root      12  -5     0    0    0 S  0.0  0.0   0:00.01 scsi_eh_1
  404 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 ksnapd
  407 root      10  -5     0    0    0 S  0.0  0.0   0:00.89 md1_raid1
  410 root      10  -5     0    0    0 S  0.0  0.0   0:00.07 md0_raid1
  411 root      19  -5     0    0    0 S  0.0  0.0   0:00.07 kjournald
  439 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kauditd
  473 root      20  -4  2220  636  384 S  0.0  0.0   0:00.43 udevd
  764 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kedac
 1620 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kmpathd/0
 1674 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kjournald
 1682 root      10  -5     0    0    0 S  0.0  0.0   0:00.21 kjournald
 2064 root      16  -3 12064  656  488 S  0.0  0.0   0:00.01 auditd
 2066 root      14  -3 10100 3924 2284 S  0.0  0.2   0:00.07 python
 2089 root      15   0  1700  588  496 S  0.0  0.0   0:00.08 syslogd
 2092 root      15   0  1652  400  336 S  0.0  0.0   0:00.00 klogd
 2125 rpc       24   0  1792  560  464 S  0.0  0.0   0:00.00 portmap
 2154 root      24   0  1804  728  624 S  0.0  0.0   0:00.00 rpc.statd
 2166 root      18   0  1804  296  204 S  0.0  0.0   0:00.00 mdadm

Disk utilization:

Code:
[root@support linux-i386]# iostat -d
Linux 2.6.18-53.el5 (support.xxxxx.xxx)   12/16/2008

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               8.41       109.31       177.24     395988     642056
sdb               8.24        78.93       177.24     285934     642056
md0              11.30       177.09        43.97     641498     159288
md1              16.32        10.00       124.39      36228     450608
dm-0              0.16         1.59         0.45       5746       1624
dm-1             16.14         8.24       123.94      29858     448984

I have plenty of disk space:

Code:
[root@support linux-i386]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               24G  1.9G   21G   9% /
/dev/mapper/VolGroup00-LogVol01
                       28G  476M   26G   2% /home
/dev/mapper/VolGroup00-LogVol00
                       90G  396M   85G   1% /var
tmpfs                 950M     0  950M   0% /dev/shm
 
Old 12-16-2008, 05:36 PM   #2
jdw52
LQ Newbie
 
Registered: Aug 2008
Posts: 26

Original Poster
Rep: Reputation: 15
If anyone could help guide me in my troubleshooting process, I'd appreciate it. I'm definitely not a Linux guru. I manage a support department on a shoestring budget, so Linux and open source are a godsend to me.

BTW, I'm mostly interested in refining my troubleshooting process. I looked at the load average and then at top and couldn't reconcile the two. There is an explanation for this, but I'm not experienced enough to spot it.
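A note for later readers on reconciling the two numbers: on Linux the load average is not CPU utilisation. It is a decaying average of the number of tasks that are either runnable (state R) or in uninterruptible sleep (state D, usually waiting on I/O), which is how a box can be 99% idle and still report a load of 40. A minimal sketch of that comparison follows; the function names are made up for illustration, and the usage line assumes a procps-style ps.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not from the thread.

# Print the 1-minute load average (first field of /proc/loadavg).
# An alternative file can be passed as $1.
load1() {
    cut -d' ' -f1 "${1:-/proc/loadavg}"
}

# Read one task-state code per line on stdin and count the tasks that
# contribute to the Linux load average: runnable (R) and
# uninterruptible sleep (D).
count_load_tasks() {
    awk '$1 ~ /^[RD]/ { n++ } END { print n + 0 }'
}

# Usage on a live system (-L lists threads, because the kernel counts
# tasks, not processes):
#   echo "load1=$(load1) R+D now=$(ps -eLo state= | count_load_tasks)"
```

If the instantaneous R+D count stays far below the reported load average over many samples, something is counting tasks that a single snapshot does not show.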
 
Old 12-16-2008, 06:21 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,491

Rep: Reputation: 1077
Bumping your own thread after 7 minutes won't help - it'll likely just piss people off.
The high loadavg doesn't appear to be hurting - but obviously if sendmail won't do its job, that's an issue. Run this for some better info:
Code:
top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'
 
Old 12-17-2008, 08:29 AM   #4
jdw52
LQ Newbie
 
Registered: Aug 2008
Posts: 26

Original Poster
Rep: Reputation: 15
It wasn't my intention to "bump" my own post. I was simply breaking my comments out into a separate post because the first was so long. In the future I'll keep everything in one post.

Here is the result of the command you asked me to run. I don't think this is everything you were looking for?

Code:
[root@support chchadmin]# top -b -n 1 | awk '{if (NR <=7) print; else if ($8 == "D") {print; count++} } END {print "Total status D: "count}'
top - 08:09:42 up 16:02,  1 user,  load average: 59.62, 44.35, 42.82
Tasks:  96 total,   2 running,  94 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.7%us,  0.3%sy,  0.0%ni, 96.9%id,  0.1%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   1945372k total,   619092k used,  1326280k free,   107700k buffers
Swap:  8193128k total,        0k used,  8193128k free,   375456k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
Total status D:
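The blank after "Total status D:" is awk printing a variable that was never incremented, so it means zero tasks were in state D at that instant. A variant that prints an explicit 0 and also tallies state R is sketched below. It assumes the same top column layout as above (state in field 8, seven summary and header lines), and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "top -b -n 1" output on stdin, echoes the
# summary block, then lists and counts tasks in state D and state R.
summarize_top_states() {
    awk '
        NR <= 7   { print; next }   # summary lines plus the column header
        $8 == "D" { print; d++ }    # uninterruptible sleep
        $8 == "R" { print; r++ }    # runnable
        END { printf "Total status D: %d\nTotal status R: %d\n", d, r }
    '
}

# Usage:
#   top -b -n 1 | summarize_top_states
```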
 
Old 12-17-2008, 04:59 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,491

Rep: Reputation: 1077
Now that is starting to get interesting. Been a while since I looked at the code that generates the loadavg numbers, but I reckon that shouldn't be happening. I wonder if that I/O load is "bursty" and we just missed any waiting tasks.
I'll think on it some more.
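One way to test the "bursty" idea is to sample task states repeatedly instead of once. A rough sketch, assuming a procps-style ps; the function name, sample count, and interval are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: take COUNT samples, INTERVAL seconds apart, and
# print one line per sample with a timestamp and the number of threads
# in state R (runnable) and state D (uninterruptible sleep).
sample_states() {
    count=${1:-30}
    interval=${2:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        # -L lists threads, because the load average counts tasks
        # rather than processes.
        ps -eLo state= | awk -v ts="$(date +%H:%M:%S)" '
            $1 == "R" { r++ }
            $1 == "D" { d++ }
            END { printf "%s R=%d D=%d\n", ts, r, d }'
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Usage: 60 samples, one per second
#   sample_states 60 1
```

Even one-second sampling can miss very short bursts, so a run of all-zero D counts narrows things down without ruling I/O waits out entirely.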
 
Old 12-31-2008, 08:27 AM   #6
jdw52
LQ Newbie
 
Registered: Aug 2008
Posts: 26

Original Poster
Rep: Reputation: 15
I just wanted to update this in case someone comes across it in their search. I still haven't found a rhyme or reason for the very high load averages, but I did figure out how to adjust sendmail's configuration file so that the load averages wouldn't prevent it from doing its job. In sendmail.cf I raised the QueueLA and RefuseLA values so that they sit above the load average most of the time. I kept the RefuseLA value higher than QueueLA.
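For reference, both settings are "O" lines in sendmail.cf (for example "O RefuseLA=12"), and a leading "#" means the compiled-in default is in effect. A sketch for reading the active values and comparing them with the current load follows; the default path /etc/mail/sendmail.cf and the function names are assumptions, not something stated in this thread.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Print the active (uncommented) numeric value of a sendmail.cf option
# such as QueueLA or RefuseLA. Prints nothing if it is commented out
# or absent.
sendmail_opt() {
    opt=$1
    cf=${2:-/etc/mail/sendmail.cf}   # assumed location; varies by distro
    sed -n "s/^O[[:space:]]*${opt}=\([0-9][0-9]*\).*/\1/p" "$cf" | tail -n 1
}

# Succeed if the 1-minute load average is above the given limit.
# An alternative loadavg file can be passed as $2.
load_exceeds() {
    limit=$1
    load=$(cut -d' ' -f1 "${2:-/proc/loadavg}")
    awk -v l="$load" -v m="$limit" 'BEGIN { exit !(l > m) }'
}

# Usage:
#   r=$(sendmail_opt RefuseLA)
#   if [ -n "$r" ] && load_exceeds "$r"; then
#       echo "load is above RefuseLA=$r"
#   fi
```

If the file is generated from sendmail.mc, the corresponding macros are confQUEUE_LA and confREFUSE_LA. Raising the limits treats the symptom only; the load figure itself remains unexplained.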

So my ticket system is sending email out and life is good. I still see no noticeable performance issues with the server. However I'm building a replacement server so that I can migrate over my ticket system database in a pinch if need be.
 
Old 12-31-2008, 07:37 PM   #7
salasi
Senior Member
 
Registered: Jul 2007
Location: Directly above centre of the earth, UK
Distribution: SuSE, plus some hopping
Posts: 3,919

Rep: Reputation: 779
If you don't mind some baseless speculation...

This is weird (in other words, I don't understand it either); you have very high load averages, but your system is spending most of its time idle rather than running useful programs.

So, why isn't it using that high idle percentage to get on with the work in the load queue? I don't know, but if I had to guess, my guess would be that something in the initialisation isn't completing cleanly and for that reason it isn't getting on with subsequent jobs as it should. The trouble is, I can't see any reason why that should be.

My wild guess at this point might be something like a kernel upgrade that you (or a colleague) may have previously done and that you didn't really test thoroughly at the time. After all, the system might have appeared to be working, but unless you actually looked at the load averages once you had started the new kernel running, you wouldn't have seen this odd phenomenon.

Quote:
However I'm building a replacement server so that I can migrate over my ticket system database...
That does seem sensible.
 
  


All times are GMT -5. The time now is 07:34 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration