LinuxQuestions.org


Linux303 12-30-2002 12:21 PM

CPU Load ran away...NFS issue?
 
Hello all!

Here is my issue.

I have a web server that mounts /home from an Iomega NAS (IDE-based) via NFS. I made a mistake with rm -rf (I know, I know) and wiped out the data.

I have a second server that I had built, connected to a disk array. This server is loaded, all SCSI-based, and kicks butt. It is running software RAID, by the way. I had been waiting for a chance to move the data to this box anyway, so I copied my backups to this server, which I will call NAS#2. I then mounted /home from NAS#2.

No problems until I started the web sites. When I started them from the web server, the CPU load went through the roof! This is on the web server itself, not NAS#2. I brought all the sites back down and the load dropped back down. So I brought them up one at a time and saw the load jump 3-6 points per site. I hoped this would be a temporary thing, so I let it run. It has now been 12 hours and the load is in the 30s. It used to run in the 0.5-2 range.

Here are my numbers:

[root@root]# vmstat 1
procs                        memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0 33  0   1060  13812  47204 605368   0   0     2    24  416   230   2   1  97
0 33  0   1060  13812  47204 605368   0   0     0     0  207    42   1   0  99
0 33  0   1060  13812  47204 605368   0   0     0     0  203    30   0   1  99
0 33  0   1060  13812  47204 605368   0   0     0    16  213    50   0   2  98
0 33  0   1060  13812  47204 605368   0   0     0     0  209    40   1   0  99
1 33  0   1060  13248  47204 605368   0   0     0    56 3541  2311  25   5  70
0 33  0   1060  13836  47204 605368   0   0     0   288  230    68   4   1  95
0 33  0   1060  13836  47204 605368   0   0     0     0  195    26   1   1  98
0 33  0   1060  14236  47204 605368   0   0     0    36  212    51   0   1  99
0 33  0   1060  14236  47204 605368   0   0     0     0  204    30   1   1  98
0 33  0   1060  14232  47204 605368   0   0     0     0  229    66   0   1  99





Does anyone have any ideas on this?

RedHat 7.3 running 2.4.19

SlickWilly 12-30-2002 02:00 PM

Um.. I'm a bit confused here.

Is NAS2 your webserver now, i.e. did it replace NAS1? Or is your web server the same machine, and the only thing that's changed is your /home mount (now on a different machine)?

Linux303 12-30-2002 02:11 PM

Sorry about the confusion. The web server is the same machine. It now mounts /home from NAS#2 rather than NAS#1 (the Iomega).

stickman 12-30-2002 02:21 PM

Where are you getting your CPU load numbers from? Your vmstat data shows idle CPU in the high 90s.

SlickWilly 12-30-2002 02:29 PM

Oh ah..

Actually my fault, I re-read and it became clear.

So, we can rule out things like processor problems, disk (local) issues, and other issues involved with the local machine.

In effect, the only thing that has changed is the network link: the webserver now talks to machine 2 instead of machine 1.

Now, looking at your stats I see a heapload of blocked processes (the b in the second column, which counts processes in uninterruptible sleep). Basically, these processes are all waiting for something to get back to them before continuing on their merry way.
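
For what it's worth, on Linux the load average counts tasks in uninterruptible sleep (state D) along with runnable ones, so 33 blocked processes would fit a load in the 30s even with the CPU nearly idle. Here is a minimal Python sketch for listing such tasks, assuming a Linux /proc filesystem. The function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
import glob


def parse_stat_line(line):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' rather than on whitespace.
    left, right = line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, command) for tasks currently in state D."""
    blocked = []
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            continue  # process exited, or the file was not in the expected form
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling blocked_tasks() on the web server while the load is high should show which processes are stuck; if they are mostly httpd children touching /home, that points at the NFS mount rather than the CPU.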

Your memory / swap usage isn't changing at all, so we're not running into paging problems.

Your CPU idle is hanging around 97% (barring the one freak 70%), so we're not CPU bound either.

Your block I/O looks okay too, with a bit of a spike which I expect represents some disk writing after your period of freak activity.

Which leaves only your interrupts...

Lo and behold, you've got an average of around 200 or so, and then a spike of 3500 (which is why your CPU gets busy at the same time). This would represent a heapload of data hitting your machine at one time, after which... relatively little.

I'd like to watch it a bit more, see if it's a regular thing. But I'd bet my last donut that this is what's causing your blocked processes.
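
To watch it over a longer run without eyeballing every line, something like the following could summarize captured vmstat output. It is only a sketch: the column list matches the vmstat header posted above, the sample rows are copied from that post, and the spike_factor threshold of 5 times the median is an arbitrary assumption.

```python
COLUMNS = "r b w swpd free buff cache si so bi bo in cs us sy id".split()


def parse_vmstat(text):
    """Turn vmstat rows into dicts keyed by column name, skipping headers."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(COLUMNS) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def summarize(rows, spike_factor=5):
    """Report blocked-process count, mean idle CPU, and interrupt spikes."""
    interrupts = sorted(row["in"] for row in rows)
    median_in = interrupts[len(interrupts) // 2]
    return {
        "max_blocked": max(row["b"] for row in rows),
        "mean_idle": sum(row["id"] for row in rows) / len(rows),
        "median_in": median_in,
        # A sample counts as a spike if it is far above the typical rate.
        "spikes": [r["in"] for r in rows if r["in"] > spike_factor * median_in],
    }


# Rows copied from the vmstat output posted earlier in this thread.
SAMPLE = """\
r b w swpd free buff cache si so bi bo in cs us sy id
0 33 0 1060 13812 47204 605368 0 0 0 0 207 42 1 0 99
0 33 0 1060 13812 47204 605368 0 0 0 0 203 30 0 1 99
1 33 0 1060 13248 47204 605368 0 0 0 56 3541 2311 25 5 70
0 33 0 1060 13836 47204 605368 0 0 0 288 230 68 4 1 95
0 33 0 1060 13836 47204 605368 0 0 0 0 195 26 1 1 98
"""

summary = summarize(parse_vmstat(SAMPLE))
print(summary)
```

Feeding it a few minutes of vmstat 1 output would show whether the interrupt bursts recur at a regular interval or were a one-off.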

It could be a number of things. If you're *certain* nothing has changed on the webserver machine, I would bet that your NAS#2 is misconfigured.

I would also bet that it's something like your NAS#2's network card being forced into full duplex when your switch is only doing half duplex, or the other way around.

If I were you I'd load up a traffic monitor (look at iptraf, it's a wonderful tool) and watch the interaction between your machines.

Also check ethtool / mii-tool for the network card configuration and autonegotiation on the cards. I doubt it's a software issue, if only because your vmstat seems to indicate that it's not CPU, memory, or page file related...
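
As a rough illustration of what to look for in the ethtool output, here is a small Python sketch that pulls out the speed, duplex, and auto-negotiation lines and flags the suspicious combinations. The EXAMPLE text is illustrative only and was NOT captured from the machines in this thread, and the exact set of lines ethtool prints can vary with the driver and version, so treat the parsing as an assumption.

```python
def parse_ethtool(text):
    """Extract speed, duplex and auto-negotiation from `ethtool <iface>` output."""
    wanted = {"Speed": "speed", "Duplex": "duplex", "Auto-negotiation": "autoneg"}
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted:
            settings[wanted[key]] = value.strip()
    return settings


def duplex_warnings(settings):
    """Return human-readable hints for settings worth a second look."""
    hints = []
    if settings.get("duplex", "").lower() == "half":
        hints.append("link is running half duplex")
    if settings.get("autoneg", "").lower() == "off":
        hints.append("auto-negotiation is off; the switch port must be forced to match")
    return hints


# Illustrative output only, NOT captured from the machines in this thread.
EXAMPLE = """\
Settings for eth0:
        Speed: 100Mb/s
        Duplex: Half
        Auto-negotiation: off
        Link detected: yes
"""

print(duplex_warnings(parse_ethtool(EXAMPLE)))
```

The same check would need to be made on both the webserver and NAS#2, and compared against what the switch reports for each port.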

Slick.

Linux303 12-30-2002 04:16 PM

Here are the results of ifconfig eth0 for each box:

Webserver:
eth0      Link encap:Ethernet  HWaddr 00:02:B3:87:25:20
          inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:17508608 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5250884 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:4254432883 (4057.3 Mb)  TX bytes:998214958 (951.9 Mb)
          Interrupt:17


NAS #2:
eth0      Link encap:Ethernet  HWaddr 00:02:B3:87:25:26
          inet addr:192.168.0.200  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:18847020 errors:0 dropped:0 overruns:0 frame:0
          TX packets:54615385 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:476347261 (454.2 Mb)  TX bytes:3970090258 (3786.1 Mb)
          Interrupt:17


No collisions is a good sign.
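
Collisions are not the only counter worth checking; a duplex mismatch often also shows up as frame or carrier errors. A small Python sketch that totals every error counter from old-style ifconfig output, using the webserver lines posted above as input (the regular expression assumes the name:value layout shown there):

```python
import re

COUNTER = re.compile(r"(errors|dropped|overruns|frame|carrier|collisions):(\d+)")


def error_counters(ifconfig_text):
    """Sum each error-type counter across the RX and TX lines."""
    totals = {}
    for name, value in COUNTER.findall(ifconfig_text):
        totals[name] = totals.get(name, 0) + int(value)
    return totals


# Lines copied from the webserver's ifconfig output in this thread.
WEBSERVER = """\
eth0 Link encap:Ethernet HWaddr 00:02:B3:87:25:20
RX packets:17508608 errors:0 dropped:0 overruns:0 frame:0
TX packets:5250884 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:100
"""

totals = error_counters(WEBSERVER)
print(totals)
print("clean" if not any(totals.values()) else "check the link")
```

With every counter at zero on both boxes, a duplex problem looks less likely, though it does not rule out other network or NFS issues.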

Linux303 12-30-2002 04:33 PM

One more piece of info. Here is a separate box that also mounts NAS#2 and runs one large website as well as a lot of email:


procs                        memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  0  0   2312  35604 113896 164980   0   0     2   239   15   447   6  10  83
1  0  0   2312  38448 113900 163960   0   0     0   344 1390  1033   7  23  70
0  0  0   2312  33264 113900 168324   0   0     0    60 4070  2268   6   9  85
0  0  0   2312  39280 113900 162652   0   0     0   108  195   208   4   5  91
0  0  0   2312  39672 113904 162652   0   0     0   276  236   309  10  13  77
0  0  0   2312  40212 113908 162652   0   0     0   264  167   238   2   5  93
1  0  0   2312  38984 113908 162652   0   0     0    76  156   158   3   3  94
0  0  0   2312  39508 113912 162652   0   0     0   156  272   274   2   5  93
1  0  0   2312  36892 113916 164792   0   0     0   212 2147  1304   7  12  81
0  0  0   2312  37616 113916 164796   0   0     0   200  197   243   5  10  85
0  0  0   2312  37828 113924 164800   0   0     0   532  335   617  15  12  73
0  0  0   2312  36836 113928 164800   0   0     0   212  327   410   6   8  86
0  0  0   2312  37548 113928 164800   0   0     0   108  150   122   3  10  87
0  0  0   2312  37600 113928 164800   0   0     0    88  146   150   4   5  91
0  0  0   2312  38096 113928 164804   0   0     0   188  195   178   1   8  91
2  1  1   2312  37304 113932 164804   0   0     0   168  153   247   5   7  88
0  0  0   2312  37820 113936 164804   0   0     0   140  614   482   6   7  87
0  0  0   2312  37108 113940 165228   0   0     0   144  583   389   5   3  92
0  0  0   2312  35720 113944 165684   0   0     0   240 1071   864   6   7  87
0  0  0   2312  37160 113952 164804   0   0     0   228  254   267   5  11  84
0  0  0   2312  36340 113960 164804   0   0     0   232  207   269   4   3  93
0  0  0   2312  35172 113964 164804   0   0     0   228  460   614   7  11  82
0  0  0   2312  36440 113964 164812   0   0     0   596  389   716   5  33  62
0  0  0   2312  36300 113964 164812   0   0     0   200  353   446   6  12  82
0  0  0   2312  35780 113964 164812   0   0     0   680  377   835  10  28  62
0  0  0   2312  36760 113968 164816   0   0     0   460  300   305   7  16  77
0  0  0   2312  36380 113972 164800   0   0     0   168  604   498   4   5  91
1  0  0   2312  36676 113972 164800   0   0     0    56  153   146   4   5  91
0  0  0   2312  37096 113972 164792   0   0     0   112  214   206   1  11  88



Here is w:
3:29pm up 1 day, 1:07, 3 users, load average: 0.89, 0.80, 0.75

This box is a clone of the webserver in hardware.

Linux303 01-06-2003 02:45 PM

Fellow Penguins,

Wanted to let you know I finally resolved this issue. Each web server has its own Apache instance. The boxes are cloned in partitions and OS version, but I cannot say for certain that the same patches were applied to both boxes. I recompiled Apache for each site and I am now running at:

load average: 0.05, 0.06, 0.07

I'm thinking it was a shared library issue or something...?

Linux303


All times are GMT -5.