LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Linux - Networking This forum is for any issue related to networks or networking.
Routing, network cards, OSI, etc. Anything is fair game.

Old 02-26-2007, 06:10 AM   #1
InDubio
LQ Newbie
 
Registered: Feb 2007
Posts: 10

Rep: Reputation: 0
high IOWait on server when copying files from network


Hi there,

We've got a problem with our SMB file server here: every time we copy data from the network to the server, the iowait time hits the 90% mark, the load average rises above 10, and the throughput drops to 3 MB/s. Browsing the file tree via SMB at the same time is virtually impossible (you have to wait 15 to 20 seconds before Windows Explorer shows you the directory contents).

Copying files from the server is not a problem and works like a charm.
But first, here's some information about the hardware:
The server is an FSC RX300 S3:
  • 2x P4 Xeon DP 2.8 GHz
  • 3 GB DDR2 memory
  • Emulex LightPulse PCI-X Fibre Channel HBA
  • Intel e1000 LWL PCI-X network adapter
  • Distro is Gentoo 2006.1
  • Kernel is 2.6.18-gentoo-r4
  • Connected via Fibre Channel are 2 Compaq SmartArrays (exporting sdb through sdi)
  • sdb through sdi are part of one LVM2 logical volume (1.6 TB), on which the SMB share resides


The thing is, I can't find the bottleneck that is causing these high iowait times, but I was able to rule out several possible causes:
  • Using FTP instead of SMB shows the same symptoms.
  • Using another hard drive (/dev/sda instead of the logical volume residing on an external SAN connected via Fibre Channel) shows the same problem, too.
  • "netcatting" a lot of data from a remote machine to /dev/null is not a problem at all (15 MB/s).
  • Moving files from one internal SCSI hard drive to the logical volume is not a problem either.
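For reference, the raw-network test in the third bullet can be reproduced with something like the following (hostname and port are placeholders, and the exact nc flags vary between netcat flavors, so treat this as a sketch):

```shell
# On the server: listen and discard everything, so no disk is involved
nc -l -p 5000 > /dev/null

# On a remote machine: push 1 GB of zeros; dd reports the throughput
dd if=/dev/zero bs=1M count=1024 | nc fileserver.example 5000
```

If this path is fast while the SMB/FTP-to-disk path is slow, the problem is somewhere in the interaction between receiving from the network and writing to disk, not in the network path itself.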

And here is the output of some programs I ran while copying data to the server:
Code:
$ mpstat 1
11:14:56     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
11:14:57     all    0.50    0.00    2.73   90.57    0.25    0.50    0.00    5.46   4152.48
11:14:58     all    0.25    0.00    2.25   97.00    0.25    0.25    0.00    0.00   5972.28
11:14:59     all    0.75    0.00    1.74   92.54    0.50    1.49    0.00    2.99   7712.00

$ cat /proc/loadavg
9.03 6.44 3.14 1/346 27174
And now I'm out of ideas. Every part of the system seems to work fine on its own (the network, the Fibre Channel, the internal SCSI). Even working together seems fine, UNLESS you try to put data from the network onto a hard drive.

So maybe somebody has got an idea where the problem might be.
 
Old 03-10-2007, 10:43 AM   #2
Slim Backwater
Member
 
Registered: Nov 2005
Distribution: Slackware 10.2 2.6.20
Posts: 68

Rep: Reputation: 15
Quote:
Originally Posted by InDubio
...every time we copy data from the network to the server, the iowait time hits the 90% mark, the load average rises above 10, and the throughput drops to 3 MB/s.
...
Copying files from the server is not a problem and works like a charm.
  • Emulex LightPulse PCIx Fiberchannel HBA
  • Intel e1000 LWL PCIx Network Adapter
My guess is PCI bus contention. A machine like that will likely have multiple PCI buses. Are the two PCI-X cards on the same PCI bus? If they are, try moving one to a different bus; if they are not, you could try the opposite and put them on the same bus.

While my experience is with the copper Intel PRO/1000 GT, see if the e1000 is generating excessive interrupts. Check out:

/usr/src/linux/Documentation/networking/e1000.txt
and
http://support.intel.com/support/net...o100/21397.htm

for details on the InterruptThrottleRate option to the e1000 module.
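As a sketch, the option can be made persistent via modprobe configuration (the file path depends on your distribution, and the value 3000 here is just an example, not a recommendation):

```
# /etc/modprobe.d/e1000.conf (or /etc/modprobe.conf on older setups)
# Cap the NIC at roughly 3000 interrupts/s instead of one per packet
options e1000 InterruptThrottleRate=3000
```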

You can watch your interrupts by running `vmstat 5` in another window while a download is in progress.
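Along the same lines, a rough way to see which block device the iowait is actually piling up on is to sample the io_ticks column of /proc/diskstats (field 13, milliseconds spent doing I/O) twice, a couple of seconds apart, while a copy is running. This is just a sketch, not a substitute for `iostat -x`:

```shell
#!/bin/sh
# Sample "milliseconds spent doing I/O" (io_ticks, field 13 of
# /proc/diskstats) for each whole disk, twice, 2 seconds apart.
# A device whose counter grows by ~2000 ms per 2-second interval
# is saturated and is the likely source of the iowait.
snapshot() {
    awk '$3 ~ /^(sd|hd)[a-z]+$/ { print $3, $13 }' /proc/diskstats
}
echo "device io_ticks_ms"
snapshot
sleep 2
snapshot
```

Run it once while the box is idle and once during a copy; the device whose counter jumps is where the time is going.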

Also, try making a RAM disk and download to it:
Code:
mkdir /mnt/ram
mount -t ramfs nothing /mnt/ram
And to be complete, time the copy from the RAM disk to the hard drive.
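That last step might look like this (the file name and share path are placeholders):

```shell
# Create a test file in the RAM disk, then time the RAM-to-disk copy;
# this isolates the disk side from the network side entirely.
dd if=/dev/zero of=/mnt/ram/testfile bs=1M count=512
time cp /mnt/ram/testfile /path/to/share/
```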

HTH
 
Old 03-12-2007, 04:10 AM   #3
InDubio
LQ Newbie
 
Registered: Feb 2007
Posts: 10

Original Poster
Rep: Reputation: 0
Well, first: thanks for the reply.

Second, how do I see whether the two cards are connected to the same PCI bus?
I tried "lspci -t".
Code:
-[0000:00]-+-00.0
           +-02.0-[0000:01-03]--+-00.0-[0000:02]--+-08.0
           |                    |                 \-08.1
           |                    \-00.2-[0000:03]--
           +-04.0-[0000:04]----00.0
           +-05.0-[0000:05]----00.0
           +-06.0-[0000:06-08]--+-00.0-[0000:07]----01.0
           |                    \-00.2-[0000:08]----01.0
           +-1d.0
           +-1d.1
           +-1d.2
           +-1d.3
           +-1d.7
           +-1e.0-[0000:09]----05.0
           +-1f.0
           +-1f.1
           \-1f.3
Where 08:01.0 is the FC HBA and 07:01.0 is the Intel PRO/1000.
Does that indicate they are both on the same PCI bus?
If so, then maybe that's not the problem, because I also tried the onboard SCSI HBA (PCI ID 02:08.0), which resulted in the same problem.
I also switched from the Intel PRO/1000 to the onboard Broadcom 1 Gb copper card (PCI ID 04:00.0), still with the same symptoms.
(All that is, if the above tree view really shows the different PCI buses.)

I also messed around a bit with the TCP congestion control algorithm and various other TCP stack "optimizations", like changing the receive buffer sizes, which seems to soften the problem a bit.
Here's what I changed:
Code:
echo "reno" > /proc/sys/net/ipv4/tcp_congestion_control
echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save
echo 16777216 > /proc/sys/net/core/rmem_max
echo 16777216 > /proc/sys/net/core/wmem_max
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem
I still have the problem with the copying "stalling" every now and then for 20 to 30 seconds, but the server load average now stays below 3.
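For what it's worth, those echo commands don't survive a reboot; the equivalent /etc/sysctl.conf entries (a sketch using the same values) would be:

```
net.ipv4.tcp_congestion_control = reno
net.ipv4.tcp_no_metrics_save = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
```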

I will try to change to a 2.6.20 kernel when we can afford a little downtime (hopefully this week, and maybe I messed something up in the kernel config).
Oh, and I remembered that we once had VMware running on that server; maybe the vmnet and vmbridge modules are messing with the internal network handling. I will get rid of them once I've changed the kernel.

I nearly forgot: the RAM disk test.
I copied a 561 MB file from the network to the RAM disk; at first I got around 9 MB/s, which dropped to 4 MB/s after about 300 MB. The interrupts were at around 4000 while copying.
Moving that file from the RAM disk to the FC disk took 5.59 seconds(!), which works out to roughly 100 MB/s, so the disk side on its own seems quite fast.

Again, thanks for your suggestions.
 
  


Tags
iowait, load average


