LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Linux - Networking This forum is for any issue related to networks or networking.
Routing, network cards, OSI, etc. Anything is fair game.

Old 02-26-2007, 06:10 AM   #1
InDubio
LQ Newbie
 
Registered: Feb 2007
Posts: 10

Rep: Reputation: 0
high IOWait on server when copying files from network


Hi there,

We've got a problem with our SMB file server here: every time we copy data from the network to the server, the iowait time hits the 90% mark, the load average rises above 10, and the throughput drops to 3 MB/s. Browsing the file tree via SMB at the same time is virtually impossible (you have to wait 15 to 20 seconds before Windows Explorer shows you the directory contents).

Copying files from the server is not a problem and works like a charm.
But first, here's some information about the hardware:
The server is an FSC RX300 S3:
  • 2x P4 Xeon DP 2.8 GHz
  • 3 GB DDR2 memory
  • Emulex LightPulse PCI-X Fibre Channel HBA
  • Intel e1000 LWL PCI-X network adapter
  • Distro is Gentoo 2006.1
  • Kernel is 2.6.18-gentoo-r4
  • Connected via Fibre Channel are 2 Compaq SmartArrays (exporting sdb through sdi)
  • sdb through sdi are part of one LVM2 logical volume (1.6 TB), on which the SMB share resides


The thing is, I can't find the bottleneck that is causing these high iowait times, but I was able to rule out several possible causes:
  • Using FTP instead of SMB shows the same symptoms.
  • Using another hard drive (/dev/sda instead of the logical volume residing on an external SAN connected via Fibre Channel) shows the same problem, too.
  • "netcatting" a lot of data from a remote machine to /dev/null is not a problem at all (15 MB/s).
  • Moving files from one internal SCSI hard drive to the logical volume is not a problem either.
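For reference, the raw-network test in the third bullet can be reproduced with something like the following (hostname and port are placeholders, and the exact nc flags vary between netcat flavors, so treat this as a sketch):

```shell
# On the server: listen and discard everything, so no disk is involved
nc -l -p 5000 > /dev/null

# On a remote machine: push 1 GB of zeros; dd reports the throughput
dd if=/dev/zero bs=1M count=1024 | nc fileserver.example 5000
```

If this path is fast while the SMB/FTP-to-disk path is slow, the problem is somewhere in the interaction between receiving from the network and writing to disk, not in the network path itself.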

And here is the output of some programs I ran while copying data to the server:
Code:
$ mpstat 1
11:14:56     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
11:14:57     all    0.50    0.00    2.73   90.57    0.25    0.50    0.00    5.46   4152.48
11:14:58     all    0.25    0.00    2.25   97.00    0.25    0.25    0.00    0.00   5972.28
11:14:59     all    0.75    0.00    1.74   92.54    0.50    1.49    0.00    2.99   7712.00

$ cat /proc/loadavg
9.03 6.44 3.14 1/346 27174
And now I'm out of ideas. Every part of the system seems to work fine on its own (the network, the Fibre Channel, the internal SCSI). Even working together seems fine, UNLESS you try to put data from the network onto a hard drive.

So maybe somebody has got an idea where the problem might be.
 
Old 03-10-2007, 10:43 AM   #2
Slim Backwater
Member
 
Registered: Nov 2005
Distribution: Slackware 10.2 2.6.20
Posts: 68

Rep: Reputation: 15
Quote:
Originally Posted by InDubio
...every time we copy data from the network to the server, the iowait time hits the 90% mark, the load average rises above 10, and the throughput drops to 3 MB/s.
...
Copying files from the server is not a problem and works like a charm.
  • Emulex LightPulse PCIx Fiberchannel HBA
  • Intel e1000 LWL PCIx Network Adapter
My guess is PCI bus contention. A machine like that will likely have multiple PCI buses. Are the two PCI-X cards on the same PCI bus? If they are, try moving one to a different bus; if they are not, you could try the opposite and put them on the same bus.

While my experience is with the copper Intel PRO/1000 GT, see if the e1000 is generating excessive interrupts. Check out:

/usr/src/linux/Documentation/networking/e1000.txt
and
http://support.intel.com/support/net...o100/21397.htm

for details on the InterruptThrottleRate option to the e1000 module.
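As a sketch, the option can be made persistent via modprobe configuration (the file path depends on your distribution, and the value 3000 here is just an example, not a recommendation):

```
# /etc/modprobe.d/e1000.conf (or /etc/modprobe.conf on older setups)
# Cap the NIC at roughly 3000 interrupts/s instead of one per packet
options e1000 InterruptThrottleRate=3000
```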

You can watch your interrupts by running `vmstat 5` in another window while a download is in progress.
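Along the same lines, a rough way to see which block device the iowait is actually piling up on is to sample the io_ticks column of /proc/diskstats (field 13, milliseconds spent doing I/O) twice, a couple of seconds apart, while a copy is running. This is just a sketch, not a substitute for `iostat -x`:

```shell
#!/bin/sh
# Sample "milliseconds spent doing I/O" (io_ticks, field 13 of
# /proc/diskstats) for each whole disk, twice, 2 seconds apart.
# A device whose counter grows by ~2000 ms per 2-second interval
# is saturated and is the likely source of the iowait.
snapshot() {
    awk '$3 ~ /^(sd|hd)[a-z]+$/ { print $3, $13 }' /proc/diskstats
}
echo "device io_ticks_ms"
snapshot
sleep 2
snapshot
```

Run it once while the box is idle and once during a copy; the device whose counter jumps is where the time is going.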

Also, try making a RAM disk and download to it:
Code:
mkdir /mnt/ram
mount -t ramfs nothing /mnt/ram
And to be complete, time the copy from the RAM disk to the hard drive.
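That last step might look like this (the file name and share path are placeholders):

```shell
# Create a test file in the RAM disk, then time the RAM-to-disk copy;
# this isolates the disk side from the network side entirely.
dd if=/dev/zero of=/mnt/ram/testfile bs=1M count=512
time cp /mnt/ram/testfile /path/to/share/
```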

HTH
 
Old 03-12-2007, 04:10 AM   #3
InDubio
LQ Newbie
 
Registered: Feb 2007
Posts: 10

Original Poster
Rep: Reputation: 0
Well, first: thanks for the reply.

Second, how do I see whether the two cards are connected to the same PCI bus?
I tried "lspci -t".
Code:
-[0000:00]-+-00.0
           +-02.0-[0000:01-03]--+-00.0-[0000:02]--+-08.0
           |                    |                 \-08.1
           |                    \-00.2-[0000:03]--
           +-04.0-[0000:04]----00.0
           +-05.0-[0000:05]----00.0
           +-06.0-[0000:06-08]--+-00.0-[0000:07]----01.0
           |                    \-00.2-[0000:08]----01.0
           +-1d.0
           +-1d.1
           +-1d.2
           +-1d.3
           +-1d.7
           +-1e.0-[0000:09]----05.0
           +-1f.0
           +-1f.1
           \-1f.3
Where 08:01.0 is the FC HBA and 07:01.0 is the Intel PRO/1000.
Does that indicate they are both on the same PCI bus?
If so, then maybe that's not the problem, because I also tried the onboard SCSI HBA (PCI ID 02:08.0), which resulted in the same problem.
I also switched from the Intel PRO/1000 to the onboard Broadcom 1 Gb copper card (PCI ID 04:00.0), still with the same symptoms.
(All that is, if the above tree view really shows the different PCI buses.)

I also messed around a bit with the TCP congestion control algorithm and various other TCP stack "optimizations", like changing the receive buffer sizes, which seems to soften the problem a bit.
Here's what I changed:
Code:
echo "reno" > /proc/sys/net/ipv4/tcp_congestion_control
echo 1 > /proc/sys/net/ipv4/tcp_no_metrics_save
echo 16777216 > /proc/sys/net/core/rmem_max
echo 16777216 > /proc/sys/net/core/wmem_max
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_rmem
echo "4096 87380 16777216" > /proc/sys/net/ipv4/tcp_wmem
I still have the problem with the copying "stalling" every now and then for 20 to 30 seconds, but the server load average now stays below 3.
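For what it's worth, those echo commands don't survive a reboot; the equivalent /etc/sysctl.conf entries (a sketch using the same values) would be:

```
net.ipv4.tcp_congestion_control = reno
net.ipv4.tcp_no_metrics_save = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
```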

I will try to change to a 2.6.20 kernel when we can afford a little downtime (hopefully this week, and maybe I messed something up in the kernel config).
Oh, and I remembered that we once had VMware running on that server; maybe the vmnet and vmbridge modules are messing with the internal network handling. I will get rid of them once I've changed the kernel.

I nearly forgot: the RAM disk test.
I copied a 561 MB file from the network to the RAM disk; at first I got around 9 MB/s, which dropped to 4 MB/s after about 300 MB. The interrupts were at around 4000 while copying.
Moving that file from the RAM disk to the FC disk took 5.59 seconds(!), which works out to roughly 100 MB/s, so the disk side on its own seems quite fast.

Again, thanks for your suggestions.
 
  


Tags
iowait, load average


