I have recently upgraded my home network to gigabit Ethernet, expecting to get NFS transfer speeds of around 50 MB/s. Unfortunately, the first results fall well short, at only around 17 MB/s. I have done a bit more testing, and here is what I came up with:
- During NFS transfers, CPU usage on the server is around 60% wait and 40% system, so it seems the network is the problem. CPU usage on the client is around 5%. The server is an IBM dual-P3/500 with 512 MB of RAM; the client is a Core 2 Duo (3 GHz) with 4 GB of RAM.
- Disk access on the server runs at around 40-60 MB/s. I expect the PCI bus to handle 50 MB/s of disk reads plus 50 MB/s of gigabit traffic (the total stays below the bus's 133 MB/s limit).
So it seems transfers are faster from the client to the server than the other way around, and that transmission without disk access is much faster on the server, but not on the client.
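If it helps to separate the two, the disk-only and network-only numbers can be measured independently with something like the following (a sketch; the device name and hostname are placeholders):
Code:
# raw sequential read speed of the server's disk, measured locally on the server
hdparm -t /dev/sda

# network-only throughput: run "iperf -s" on the server, then from the client:
iperf -c server -t 30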
client:
Code:
ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: umbg
Wake-on: d
Link detected: yes
server:
Code:
ethtool eth0
Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Half 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: g
Wake-on: d
Current message level: 0x00000037 (55)
Link detected: yes
Does anyone know what is causing the performance bottleneck or how to fix it?
I put gigabit in about a year ago and expected to get about 900 Mbps. In practice I can get about 300-350 Mbps sustained transfer rates. I did some tweaking but didn't really improve things much. I suspect the PCI bus is in fact a bottleneck - unless you only have one device on it and the data is going one way!
In the end, because I was achieving three times the speed of 100BaseT and close to raw disk performance, there wasn't much point in pursuing it further.
I agree with you, I don't expect 900 Mb/s. I expect 35-40 MB/s (around 300 Mb/s), and that's what I get from the client to the server. My problem is that the server can only send data at half that speed. So it seems the problem is not the bus, because that would also slow down receiving on the server.
What kind of tweaking have you done to try to make yours faster?
On one machine I got the same asymmetric performance issue as you and used Ethereal (now Wireshark) to see what was happening. There were lots of TCP errors. I reduced the final numbers for wmem and rmem and played around until I got stable performance. Rather than 4194304 I used 1048576. The sysctl variables can be set like:
Code:
sysctl -w net.ipv4.tcp_wmem="4096 16384 1048576"
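The matching receive-side variable can be set the same way, and both can be made persistent across reboots via /etc/sysctl.conf (a sketch; adjust the values to taste):
Code:
# receive-buffer limits: min, default, max in bytes
sysctl -w net.ipv4.tcp_rmem="4096 87380 1048576"

# to keep the settings after a reboot, add to /etc/sysctl.conf:
#   net.ipv4.tcp_wmem = 4096 16384 1048576
#   net.ipv4.tcp_rmem = 4096 87380 1048576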
Since then I've done fresh installs of openSUSE 10.3 on all the other machines without needing to upgrade the drivers or tweak the sysctl variables. On the machine that had the problems, the variables are now:
Interestingly, when I run it between my production server and the new server (running 10.3), which I'm going to swap to, I get 335 Mbits/sec one way and 496 Mbits/sec the other. The new server is a dual-core AMD Opteron with a few GB of RAM, so I would expect asymmetric performance. If I run two clients against the new server they both get about 300 Mbits/sec, so it does look like CPU and bus performance affect things.
Thanks for the advice. I increased all the buffers to 4 MB and the asymmetric performance issue seems to be gone. However, while I get 500 Mb/s with iperf, which is good, I only get 320 Mb/s with netcat and 136 Mb/s with NFS, which is exactly what I had before. (Yes, I remounted before testing.) I tried both NFS v3 and v4 over TCP. NFS rsize and wsize are set to 64k.
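For reference, the mount options described above correspond to something like this (a sketch; the export path and mount point are placeholders, and NFSv3 is assumed):
Code:
# client-side mount with 64k read/write block sizes over TCP
mount -t nfs -o tcp,rsize=65536,wsize=65536 server:/export /mnt/export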
Looks like the usual benchmarking vs. real-world issue. As I said before, once I got it stable, given that raw disk performance (again measured with hdparm, not real-world!) limits data transfer to about 50 MB/sec, I didn't see the point in looking further. Maybe Wireshark could tell you more about the packet sizes and whether any TCP errors are slowing things down.
FTP performance is around 22 MB/s (vsftpd). NFS is at 19 MB/s.
Those are very similar. I suspect something other than the network is your bottleneck - namely the drives. What is your disk setup? Have you tried a dd test as a *write*, i.e.
Code:
time dd if=/dev/zero of=some_file bs=10M count=100
Run this locally on the server, not over NFS.
What are your MTU settings on both client and server (as can be seen via ifconfig; see the sketch after these questions)?
What sort of switch are you using?
Do you see a difference in speed when fetching a file from the NFS server as opposed to writing one?
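A minimal sketch of those checks (the interface name, mount point and file names are placeholders):
Code:
# check the MTU on each end (look for the "MTU:" field in the output)
ifconfig eth0

# NFS write test, run on the client against the mounted export
time dd if=/dev/zero of=/mnt/export/testfile bs=10M count=100

# NFS read test, run on the client; remount first so the client cache doesn't skew it
time dd if=/mnt/export/testfile of=/dev/null bs=10M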
The server has a 10 GB SCSI disk for the system, a 320 GB IDE disk for /home, and a 1 TB SATA disk for file storage (mostly pictures and video).
Write performance (on the server) is 24 MB/s for the IDE disk, 53 MB/s for the SATA disk, and an abysmal 16 MB/s for the SCSI disk (that disk is almost 10 years old). I have a spare IDE controller somewhere; maybe that can improve IDE performance.
NFS write is 12 MB/s for the IDE disk and 17 MB/s for the SATA disk; read is 18 and 21 MB/s respectively.
I tried an MTU of both 1500 and 9000 (jumbo frames) on both the client and the server (yes, the switches support it). The switches are unmanaged Netgear and D-Link gigabit switches.
I just found something that may be interesting: if I fetch a file over NFS, remount the export (to flush the client cache, but keep the server's), and fetch it again, the speed is around 60 MB/s. So it does seem to be something about the disks on the server.
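For anyone wanting to repeat that test, it was roughly this sequence (a sketch; the paths are placeholders):
Code:
# first read: the server has to hit its disks
time dd if=/mnt/export/bigfile of=/dev/null bs=1M

# remount to drop the client-side cache but leave the server's page cache warm
umount /mnt/export && mount /mnt/export

# second read: served from the server's page cache, so the disks are out of the picture
time dd if=/mnt/export/bigfile of=/dev/null bs=1M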
I think that confirms it - if iperf is symmetrical at ~500 Mbps, then the networking is going as fast as it can and is probably limited by the PCI bus. When I had asymmetric performance and some very low figures on one machine, it was due to TCP errors and retries; changing the net.ipv4 sysctl variables fixed that.
By the way, when rerunning iperf yesterday and getting ~330 Mbps, one machine was only getting ~200 Mbps. However, there was a MySQL process consuming about 60% CPU at the time, so it also depends on CPU usage as well as the PCI bus.
I also had a passing look at jumbo frames while experimenting with the sysctl variables, and looked into the theoretical performance gains that could be made. However, they were marginal at best, and I decided they were irrelevant given the PCI bus and disk access limitations.
I think my newest server can manage ~600 Mbps, but the clients can't keep up!
The new server has a SATA drive but only gets ~75 MB/sec, which isn't much better than IDE, so maybe I'll have a look at that myself.
Running iperf and dd reads on both disks at the same time gives 57 MB/s in iperf, 13 MB/s on hda and 27 MB/s on sda, for a total of 97 MB/s on the bus. (Performance is the same when reading from a file rather than doing raw disk access.)
So the bottleneck is not the bus. What I found interesting is that disk performance drops by half under heavy network usage. CPU during the combined iperf/dd runs is 70% sys / 0% idle / 30% wait, and dd alone uses 45% CPU. So that's the bottleneck. Maybe I should first look at lowering CPU usage for disk access (is that possible? DMA and ACPI are already on).
But during an NFS operation the CPU is 25% sys / 10% idle / 65% wait and I only get 18 MB/s. Why is wait so high if neither device is at full speed and the CPU is not maxed out?
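As a starting point for the disk-side CPU question above, the DMA and 32-bit I/O settings of an IDE drive can be checked with hdparm (a sketch; /dev/hda is assumed):
Code:
# show whether DMA and 32-bit I/O support are currently enabled
hdparm -d -c /dev/hda

# enable both if they are off (use with care on old hardware)
hdparm -d1 -c1 /dev/hda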