LinuxQuestions.org

Sam1984 11-15-2008 08:09 AM

Differing network performance for identically connected hosts
 
Afternoon all,

I'm having a bit of a nightmare with differing network performance on two servers in the same rack, connected in exactly the same way. Here's the situation:

Host 1 - CentOS 4.6 (2.6.9-67.0.4.ELsmp), HP DL360 G4
Host 2 - CentOS 5.1 (2.6.18-92.1.6.el5), HP DL160 G5

Both are connected to the same Cisco switch at 100/FULL (hard coded), with the same config on each port. Both are in the same VLAN. No errors are present on the switchports or on the hosts themselves.

Transfers between the two servers run at pretty much a flat-out 100Mbps, and transfers to hosts in other similarly connected racks perform very well too.

The difference comes when we have a slower client. For example, a remote client is connected at about 16Mbps. Downloading from Host 1 clocks in at ~1.5MB/s. Downloading from Host 2 never exceeds 1.2MB/s. I'm measuring the downloads (fairly crudely) using curl to request a 15MB file from Apache, which runs on both boxes (same file, same Apache config).
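
To put those figures on one scale, here is a rough Python sketch that converts the reported download rates to megabits per second and compares them with the client's link. It assumes 1 MB = 10^6 bytes and ignores protocol overhead, so the percentages are approximate; the function name is purely illustrative.

```python
def mbytes_to_mbits(rate_mb_per_s: float) -> float:
    """Convert MB/s to Mbit/s, assuming decimal megabytes (an assumption)."""
    return rate_mb_per_s * 8


LINK_MBITS = 16.0  # client link speed quoted in the post

for host, rate in (("Host 1", 1.5), ("Host 2", 1.2)):
    mbits = mbytes_to_mbits(rate)
    share = 100 * mbits / LINK_MBITS
    print(f"{host}: {rate} MB/s = {mbits:.1f} Mbit/s ({share:.0f}% of the link)")
```

On those assumptions Host 1 fills roughly 75% of the client's link and Host 2 roughly 60%.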

Here's what I've tried so far, all without success:

1. Using a different webserver on both (thttpd)
2. Disabling iptables on both
3. Disabling ip_conntrack on Host 2 (as it's installed by default on CentOS 5)
4. Decreasing the MTU on both
5. Optimising the TCP stack on Host 2 (increasing default TCP window, using cubic congestion avoidance, etc)
6. Changing the network interface on Host 2 (to use a gigabit uplink, different cable and different network and switch port)

Finally, I have run tcpdump to monitor transfers for both servers (client initiates the download of 15MB from the server). They both rise quickly to the same maximum window size, and the only discernible difference I can see is that Host 2 has a lot more "TCP Window Full" entries than Host 1 (flagged on segments the server sends to the client).
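
As a rough cross-check on the window size, the sketch below estimates the throughput ceiling that a given receive window imposes at a given round-trip time (at most one window in flight per round trip). The inputs are assumptions read off the capture excerpt posted later in this thread: the RTT is taken from the handshake timestamps, and 30660 is the largest window the client advertises in that excerpt.

```python
def window_limited_rate(window_bytes: int, rtt_seconds: float) -> float:
    """Bytes per second if only one receive window can be in flight per round trip."""
    return window_bytes / rtt_seconds


# Assumed inputs, taken from the capture excerpt later in the thread.
RTT = 34.884499 - 34.866205  # SYN-ACK sent -> client's ACK seen, in seconds
WINDOW = 30660               # largest client-advertised window in the excerpt

ceiling = window_limited_rate(WINDOW, RTT)
print(f"RTT ~ {RTT * 1000:.1f} ms, ceiling ~ {ceiling / 1e6:.2f} MB/s")
```

With those inputs the ceiling comes out at about 1.7 MB/s. The excerpt only covers the first 200 ms of the transfer and the RTT under load may differ, so treat this as a sanity check rather than a diagnosis.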

Does anyone have any suggestions? I've been looking at this for days now!

Thanks in advance,

Sam

estabroo 11-15-2008 08:55 AM

Try testing with iperf; it might help you narrow it down, since it keeps track of lost packets and resends. It could be a flaky card causing lots of drops, which on the non-local connection will cause TCP to back off.

Sam1984 11-15-2008 10:27 AM

Thanks, but unfortunately iperf isn't really much help in this instance. I can't push data to the client from the server (it's firewalled), I can only download from the server whilst connected to the client (and iperf doesn't support this).

After running a few more packet captures on Host 2, it seems that there are quite a lot of duplicate ACKs (about one every 5-10 packets), and the rate at which the client acknowledges data sent to it from the server is also vastly increased.

There are no retransmissions though, so I don't think packet loss is the cause.

Thanks,

Sam

jiobo 11-15-2008 12:44 PM

ship it!
 
Hi Sam,

Sounds like a real problem...server 2 will have to be quarantined. Pack it up and ship it to me ASAP!

Sam1984 11-15-2008 01:18 PM

jiobo - Haha, at this rate I might as well!

Seriously though, I've been narrowing it down this afternoon...

I believe the problem is caused by not only these duplicate ACKs, but also by the fact that the client appears to be ACK'ing too often. If the client connects to Host1, the ACKs are far less frequent.

Another host with CentOS 5 does not exhibit this problem, although it has a Realtek card. A couple of other CentOS 5 hosts (with either the Broadcom tg3 or bnx2 driver) do exhibit the issue. I'm starting to think it might be chipset/driver related.

I copied the TCP parameters of the "good" CentOS 5 host to Host2, and there was no change.

I'm going to make a trip to the datacenter to fit a dual GigE Intel card to see if that helps matters.

Still welcoming any other suggestions...

Sam



tcpdump output (client is 192.0.0.1, host2 is 1.0.0.1)

17:25:34.866198 IP 192.0.0.1.2173 > 1.0.0.1.80: S 3261466931:3261466931(0) win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 0>
17:25:34.866205 IP 1.0.0.1.80 > 192.0.0.1.2173: S 3117770620:3117770620(0) ack 3261466932 win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 2>
17:25:34.884499 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 1 win 5840
17:25:34.884766 IP 192.0.0.1.2173 > 1.0.0.1.80: P 1:158(157) ack 1 win 5840
17:25:34.884778 IP 1.0.0.1.80 > 192.0.0.1.2173: . ack 158 win 1728
17:25:34.885079 IP 1.0.0.1.80 > 192.0.0.1.2173: . 1:2921(2920) ack 158 win 1728
17:25:34.904089 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 1 win 5840 <nop,nop,sack 1 {1461:2921}>
17:25:34.904103 IP 1.0.0.1.80 > 192.0.0.1.2173: . 2921:4381(1460) ack 158 win 1728
17:25:34.904106 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 2921 win 8760
17:25:34.904111 IP 1.0.0.1.80 > 192.0.0.1.2173: . 4381:5841(1460) ack 158 win 1728
17:25:34.904115 IP 1.0.0.1.80 > 192.0.0.1.2173: . 5841:7301(1460) ack 158 win 1728
17:25:34.925830 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 7301 win 17520
17:25:34.925837 IP 1.0.0.1.80 > 192.0.0.1.2173: . 7301:8761(1460) ack 158 win 1728
17:25:34.925840 IP 1.0.0.1.80 > 192.0.0.1.2173: P 8761:11681(2920) ack 158 win 1728
17:25:34.925857 IP 1.0.0.1.80 > 192.0.0.1.2173: . 11681:13141(1460) ack 158 win 1728
17:25:34.964395 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 13141 win 29200
17:25:34.964406 IP 1.0.0.1.80 > 192.0.0.1.2173: . 13141:20441(7300) ack 158 win 1728
17:25:34.982275 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 20441 win 30660
17:25:34.982284 IP 1.0.0.1.80 > 192.0.0.1.2173: . 20441:29201(8760) ack 158 win 1728
17:25:34.986048 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 20441 win 30660
17:25:34.990006 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 20441 win 30660
17:25:35.012126 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 29201 win 30660
17:25:35.012131 IP 1.0.0.1.80 > 192.0.0.1.2173: P 29201:35041(5840) ack 158 win 1728
17:25:35.012147 IP 1.0.0.1.80 > 192.0.0.1.2173: . 35041:36501(1460) ack 158 win 1728
17:25:35.032742 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 35041 win 30660
17:25:35.032752 IP 1.0.0.1.80 > 192.0.0.1.2173: . 36501:46721(10220) ack 158 win 1728
17:25:35.036200 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 35041 win 30660
17:25:35.049875 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 43801 win 30660
17:25:35.049881 IP 1.0.0.1.80 > 192.0.0.1.2173: P 46721:51101(4380) ack 158 win 1728
17:25:35.049886 IP 1.0.0.1.80 > 192.0.0.1.2173: . 51101:56941(5840) ack 158 win 1728
17:25:35.053919 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 46721 win 30660
17:25:35.053937 IP 1.0.0.1.80 > 192.0.0.1.2173: . 56941:61321(4380) ack 158 win 1728
17:25:35.058089 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 46721 win 30660
17:25:35.067922 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 56941 win 30660
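
For anyone wanting to count the duplicate ACKs rather than eyeball them, here is a minimal Python sketch that parses tcpdump text in the format shown above. It uses a simplified definition (an assumption, looser than what a protocol analyser applies): a payload-free ACK from the chosen sender whose ACK number repeats that sender's previous payload-free ACK. The function name and regex are illustrative, and the sample lines are copied from the capture above.

```python
import re

# Matches lines such as:
# "17:25:34.986048 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 20441 win 30660"
LINE_RE = re.compile(
    r"^\S+ IP (?P<src>\S+) > (?P<dst>\S+): (?P<flags>\S+) "
    r"(?:(?P<start>\d+):(?P<end>\d+)\((?P<length>\d+)\) )?"
    r"ack (?P<ack>\d+) win (?P<win>\d+)"
)


def count_duplicate_acks(lines, sender):
    """Return (pure_acks, duplicates) for payload-free ACKs sent by `sender`."""
    last_ack = None
    pure_acks = 0
    duplicates = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        # Skip unparsed lines, the other direction, and segments with a sequence range.
        if m is None or m["src"] != sender or m["length"] is not None:
            continue
        pure_acks += 1
        ack = int(m["ack"])
        if ack == last_ack:
            duplicates += 1
        last_ack = ack
    return pure_acks, duplicates


SAMPLE = """\
17:25:34.982275 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 20441 win 30660
17:25:34.982284 IP 1.0.0.1.80 > 192.0.0.1.2173: . 20441:29201(8760) ack 158 win 1728
17:25:34.986048 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 20441 win 30660
17:25:34.990006 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 20441 win 30660
17:25:35.012126 IP 192.0.0.1.2173 > 1.0.0.1.80: . ack 29201 win 30660
""".splitlines()

pure, dups = count_duplicate_acks(SAMPLE, "192.0.0.1.2173")
print(f"{dups} duplicate ACKs out of {pure} pure ACKs")
```

Running the same function over a full capture from each host would give a like-for-like comparison of how often the client repeats an ACK.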

Sam1984 11-15-2008 07:49 PM

Solved!
 
Solved :-)

Disabling TCP segmentation offload (TSO) resolved it. Googling around, it seems that this is a common(ish) issue with Broadcom cards and the CentOS 5 kernel.

The lone command "ethtool -K eth0 tso off" fixed all my problems.
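
To confirm the setting before and after the change, the Python sketch below reads the offload state from the output of `ethtool -k` (lower-case `-k` lists the settings, upper-case `-K` changes them). The helper names are hypothetical, and the label matching is an assumption: as far as I know the line is spelled "tcp segmentation offload" or "tcp-segmentation-offload" depending on the ethtool version, so check it against your own output.

```python
import subprocess
from typing import Optional


def tso_state(ethtool_k_output: str) -> Optional[bool]:
    """Return True/False for TSO on/off, or None if no matching line is found."""
    for line in ethtool_k_output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # Accept both the spaced and the hyphenated spelling of the label.
        if key.strip().lower().replace("-", " ") == "tcp segmentation offload":
            words = value.split()
            return words[0] == "on" if words else None
    return None


def read_tso_state(interface: str = "eth0") -> Optional[bool]:
    """Run `ethtool -k <interface>` and parse its output (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-k", interface],
        capture_output=True, text=True, check=True,
    )
    return tso_state(result.stdout)
```

Calling `read_tso_state("eth0")` before and after the `ethtool -K` command should show the value flip from True to False. Note that a setting made with `ethtool -K` does not survive a reboot on its own, so it has to be reapplied at startup.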

Thanks,

Sam


All times are GMT -5.