Linux - Networking
This forum is for any issue related to networks or networking.
Routing, network cards, OSI, etc. Anything is fair game.
I have the following configuration: 4 PCs (say A, B, C and D), running Ubuntu or Debian, interconnected by a gigabit switch, which is connected to the Internet. Two of the machines (say A and B) also have a direct private connection between them (provided by another pair of NICs).
Now, when I test the connection performance with iperf, the results vary. The private connection between A and B performs well: about 930 Mbps in iperf's UDP test. Between C and D it is about 800 Mbps, which I find tolerable. Packet loss in these tests is negligible. However, when I run iperf between any of {A,B} and {C,D}, performance drops significantly and a huge number of packets is lost. For example, here is the result of a test between A and C:
[ 3] local xxx.xxx.xxx.xxx port 34702 connected with xxx.xxx.xxx.xxx port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 834 MBytes 700 Mbits/sec
[ 3] Sent 594940 datagrams
[ 3] Server Report:
[ 3] 0.0-10.2 sec 179 MBytes 147 Mbits/sec 12.645 ms 467089/594938 (79%)
[ 3] 0.0-10.2 sec 1 datagrams received out-of-order
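A quick sanity check on that server report: the percentage iperf prints is simply lost/total datagrams. In Python, using the numbers above:

```python
# iperf's reported loss percentage is just lost/total datagrams.
lost, total = 467089, 594938        # from the server report above
loss_pct = 100.0 * lost / total
print(f"{loss_pct:.1f}% loss")      # -> 78.5% loss, which iperf rounds to 79%
```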
Why are so many packets generated but then lost somewhere?
The A<->B private link works fine, so the system-level parameters on both A and B are correct. Furthermore, C<->D works fine, so I guess I shouldn't blame the switch.
Is there a per-NIC configuration I should check, or does this smell like a hardware problem? The problematic NICs on both A and B are the same model: Allied Telesyn AT2916T.
At first blush this looks like a configuration issue (most likely a duplex mismatch). You don't say how your systems are configured: are they set to auto-negotiate or forced to a specific speed/duplex? What is the wire distance?
Auto-negotiation is great when it works and a nightmare when it doesn't.
Another possibility is buffer overflow (the NIC buffer fills before the system can empty it).
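If you suspect the receive buffer, one quick check (a generic sketch, not specific to this setup) is to request a bigger UDP receive buffer and see what the kernel actually grants; on Linux the granted value is capped by net.core.rmem_max and reported back doubled:

```python
import socket

# Ask the kernel for a larger UDP receive buffer and see what we actually get.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
requested = 4 * 1024 * 1024                      # 4 MiB (example value)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {requested}, granted {granted}")
# A grant far below the request points at the net.core.rmem_max sysctl limit.
sock.close()
```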
If you are still having trouble, post back with your configs, and don't mask every machine with the same xxx IP address. Be more precise with the IPs, e.g. xxx.xxx.xxx.aaa, xxx.xxx.xxx.bbb, xxx.xxx.yyy.ccc, xxx.xxx.yyy.ddd, so we can tell the systems apart.
Thanks for your suggestions! I am now a bit closer to the source of the trouble. Namely, I forgot to mention that A and B are not running an ordinary kernel: they are both Xen Dom0s. When I reboot them with the same kernel but without the Xen hypervisor, the huge packet loss disappears. The performance is not great though: I get about 575 Mbps uplink and 690 Mbps downlink (everything is configured by auto-negotiation, I don't specify anything explicitly). Still, this bandwidth is perfectly fine for me; I just want to get rid of the packet-loss problem.
Furthermore, I have discovered that the problem occurs only when A or B act as receivers. Here is the score (A=iperf server=receiver, C=iperf client=sender):
Quote:
[ 3] local xxx.xxx.xxx.aaa port 5001 connected with xxx.xxx.xxx.bbb port 38590
[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 3] 0.0-10.2 sec 324 MBytes 266 Mbits/sec 12.213 ms 356145/587534 (61%)
[ 3] 0.0-10.2 sec 1 datagrams received out-of-order
ifconfig reports no problems, but here is the output of netstat -su:
Quote:
IcmpMsg:
InType3: 19
OutType3: 27
OutType8: 14
Udp:
1181505 packets received
108486 packets to unknown port received.
2348066 packet receive errors
1481319 packets sent
RcvbufErrors: 2348066
UdpLite:
IpExt:
InMcastPkts: 17
InBcastPkts: 1926
InOctets: 1154882666
OutOctets: -2076309331
InMcastOctets: 476
InBcastOctets: 386075
(This is after several tests. After each one, not surprisingly, RcvbufErrors increases by exactly the number of lost packets reported by iperf.)
Any other suggestions? How can I determine precisely where the packets get dropped? Judging from all this it is Xen's fault, so I'll go explore their mailing lists...
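Since RcvbufErrors grows in lockstep with iperf's lost-packet count, the drops are happening in the socket receive buffer rather than on the wire. Independent of the Xen question, raising the kernel's receive-buffer limits may help absorb bursts. A sketch (the values are illustrative, not a recommendation):

```shell
# Allow applications to request larger UDP receive buffers (example values):
sysctl -w net.core.rmem_max=8388608       # cap on what SO_RCVBUF may request (8 MiB)
sysctl -w net.core.rmem_default=262144    # default for sockets that don't ask
# iperf can then request a large buffer explicitly, e.g.: iperf -s -u -w 4M
```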
Have you tried testing this from a virtual machine (a DomU) rather than from Dom0?
Or raise the specs of your Dom0 (especially RAM) for testing purposes.
The rates you are seeing are close to the transfer rate of a single hard drive (is Dom0 swapping the received or sent data?).
It took me a while to figure out why on earth bonding did not work.
When I rebooted my server I realized it was using the Xen kernel instead of the regular one.
After I changed it, everything worked well. Before that I was losing every second ping packet!
So the Xen kernel can cause many network issues.
Well, I have found the bottleneck. The problem is actually the CPU. Under a "normal" kernel, iperf processing burns 100% of one core and about 50% of the other. In the Xen configuration I originally had only one core dedicated to Dom0, which is by far insufficient for this kind of processing! Even with both cores active, all the cycles get consumed (because networking requires more "thinking" under Xen) and the problem persists.
Now I guess the solution is to buy a faster processor.
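Before buying hardware, it may be worth giving Dom0 more vCPUs at boot. A sketch assuming a GRUB-booted Xen host (dom0_max_vcpus and dom0_vcpus_pin are standard Xen command-line options; adjust names and paths to your distro and Xen version):

```shell
# Add to the Xen hypervisor command line (e.g. GRUB_CMDLINE_XEN in /etc/default/grub):
#   dom0_max_vcpus=2 dom0_vcpus_pin
# After a reboot, verify how many vCPUs Dom0 has:
xl vcpu-list Domain-0   # (or 'xm vcpu-list Domain-0' on older toolstacks)
```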