SSH/rsync "Connection reset by peer"

hamish · 12-12-2004, 08:16 AM

Hey

I have been experimenting with rsync over ssh inside my private network (so the connection speeds are very high). However, I keep running into problems while rsyncing that cause the connection to break.

Here is the command and its output:

Code:

rsync -avzl -e ssh /home/hamish/* hamish@hamishnet.homelinux.com:/tmp/hamish
.
.
.
.

website/.wine/fake_windows/Games/SimGolf/top10.sve
website/.wine/fake_windows/Games/SimGolf/unpack.exe
Read from remote host hamishnet.homelinux.com: Connection reset by peer
rsync: writefd_unbuffered failed to write 16385 bytes: phase "unknown": Broken pipe
rsync error: error in rsync protocol data stream (code 12) at io.c(666)
hamish@archimedes hamish $

Is this a problem with the server, or with rsync?

I am transferring a huge amount of data (about 2 GB, as this is the first sync). Can rsync just not cope?

Or is SSH timing out? Also, what actually counts as timing out? If files are copying, is it possible for it to time out? Can I set the server never to break the connection while the files are transferring?

Thanks
Hamish

not_an_expert · 01-06-2005, 11:02 PM

I have been fighting this problem for quite a while now and do not believe that it is an application level issue. I think it is in the Linux IP stack. This problem is cropping up wherever folks are trying to perform large streaming data transfers over a LAN. It shows up in Apache, SSH/SCP, FTP, and Samba. There are hundreds of hits for this out there in Googleville, but there have as yet been no answers to the problem. SSH is especially good at causing the problem.

Maybe this will help:

I have a mixed environment of Linux, OS X, and Windows servers. I see this issue whenever I attempt to move a lot of data over the network to a Windows server. When I use a Linux FTP client to PUT some large files, the transfer rate starts at about 11.8 MB/s and holds there until the link is reset by the peer (a W2K server). If I use a Mac OS X FTP client to do the transfer, it starts off at maybe 5-6 MB/s, then falls back to around 3.8-4.0 MB/s for the balance of the transfer. The Macs cannot crash the connection, and neither can another Windows host. Both the Macs and the Windows boxes are exhibiting congestion-limiting behaviors. The Linux boxes are not.
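
As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It assumes a 100 Mb/s Ethernet link (the speed mentioned further down in this post), full-size frames carrying a 1460-byte TCP payload with no TCP options, and MB meaning 10^6 bytes. The function names are made up for illustration.

```python
# Back-of-the-envelope: best-case TCP payload rate on 100 Mb/s Ethernet,
# and what share of it each observed transfer rate represents.

LINK_BITS_PER_S = 100_000_000  # assumed link speed: 100 Mb/s
MSS = 1460                     # TCP payload per full-size frame (assumes no TCP options)
# Bytes on the wire per frame: 1500 MTU + 14 Ethernet header + 4 FCS
# + 8 preamble + 12 inter-frame gap.
WIRE_BYTES_PER_FRAME = 1538


def max_goodput_mb_per_s(link_bits_per_s=LINK_BITS_PER_S):
    """Best-case TCP payload rate in MB/s (10^6 bytes per second)."""
    frames_per_s = link_bits_per_s / 8 / WIRE_BYTES_PER_FRAME
    return frames_per_s * MSS / 1e6


def link_share(rate_mb_per_s):
    """Fraction of the best-case payload rate that a measured rate uses."""
    return rate_mb_per_s / max_goodput_mb_per_s()


if __name__ == "__main__":
    print(f"ceiling: {max_goodput_mb_per_s():.2f} MB/s")
    for rate in (11.8, 4.0, 3.8):
        print(f"{rate} MB/s is {link_share(rate):.0%} of the ceiling")
```

Under those assumptions the ceiling works out to roughly 11.87 MB/s, so the Linux client at 11.8 MB/s is running at essentially wire speed, while the Macs at 3.8-4.0 MB/s settle at about a third of it.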

The Linux IP stack does not appear to be paying attention to ICMP source quench (Type 4) messages. It just keeps sending packets as fast as it can until all the target's buffers fill up. Once that happens, the target has to stop the data flow by killing the connection. When a host or a router is being sent data faster than it can get it onto disk or out another NIC, it has to buffer it in memory. If the data is bursty, the buffers give you time to move the data onto your disks or onto another network. If the data streams in smoothly, the buffers fill and the host has nowhere to put more data. The only thing it can do then is pull the plug by resetting the TCP connection.

The proper response to congestion is to send source quench packets to the sender. The sender is then supposed to reduce transmission speed, allowing the receiver to catch up. Since all the other machines I have are lowering their rates for the Windows server, it must be sending source quench messages. I think Linux is ignoring them.

I am not good enough at writing tcpdump capture filters to verify this. Empirically, I was able to determine that I could delay the connection resets by increasing the size of the cache RAM chip on the RAID controller. A test transfer that normally fails at the same point, to within +/- 10 MB, fails 128 MB later after I extend the cache RAM by that amount. The RAID controller is fast enough to keep up with the 100 Mb/s network stream, as long as it can write to cache. When it has to go to disk, it can't keep up and the connection gets reset.
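
For anyone who wants to check this on the wire, here is a minimal sketch of the capture filters involved. It is a small Python helper that only builds a tcpdump command line and does not run it. The interface name eth0, the address 192.168.1.10, and the helper name are placeholders. The filter expressions themselves use standard pcap-filter syntax.

```python
import shlex

# pcap-filter expressions for the two events of interest.
# {peer} is filled in with the address of the machine at the other end.
FILTERS = {
    # ICMP type 4 (source quench) messages to or from the peer
    "source_quench": "icmp and host {peer} and icmp[icmptype] == icmp-sourcequench",
    # TCP segments to or from the peer that have the RST flag set
    "resets": "tcp and host {peer} and tcp[tcpflags] & tcp-rst != 0",
}


def capture_command(interface, peer, kind):
    """Build (but do not run) a tcpdump argument list for one filter kind."""
    expression = FILTERS[kind].format(peer=peer)
    # -n: skip name resolution; -i: capture interface
    return ["tcpdump", "-n", "-i", interface, expression]


if __name__ == "__main__":
    # Placeholder interface and address; substitute your own.
    for kind in FILTERS:
        print(shlex.join(capture_command("eth0", "192.168.1.10", kind)))
```

Running the printed commands normally needs root. If the source quench capture stays empty while the resets capture fires, the receiver is not sending source quench messages at all.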

I think the reason SSH is so good at exposing this behavior is the additional cryptographic processing load it imposes on the target machine. If the CPU is busy doing crypto it has less time to handle disk I/O. The buffers fill faster and the link resets sooner.

I think the blame for this one belongs to the guys maintaining the IP stack, not the SSH, rsync, Samba, or Apache guys. If the Linux stack refuses to back off when the clients or their routers are becoming saturated with data (see congestion collapse), "connection reset by peer" is the appropriate last-ditch response.

Dave Rutledge

not_an_expert · 01-08-2005, 07:10 PM

I was not correct about the source quench packets. I dragged Comer down off the shelf and refreshed a bit. The clients are using TCP sliding windows to control transmission rates.

I did a capture on this today and the Windows server had reduced its TCP window from an initial value of about 64k down to about 6000, but the Linux hosts just kept blasting away at full speed. They are certainly not waiting for ACKs. I saw one send six 1460-byte packets right after it had received an ACK with a window size of around 6000 from its victim. Then the link reset again.
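
The arithmetic behind that observation, as a small Python sketch. It assumes the figure of about 6000 is the effective receive window in bytes (no window scale factor applied) and that all six segments were sent before any further ACK arrived. The function names are invented for illustration.

```python
MSS = 1460  # payload bytes per segment, as seen in the capture


def bytes_in_flight(segments, mss=MSS):
    """Unacknowledged payload bytes after sending `segments` full segments."""
    return segments * mss


def exceeds_window(segments, advertised_window, mss=MSS):
    """True if the sender has more data outstanding than the receiver allowed."""
    return bytes_in_flight(segments, mss) > advertised_window


def segments_allowed(advertised_window, mss=MSS):
    """Number of full-size segments that fit inside the advertised window."""
    return advertised_window // mss


if __name__ == "__main__":
    window = 6000  # approximate advertised window from the capture
    sent = 6       # segments observed after that ACK
    print("bytes sent:", bytes_in_flight(sent))
    print("full segments the window allows:", segments_allowed(window))
    print("window exceeded:", exceeds_window(sent, window))
```

Six segments are 8760 bytes, while a 6000-byte window has room for only four full segments (5840 bytes), so under those assumptions the sender overran the window by almost 50%.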

The send transfer rate on KDE System Guard looks like a square wave. It has only two speeds: 0% and 100%.

I am pretty sure something is broken in there. I just don't know how to proceed. I submitted it as a bug against one of my licensed Red Hat systems. I don't know if this is in the kernel or if the application developers have done something silly like overriding the congestion avoidance algorithms. This behavior is evident in the current RHEL kernel (2.4.20-something, it's at work) and continues up until at least 2.6.9.

I have been trying to figure out how to contact those Lords of Creation, the Maintainers of the Sacred and Holy Kernel, but that seems more like a vision quest than a viable support structure. There seems to be some requirement for animal sacrifice, burnt offerings, and secret initiations involved in finding these guys. Also something about crawling up a mountain backwards on one's hands and knees.

Their attitude lends credence to M$FT's FUD about not being able to get things fixed in Linux. I realize that they are volunteers and can't talk to the whole bloody world about every little issue that pops up, but there ought to be a formalized way to gain an audience with them.

kashani · 01-19-2005, 03:57 PM

I'm seeing exactly the same thing on my system as well.

video1.lax is trying to rsync to video1.iad. Files are 100MB and up since it's all video. video1.iad is heavily loaded serving video and we're tunneling the whole thing over ssh. I'd guess we're on the cusp, since sometimes the rsync will go for 20 minutes and other times for 2 minutes. I can't point to any set time limit, but the load on the server does fluctuate pretty dramatically at times.

I'm starting packet traces to verify this, but the description of our issue is so spot-on I'm sure it'll match. Nice work, not_an_expert.

kashani

ets_adm · 09-13-2005, 02:27 AM

1) I've never seen this problem while using `vanilla' kernels: everything
works well over Gbit ethernet, ADSL, modem etc., and transfer speeds are
indeed higher when ms systems are kept out of the area...

2) I have seen this problem of a disconnect with a RedHat (2.6.9-11.ELsmp)
kernel, with very low throughput (just an ssh console). It's annoying.
It may indeed be the kernel, but I bet that's a RedHat thing. So it's not
really fair to blame kernel people for not trying to help, although a reply
would be nice (I assume from your description of them that you've not had
a reply?)

3) My other RHEL system has the non-SMP kernel, and doesn't have this
problem. I run vanilla SMP kernels on other machines, with gentoo systems,
and have not the slightest problem. Perhaps this narrows the problem a little...

merdyn · 12-08-2005, 12:48 PM

Has there been any resolution to this issue? I'm experiencing it in a Windows-only environment. Both the backup server and the client are running Windows XP SP2. The server is running Cygwin with its SSH and rsync packages installed. The client is using cwRsync. The rsync process initiates fine and gets about 3/4 of the way through before having its connection reset.

Any ideas?

-Joel

The_JinJ · 12-10-2005, 06:09 AM

I noticed that if you test the maximum upload speed of your line and set the rsync bwlimit to a value below it, the error occurs less frequently and you can move larger amounts of data.
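
A minimal sketch of that approach in Python. It only builds the rsync command line and does not run it. rsync's --bwlimit option takes a rate in kilobytes per second. The 80% headroom figure and the helper name are my own assumptions, and the source and destination are borrowed from the first post in this thread.

```python
def rsync_with_bwlimit(measured_upload_kb_per_s, src, dest, headroom_percent=80):
    """Build an rsync-over-ssh command capped below the measured upload speed.

    measured_upload_kb_per_s: upload speed you measured, in kilobytes per second.
    headroom_percent: share of that speed to allow rsync (assumed default: 80).
    """
    limit = measured_upload_kb_per_s * headroom_percent // 100
    return ["rsync", "-avz", f"--bwlimit={limit}", "-e", "ssh", src, dest]


if __name__ == "__main__":
    # Example: a line measured at 1000 KB/s upload, capped at 80% of that.
    cmd = rsync_with_bwlimit(
        1000,
        "/home/hamish/",
        "hamish@hamishnet.homelinux.com:/tmp/hamish",
    )
    print(" ".join(cmd))
```

If the resets still show up, lowering the headroom percentage further is the obvious next thing to try.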