I am maintaining a remote server over SSH. To ensure that my maintenance has only a minor impact on the server's network speed, I rate-limit the traffic I send to it to 200 KiB/s by dropping excess packets with the following ip6tables rule:
Code:
# ip6tables -A INPUT -p tcp -m hashlimit --hashlimit-above 200kb/s -m tcp --destination 3ffe:ffff::dead:beef --dport 22 -j DROP
# ip6tables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP tcp anywhere 3ffe:ffff::dead:beef limit: above 200kb/s tcp dpt:22
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
However, when I now saturate the link, the SSH sessions drop at predictable intervals of roughly 15 minutes:
Code:
$ for TRIAL in `seq 1 5`
> do
> yes | dd status=progress if=/dev/stdin bs=1k count=$((500*1024)) 2>dd.$TRIAL.log |
> ssh -vvv remotehost 'cat >/dev/null' 2>&1 |
> while read LINE
> do
> printf '%s\t%s\n' `date +%H:%M:%S` "$LINE"
> done | tee output.$TRIAL.log
> done
$ tail <output.1.log
20:51:53 debug2: channel 4: rcvd adjust 131072
20:51:54 debug2: channel 4: rcvd adjust 131072
20:51:55 debug2: channel 4: rcvd adjust 131072
20:51:55 debug2: channel 4: rcvd adjust 131072
20:51:56 debug2: channel 4: rcvd adjust 131072
20:51:56 debug2: channel 4: rcvd adjust 131072
20:51:57 debug2: channel 4: rcvd adjust 131072
20:51:58 debug2: channel 4: rcvd adjust 131072
20:51:58 debug3: send packet: type 1
20:51:58 packet_write_wait: Connection to 3ffe:ffff::dead:beef port 22: Broken pipe
$ for TRIAL in `seq 2 5`; do tail -n 1 <output.$TRIAL.log; done
21:07:34 packet_write_wait: Connection to 3ffe:ffff::dead:beef port 22: Broken pipe
21:23:11 packet_write_wait: Connection to 3ffe:ffff::dead:beef port 22: Broken pipe
21:38:47 packet_write_wait: Connection to 3ffe:ffff::dead:beef port 22: Broken pipe
21:54:24 packet_write_wait: Connection to 3ffe:ffff::dead:beef port 22: Broken pipe
$ for TRIAL in `seq 1 5`; do cat <dd.$TRIAL.log; echo; done
190336000 bytes (190 MB, 182 MiB) copied, 925.446 s, 206 kB/s
190317568 bytes (190 MB, 182 MiB) copied, 925.541 s, 206 kB/s
190258176 bytes (190 MB, 181 MiB) copied, 925.136 s, 206 kB/s
190503936 bytes (191 MB, 182 MiB) copied, 926.104 s, 206 kB/s
190619648 bytes (191 MB, 182 MiB) copied, 926.24 s, 206 kB/s
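To put a number on "predictable", the gaps between the "Broken pipe" timestamps can be computed directly. This is a minimal sketch: the helper name drop_intervals is made up for illustration, and it assumes all timestamps fall on the same day.

```shell
#!/bin/sh
# Hypothetical helper: read HH:MM:SS timestamps from standard input, one per
# line, and print the number of seconds between consecutive timestamps.
# Assumes all timestamps fall on the same day (no wrap past midnight).
drop_intervals() {
    awk -F: '
        { now = $1 * 3600 + $2 * 60 + $3 }   # seconds since midnight
        NR > 1 { print now - prev }          # gap to the previous timestamp
        { prev = now }
    '
}

# Timestamps of the "Broken pipe" lines from trials 1 to 5 above.
printf '%s\n' 20:51:58 21:07:34 21:23:11 21:38:47 21:54:24 | drop_intervals
```

For the five trials this prints 936, 937, 936 and 937 seconds, so consecutive drops are a little over 15.5 minutes apart, while each dd run itself lasted about 925 seconds. The same function could be fed from the logs, for example with `tail -n 1 output.$TRIAL.log | cut -f1` inside the loop.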
On the remote side, the executed commands are still running, so the server has not detected that the sessions dropped. This suggests that the problem is on the client side:
Code:
$ ssh remotehost ps ax | grep -F 'cat >/dev/null'
6999 ? Ss 0:00 bash -c cat >/dev/null
13084 ? Ss 0:00 bash -c cat >/dev/null
13425 ? Ss 0:00 bash -c cat >/dev/null
13593 ? Ss 0:00 bash -c cat >/dev/null
13779 ? Ss 0:00 bash -c cat >/dev/null
If I rate-limit the data I feed to SSH with pv (with the ip6tables limit raised to 300kb/s), the SSH sessions no longer drop, so I use that as a workaround until I find a solution:
Code:
# ip6tables -D INPUT -p tcp -m hashlimit --hashlimit-above 200kb/s -m tcp --destination 3ffe:ffff::dead:beef --dport 22 -j DROP
# ip6tables -A INPUT -p tcp -m hashlimit --hashlimit-above 300kb/s -m tcp --destination 3ffe:ffff::dead:beef --dport 22 -j DROP
$ for TRIAL in `seq 6 10`
> do
> yes | dd status=progress if=/dev/stdin bs=1k count=$((500*1024)) 2>dd.$TRIAL.log |
> pv -q -L 200k | ssh -vvv remotehost 'cat >/dev/null' 2>&1 |
> while read LINE
> do
> printf '%s\t%s\n' `date +%H:%M:%S` "$LINE"
> done | tee output.$TRIAL.log
> done
$ tail <output.6.log
22:48:14
22:48:14 debug1: channel 3: free: port listener, nchannels 1
22:48:14 debug3: channel 3: status: The following connections are open:
22:48:14
22:48:14 debug1: fd 0 clearing O_NONBLOCK
22:48:14 debug1: fd 1 clearing O_NONBLOCK
22:48:14 debug1: fd 2 clearing O_NONBLOCK
22:48:14 Transferred: sent 524986928, received 94512 bytes, in 2925.9 seconds
22:48:14 Bytes per second: sent 179429.8, received 32.3
22:48:14 debug1: Exit status 0
$ for TRIAL in `seq 6 10`; do tail -n 1 <dd.$TRIAL.log; done
524288000 bytes (524 MB, 500 MiB) copied, 2919.03 s, 180 kB/s
524288000 bytes (524 MB, 500 MiB) copied, 2559.03 s, 205 kB/s
524288000 bytes (524 MB, 500 MiB) copied, 2644.5 s, 198 kB/s
524288000 bytes (524 MB, 500 MiB) copied, 2559.03 s, 205 kB/s
524288000 bytes (524 MB, 500 MiB) copied, 2559.01 s, 205 kB/s
This tells me that the SSH session drops have something to do with SSH control messages not getting through, but why would this be the case?
The TCP keepalive packets are likely dropped, but the kernel does not send the first keepalive packet until the connection has been idle for two hours, long after my SSH sessions drop at around 15 minutes, so this is unlikely to be the cause of my problem (/proc/sys/net/ipv4/tcp_keepalive_time applies to IPv6 as well as IPv4):
Code:
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
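The timing argument can be made concrete. The sketch below assumes the stock Linux values for the two companion settings, tcp_keepalive_intvl (75) and tcp_keepalive_probes (9); neither is shown above, so treat them as assumptions. The function name keepalive_deadline is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: seconds of idle time after which TCP keepalive would
# declare a connection dead.
#   $1 = tcp_keepalive_time    (idle seconds before the first probe)
#   $2 = tcp_keepalive_intvl   (seconds between probes)
#   $3 = tcp_keepalive_probes  (unanswered probes before giving up)
keepalive_deadline() {
    echo $(( $1 + $2 * $3 ))
}

# 7200 is the value read above; 75 and 9 are assumed kernel defaults.
keepalive_deadline 7200 75 9
```

With those values the result is 7875 seconds, well over two hours, which is far beyond the roughly 15 minutes after which the sessions drop. On the actual machine, the three arguments could be read from /proc/sys/net/ipv4/tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes.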
The SSH client is set to time out if it does not receive a response to three consecutive SSH keepalive messages, which are sent at 5-minute intervals. Judging by the timing alone, this seems a much more likely cause, but the SSH debug output above does not indicate that the client is sending these messages. There is only a burst of messages at the beginning (authentication, opening a channel, etc.) and then a disconnect message after 15 minutes, with nothing in between:
Code:
$ grep 'send packet' <output.1.log
20:36:33 debug3: send packet: type 20
20:36:33 debug3: send packet: type 30
20:36:33 debug3: send packet: type 21
20:36:33 debug3: send packet: type 5
20:36:33 debug3: send packet: type 50
20:36:33 debug3: send packet: type 50
20:36:33 debug3: send packet: type 50
20:36:33 debug3: send packet: type 90
20:36:33 debug3: send packet: type 80
20:36:33 debug3: send packet: type 98
20:36:33 debug3: send packet: type 98
20:51:58 debug3: send packet: type 1
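For reference, the packet type numbers in this log can be decoded with the message numbers assigned in RFC 4250. The following is a minimal sketch covering only the types that appear above; the function name ssh_msg_name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map an SSH packet type number to its RFC 4250 name.
# Only the types seen in the log above are listed.
ssh_msg_name() {
    case "$1" in
        1)  echo SSH_MSG_DISCONNECT ;;
        5)  echo SSH_MSG_SERVICE_REQUEST ;;
        20) echo SSH_MSG_KEXINIT ;;
        21) echo SSH_MSG_NEWKEYS ;;
        30) echo 'key exchange method specific' ;;  # range 30-49 in RFC 4250
        50) echo SSH_MSG_USERAUTH_REQUEST ;;
        80) echo SSH_MSG_GLOBAL_REQUEST ;;
        90) echo SSH_MSG_CHANNEL_OPEN ;;
        98) echo SSH_MSG_CHANNEL_REQUEST ;;
        *)  echo "unknown ($1)" ;;
    esac
}

# Decode the distinct types from the log, in order of first appearance.
for TYPE in 20 30 21 5 50 90 80 98 1
do
    printf '%s\t%s\n' "$TYPE" "$(ssh_msg_name "$TYPE")"
done
```

To my understanding, the OpenSSH client sends its ServerAlive probes as SSH_MSG_GLOBAL_REQUEST (type 80) messages. The log contains a single type 80 packet, at 20:36:33 during session setup, and none afterwards. The final type 1 packet is the SSH_MSG_DISCONNECT.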
Furthermore, if I replace the remote command cat >/dev/null with tee >/dev/null, the output is echoed back to me, yet the SSH session still drops, so receiving server responses does not seem to be the problem. In the opposite direction, the server does not send SSH keepalive messages at all:
Code:
$ cat .ssh/config
Host remotehost
User username
Hostname 3ffe:ffff::dead:beef
ControlMaster auto
ControlPath /var/tmp/remotehost.socket
TCPKeepAlive yes
ServerAliveInterval 300
ServerAliveCountMax 3
$ ssh remotehost cat /etc/ssh/sshd_config
PasswordAuthentication no
TCPKeepAlive yes
ClientAliveInterval 0
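With the client settings above, the SSH-level keepalive timeout works out to ServerAliveInterval times ServerAliveCountMax. A minimal sketch of that arithmetic follows; the function name server_alive_timeout is made up for illustration, and it expects "key value" lines such as those printed by `ssh -G`.

```shell
#!/bin/sh
# Hypothetical helper: read "key value" option lines from standard input and
# print ServerAliveInterval * ServerAliveCountMax in seconds. Keys are matched
# case-insensitively because `ssh -G` prints them in lower case.
server_alive_timeout() {
    awk '
        tolower($1) == "serveraliveinterval" { interval = $2 }
        tolower($1) == "serveralivecountmax" { count = $2 }
        END { print interval * count }
    '
}

# The two values from the .ssh/config shown above.
printf '%s\n' 'ServerAliveInterval 300' 'ServerAliveCountMax 3' | server_alive_timeout
```

This prints 900, that is, 15 minutes, which is close to the observed drop interval. To check the values the client actually uses, the function could be fed with `ssh -G remotehost | server_alive_timeout`.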
I would appreciate your input. After reading the sticky, I figured this question would qualify as a networking question, because the issue is related to fine-tuning iptables and SSH, but please let me know if you feel it would be more at home elsewhere.