What could be causing SYN_ACK's to be delayed from certain clients?
Hi,
I'm having a problem with TCP connections taking a long time to establish, often timing out. The strange thing is that it only seems to happen from some workstations on our LAN and not others, and it never happens from home or other internet connections.
I can see the SYN packet leave the source workstation, enter the firewall LAN interface, leave the firewall WAN interface and arrive at the destination server's interface. Then the SYN ACK packet doesn't get sent for 2 to 300 seconds. Once I see the SYN ACK sent from the server, the connection is ESTABLISHED.
I'm hoping someone can help me diagnose the source of the problem.
Are you able to sniff packets on the server side? Or are you just assuming the SYN packet arrived there?
If the former, this will be a lot easier to troubleshoot. What application(s) are you seeing this behavior with? Can you reproduce the problem with an arbitrary netcat listener on the server? See the nc(1) manpages if you're not familiar with it.
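For anyone who wants to put numbers on the handshake delay from the client side, here is a minimal Python sketch that times repeated TCP connects to a listener (for example one started with nc on the server). The names `time_connect` and `run_trials` are placeholders of mine, not anything from this thread, and the 60 second timeout is an arbitrary assumption.

```python
import socket
import time


def time_connect(host, port, timeout=60.0):
    """Return the seconds taken to complete the TCP handshake, or None on failure.

    Only the connect step is timed, so the result reflects how long the
    SYN / SYN-ACK / ACK exchange took, including any SYN retransmissions.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return None
    elapsed = time.monotonic() - start
    sock.close()
    return elapsed


def run_trials(host, port, attempts=5, timeout=60.0):
    """Time several connects in a row and return the list of results."""
    return [time_connect(host, port, timeout) for _ in range(attempts)]
```

Calling something like `run_trials("<server>", 6686)` from a problem workstation and from a working one should make the difference visible without having to read packet captures, although a capture on the server is still needed to see where the delay comes from.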
The applications are SSHD and HTTPD, possibly others. I tried an nc connection a number of times on port 6686 as you suggested, and it does the same thing (see tcpdumps below): sometimes it connects, sometimes it doesn't.
I am able to run tcpdump on the destination server, and I do see the SYN packet arrive, but no SYN ACK leaves for some time. As you can see in the successful connection dump, it takes 21 seconds to respond with the SYN ACK.
It doesn't seem to happen with Windows source machines though, which makes absolutely no sense to me. I'm seeing it on Linux and Mac OS X machines.
Timed out - tcpdump
Code:
17:05:21.545294 IP source.interface.net.53219 > destination.interface.com.6686: S 1019474965:1019474965(0) win 5840 <mss 1452,sackOK,timestamp 104729087 0,nop,wscale 3>
17:05:24.608754 IP source.interface.net.53219 > destination.interface.com.6686: S 1019474965:1019474965(0) win 5840 <mss 1452,sackOK,timestamp 104732087 0,nop,wscale 3>
17:05:30.569673 IP source.interface.net.53219 > destination.interface.com.6686: S 1019474965:1019474965(0) win 5840 <mss 1452,sackOK,timestamp 104738087 0,nop,wscale 3>
17:05:42.655230 IP source.interface.net.53219 > destination.interface.com.6686: S 1019474965:1019474965(0) win 5840 <mss 1452,sackOK,timestamp 104750087 0,nop,wscale 3>
17:06:06.534508 IP source.interface.net.53219 > destination.interface.com.6686: S 1019474965:1019474965(0) win 5840 <mss 1452,sackOK,timestamp 104774087 0,nop,wscale 3>
17:06:54.527179 IP source.interface.net.51545 > destination.interface.com.6686: S 1019474965:1019474965(0) win 5840 <mss 1452,sackOK,timestamp 104822087 0,nop,wscale 3>
Successful - tcpdump
Code:
17:09:24.224013 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840 <mss 1452,sackOK,timestamp 104971796 0,nop,wscale 3>
17:09:27.322690 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840 <mss 1452,sackOK,timestamp 104974797 0,nop,wscale 3>
17:09:33.221183 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840 <mss 1452,sackOK,timestamp 104980797 0,nop,wscale 3>
17:09:45.261955 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840 <mss 1452,sackOK,timestamp 104992797 0,nop,wscale 3>
17:09:45.262008 IP destination.interface.com.6686 > source.interface.net.61725: S 4059602422:4059602422(0) ack 1263216291 win 5792 <mss 1460,sackOK,timestamp 2534800086 104992797,nop,wscale 7>
17:09:45.455884 IP source.interface.net.61725 > destination.interface.com.6686: . ack 1 win 730 <nop,nop,timestamp 104993033 2534800086>
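The spacing of the SYNs in both dumps (roughly 3, 6, 12, 24 and 48 seconds apart) looks like the client's normal SYN retransmission backoff, which suggests the client is behaving as expected while it waits. Below is a small Python sketch for pulling those gaps, and the time to the first SYN ACK, out of tcpdump text output. It assumes the older tcpdump line format shown above; the function names are mine, and the sample lines are the successful capture with the TCP options trimmed for brevity.

```python
from datetime import datetime

# Lines from the successful capture above, with the TCP options trimmed.
SUCCESSFUL = [
    "17:09:24.224013 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840",
    "17:09:27.322690 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840",
    "17:09:33.221183 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840",
    "17:09:45.261955 IP source.interface.net.61725 > destination.interface.com.6686: S 1263216290:1263216290(0) win 5840",
    "17:09:45.262008 IP destination.interface.com.6686 > source.interface.net.61725: S 4059602422:4059602422(0) ack 1263216291 win 5792",
    "17:09:45.455884 IP source.interface.net.61725 > destination.interface.com.6686: . ack 1 win 730",
]


def parse_time(line):
    """Parse the leading HH:MM:SS.ffffff timestamp of a tcpdump text line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")


def is_syn_ack(line):
    """True for a line with the S flag that also carries an 'ack' field."""
    fields = line.split()
    return "S" in fields and "ack" in fields


def gaps(lines):
    """Seconds between each consecutive pair of captured packets."""
    times = [parse_time(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def syn_to_syn_ack_delay(lines):
    """Seconds from the first captured packet to the first SYN ACK, or None."""
    first = parse_time(lines[0])
    for line in lines:
        if is_syn_ack(line):
            return (parse_time(line) - first).total_seconds()
    return None


if __name__ == "__main__":
    print([round(g, 3) for g in gaps(SUCCESSFUL)])
    print(syn_to_syn_ack_delay(SUCCESSFUL))
```

For the successful capture this gives a delay of about 21 seconds between the first SYN and the SYN ACK, matching what was described. The parser does not handle captures that cross midnight, since the lines carry no date.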
What about ping, IPv6, firewalls, filters, routing ...?
What difference does tcpdump show when the connection is made from a Windows machine?
What about other services? Telnet sessions?
I'd look to see if there is any icmp traffic coming back.
Also, I'd use mtr and tcptraceroute to see if there is any odd behavior. I would also run those commands from the server back toward the source systems.
Offhand, seeing that the traffic arrives at your server and isn't being answered promptly, I'd be inclined to believe your server is the problem. I've seen similar activity on networks where there are routing conflicts, but since the Windows systems you have tested with do not exhibit this problem, I doubt that is the issue.
I'd also look into whatever the service the server is running for clues. Check the log files it has, or if that isn't enough, enable debugging.
Hmmnn..something else that comes to mind, although it's just a shot in the dark -- does the environment where your server is located have any sort of an IDS/IPS?
To follow up on what has been said.
Pings work all the time from all source machines to all destination servers.
Neither source nor destination machines have IPv6 enabled.
Telnetd is not running on the destination. But I think the netcat test indicated it was an issue at the TCP level rather than at application.
Apart from iptables on the destination server I doubt there are any IDS/IPS systems in front.
To add even more confusion, I have some source Linux servers (bare metal and virtual machines; all Linux machines are CentOS 5, including the destinations) which always seem to connect on all protocols through the same gateway. And if I change the gateway on the problem source machines to our secondary, slower internet link, they work.
As the source machine is behind NAT I cannot trace all the way back to the source machine but to the WAN interface on the firewall.
mtr did show high packet loss coming from the destination back to the source network, which is strange. But it doesn't explain why the server isn't sending the SYN ACK. It would if it was sending them and they weren't arriving.
Can you temporarily flush your (server) iptables ruleset and test again to see if the problem persists?
Have you been tweaking any sysctl MIBs on the server?
---
edit: Just noticed this:
Quote:
To add even more confusion, I have some source Linux servers (bare metal and virtual machines; all Linux machines are CentOS 5, including the destinations) which always seem to connect on all protocols through the same gateway. And if I change the gateway on the problem source machines to our secondary, slower internet link, they work.
Argh. So if I understand correctly, going over a particular network path (ISP #1) the problem exists, but over a different network path (ISP #2), the problem does not. Right?
Quote:
To follow up on what has been said.
Pings work all the time from all source machines to all destination servers.
When I mentioned icmp, I wasn't referring to pings so much as networking messages being returned. If there is an MTU issue or something of that nature you could see icmp messages being returned on the source side. If you don't allow that traffic through your firewall, you might not see it. Might be worth checking the public interface on the firewall to see if there is any related traffic.
Quote:
Neither source nor destination machines have IPv6 enabled.
Telnetd is not running on the destination. But I think the netcat test indicated it was an issue at the TCP level rather than at application.
Apart from iptables on the destination server I doubt there are any IDS/IPS systems in front.
I agree with you on the netcat test indicating it is more than likely a tcp/network related issue.
Quote:
To add even more confusion, I have some source Linux servers (bare metal and virtual machines; all Linux machines are CentOS 5, including the destinations) which always seem to connect on all protocols through the same gateway. And if I change the gateway on the problem source machines to our secondary, slower internet link, they work.
As the source machine is behind NAT I cannot trace all the way back to the source machine but to the WAN interface on the firewall.
mtr did show high packet loss coming from the destination back to the source network, which is strange. But it doesn't explain why the server isn't sending the SYN ACK. It would if it was sending them and they weren't arriving.
The mtr results show some definite problems with that network. No matter what I'd be contacting my ISP and asking questions about that.
The reason I suggested tcptraceroute is that it is tcp based, whereas mtr is icmp as I recall.
And you're right about the problem with the response to the traffic. The network can't drop what isn't being sent.
On the server side, is there anywhere else the traffic could be routed? Do you have any sort of VPN tunnels to it? Are there any other interfaces with routes that could cause a problem?
nimnull22 suggested a more detailed tcpdump, but I'd take it further than what he recommended. Might as well get everything. Try this:
Code:
tcpdump -s 0 -Xevvvnni eth0 host <host IP> and port <port #>
You'll see plenty of information, but it could provide some helpful details. You might also capture traffic exiting your firewall on the source side and compare it against the traffic arriving at your server.
It looks like our ISP, Telstra, doesn't allow tcptraceroute, as it goes dead at their gateway. I'm really stumped... As I mentioned before, I can see the SYN packet leave the client, arrive at the LAN interface of our router (pfSense), leave the router on the WAN port and arrive at the destination server. Then, unless a SYN ACK is sent back from the destination, there is no connection. It shouldn't happen.
The only sysctl options that have been changed are:
net.ipv4.tcp_fin_timeout = 1
net.ipv4.tcp_tw_recycle = 1
The rest are the CentOS 5 defaults.
Also note I have tried flushing iptables on the destination and it made no difference.
Last edited by fizzdandantilus; 02-21-2010 at 05:42 PM.
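To double-check which TCP sysctls really differ from stock values on a server, one option is to read them straight from /proc/sys. The sketch below is only an illustration: the `ASSUMED_DEFAULTS` table reflects what I believe the stock Linux values to be for the two settings mentioned above (60 and 0) and should be verified against your own kernel's documentation, and the function names are mine.

```python
from pathlib import Path

# Assumed stock values for the two settings mentioned in the post above.
# These are an assumption; verify them against your kernel's documentation.
ASSUMED_DEFAULTS = {
    "net.ipv4.tcp_fin_timeout": "60",
    "net.ipv4.tcp_tw_recycle": "0",
}


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a string, or None if unreadable."""
    path = Path(root) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        # The entry may not exist on this kernel, or may not be readable.
        return None


def changed_settings(defaults=ASSUMED_DEFAULTS, root="/proc/sys"):
    """Map each setting that differs from its assumed default to (default, current)."""
    changed = {}
    for name, default in defaults.items():
        current = read_sysctl(name, root)
        if current is not None and current != default:
            changed[name] = (default, current)
    return changed


if __name__ == "__main__":
    for name, (default, current) in changed_settings().items():
        print(f"{name}: assumed default {default}, current {current}")
```

Extending the table with other net.ipv4.tcp_* entries makes it easier to compare a problem server against a known-good one.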
I looked over the captures you provided, and a few things stood out. One was how the MAC address to which your server responds changes.
The initial packet:
00:19:e8:e9:03:3f > 00:16:3e:79:4a:c9
Then the response:
00:16:3e:79:4a:c9 > 00:00:0c:07:ac:02
I don't know if that is something that is out of the ordinary, as the environment where your server is hosted is probably set up that way.
The other difference I noticed is the wscale values. We only have one successful capture in the thread, which shows the inbound and outbound wscale value set to 7. The connection attempts that fail show wscale values of 3, and the one where you show a response that did work after an extended period of time shows a response wscale value of 7.
I don't know if that has anything to do with this problem, but it could. I am going to do a bit of reading on the subject and see if I can gain a better understanding of it. I see some hits on a search where others have had problems with it.
00:00:0C:07:AC:02 is the ISP gateway interface and 00:19:e8:e9:03:3f must be another of the ISP routers (multi-homed). It's the only way I can explain it. I can dig further if you think it is worth looking into further.
I can confirm that the destination host has MAC 00:16:3E:79:4A:C9.
I do not understand TCP/IP networking enough to know what wscale is for. I appreciate all your efforts to help understand this issue.
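Since wscale came up: as I understand the TCP window scale option (RFC 1323, later RFC 7323), each side advertises its own shift count in its SYN, and the 16-bit window field in later segments from that side is left-shifted by that count to get the real receive window. The window value in the SYN itself is not scaled. Here is a small Python sketch using numbers from the captures posted earlier in this thread; the function name is mine.

```python
def effective_window(win_field, wscale):
    """Receive window in bytes advertised by a non-SYN segment.

    win_field is the raw 16-bit window value shown by tcpdump ("win N"),
    wscale is the shift count the sender advertised in its own SYN.
    """
    if not 0 <= win_field <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= wscale <= 14:
        raise ValueError("window scale shift must be between 0 and 14")
    return win_field << wscale


# Client ACK in the successful capture: "win 730", and the client's SYN
# carried "wscale 3", so the advertised window is 730 * 8 bytes.
print(effective_window(730, 3))  # 5840
```

That 5840 matches the unscaled "win 5840" in the client's SYN, so the client's window did not change between the SYN and the final ACK. A client advertising wscale 3 while the server advertises wscale 7 is not a mismatch in itself, because the two directions are scaled independently.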
OK, so I have a number of different workstations (Mac, Linux and Windows) on my LAN. The machines that are slow to connect or time out seem to be predominantly Macs and Linux machines. I have not seen this issue with a Windows box yet, but I don't usually use one.
However, I have one Linux box that never displays the connect issues. This is the one I use in my testing and refer to in the previous post.