cannot connect to self ... not even localhost

Skaperen · 12-08-2011, 03:08 PM

I've seen this before a few months ago. It just happened again.

A machine cannot connect to or ping itself. This affects EVERY IP address (both v4 and v6) on every interface. It also includes localhost (127.0.0.1 and ::1). Everything works OK when connecting from and to other hosts on the LAN on IPs that should work that way.

When I run tcpdump to see any traffic, I see only the traffic from the LAN. All pings and attempts to connect to any address on the machine simply do not show up in the tcpdump output. This was tried with both "-i any" and "-i <interface>".

I try taking interfaces down and back up and the problem persists. I remove routes and put them back and the problem persists. I reboot and it comes back up just fine.

In the previous instance I saw, it happened about 40 days or so into the uptime. This time it happened about an hour after a reboot. Any ideas what kind of kernel state would do this (and maybe how to fix it without a reboot)?

T3RM1NVT0R · 12-08-2011, 04:16 PM

Hi Skaperen,

As I understand it, when this happens on the machine (I will call it mainmac), you are able to connect to mainmac from other systems and mainmac is able to connect to other systems. However, mainmac is not able to connect to itself. During this period mainmac cannot ping itself on any interface, whether IPv4 or IPv6, and cannot even ping itself on lo.

It would be great if you could let us know the following:

1. Output of ifconfig command at that time.
2. Output of route command at that time.
3. Output of tracepath, tracing the path from the machine to itself (i.e. tracepath mainmac).
4. Output of the ping command, pinging mainmac first by IP address and then by hostname.

You said that a reboot fixes the problem. Did you try restarting the network using service network restart?

If service network restart works, then you can automate that process with a script. You can write a script that pings mainmac from itself every 30 minutes and, if the ping gets no response, issues the service network restart command. However, I would not consider that a solution; it is a workaround.
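A minimal sketch of such a watchdog, under some assumptions: the ping flags are those of iputils ping, the function names self_check and watchdog are made up for illustration, and the default recovery action only prints what it would do, because restarting networking on a remote box can cut off your own access.

```shell
#!/bin/sh
# Sketch of a self-ping watchdog. PING_CMD and RECOVER_CMD are assumptions
# made for illustration; adjust both for your distribution.
PING_CMD=${PING_CMD:-"ping -c 1 -W 2"}
# The default recovery action only prints what it would do; nothing is restarted.
RECOVER_CMD=${RECOVER_CMD:-"echo would run: service network restart"}

# self_check TARGET: print "ok TARGET" and return 0 if TARGET answers one ping,
# otherwise print "FAIL TARGET" and return 1.
self_check() {
    if $PING_CMD "$1" >/dev/null 2>&1; then
        echo "ok $1"
    else
        echo "FAIL $1"
        return 1
    fi
}

# watchdog TARGET...: check every target, then run the recovery command once
# if any of them failed. Returns 1 if a failure was seen, 0 otherwise.
watchdog() {
    failed=0
    for target in "$@"; do
        self_check "$target" || failed=1
    done
    if [ "$failed" -eq 1 ]; then
        $RECOVER_CMD
    fi
    return "$failed"
}
```

To use it, add a call such as watchdog 127.0.0.1 at the end of the file and run the script from cron every 30 minutes. Older systems may need ping6 for ::1.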

Skaperen · 12-08-2011, 11:44 PM

Quote:

Originally Posted by T3RM1NVT0R

Hi Skaperen,

As I understand it, when this happens on the machine (I will call it mainmac), you are able to connect to mainmac from other systems and mainmac is able to connect to other systems. However, mainmac is not able to connect to itself. During this period mainmac cannot ping itself on any interface, whether IPv4 or IPv6, and cannot even ping itself on lo.

This is a correct description of the problem.

Quote:

Originally Posted by T3RM1NVT0R

It would be great if you could let us know the following:

1. Output of ifconfig command at that time.
2. Output of route command at that time.
3. Output of tracepath, tracing the path from the machine to itself (i.e. tracepath mainmac).
4. Output of the ping command, pinging mainmac first by IP address and then by hostname.

The ifconfig and route command outputs were identical to their usual outputs, except for the statistical parts.

I did not do tracepath. I did do traceroute. The output showed nothing responding starting at hop 1 and going on. Basically "1 * * *" and "2 * * *" and so on.

I did "ping 127.0.0.1" and "ping6 ::1". Both simply had no output. There was not even any output saying "no route to <whatever>". It was as if all packets went into a black hole.

As mentioned before, I did tcpdump even while doing the pings, and there was nothing captured except traffic to/from outside the machine. So if I pinged another host, I'd see the ICMP going out to the other host. If I pinged any local address, there was nothing at all. It is as if whatever was discarding packets, it was doing it before the point where tcpdump captures.

There were no netfilter rules in place.

Quote:

Originally Posted by T3RM1NVT0R

You said that a reboot fixes the problem. Did you try restarting the network using service network restart?

I was not willing to risk that since I was reaching the host from remote. These kinds of things can kill your access AND fail to finish restarting it.

In the past, the machines this happened on were local. Today it was a remote machine that people needed to be using. So my focus was on making sure I didn't end up with a dead box. Unfortunately, I don't have a power cycling hookup on there, yet (it's coming, eventually).

Also, I limited the manual use of ifconfig to not disabling anything. One thing I did was change one of the IP addresses. The machine has 2 interfaces with the same IP on both, and both ports are connected to the same switch. It's merely "dumb fallback", not bonding or anything like that. I used ifconfig to change eth1 from the IP it normally has to a different IP. After doing that, neither IP would respond in-host, while the new IP on eth1 did respond to a ping from another host on the same LAN.

Maybe I could have taken eth1 down and brought it back up. I'll try to remember to do that the next time it happens on a machine with 2 or more interfaces (so far every incident has been on a different machine). But it seemed to me NOT to be an interface issue, since it affected eth0, eth1, and lo, all at the same time.

Quote:

Originally Posted by T3RM1NVT0R

If service network restart works, then you can automate that process with a script. You can write a script that pings mainmac from itself every 30 minutes and, if the ping gets no response, issues the service network restart command. However, I would not consider that a solution; it is a workaround.

Agreed that such a thing is a workaround.

I'm wondering if network traffic to self is a special case handled by some special code to "turn the packets around 180 degrees to come back". I do know SOME routers don't support connecting to self on real interfaces, requiring self connection to always be done by localhost or 127.0.0.1. The explanation is that it's faster to not bother testing outgoing packets to see if they are destined to local, since the router exists primarily to handle packets from/to physical networks. But apparently host systems (Unixes, BSDs, Linux, and even Windows) support this. But is this support carried out with special tests and such? I don't know. That's what I'm curious about.

So right now my curiosities are:

1. Has anyone ever seen this happen before?

2. Is there any special case code that handles just this traffic that could for some reason decide to not do it anymore? Could it be this happens if the "lo" interface "dies" even if other interfaces have the IP address in question?

I'm thinking of trying to set up a test on a machine I have a console on to see if I can disable "lo" in some way that causes it to stop allowing self connections/pings on any IP, but still allows packets to/from other machines.

Skaperen · 12-09-2011, 07:55 AM

If I bring down the "lo" interface (e.g. do "ifconfig lo down"), then this exact problem happens. I don't know if this is the only cause, or the cause of the events I've seen, but this definitely is a possible cause. It definitely did not appear to be down in the most recent event, but some other aspect of the "lo" interface could have become non-functional. I'll try to remember this for next time and try taking "lo" down and back up when it happens to see if that corrects it (and focus on any details of the "lo" interface).

All communication to/from other hosts continues to work when "lo" is down.

Note that if "lo" is down, packets directed to the IP address of any other interface also do not work. So this is an answer to one of my questions ... "is there a special mechanism for packets destined to any local IP address regardless of the interface". Well, they somehow are made to go through, or check, the "lo" interface.
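One way to see this mechanism from userspace, as a rough sketch: on Linux, addresses configured on the host are listed in the "local" routing table, and asking the kernel for the route to one of them reports the loopback device. The helper names route_dev and local_route_dev are made up for illustration, and the sketch assumes the iproute2 "ip" command is installed.

```shell
#!/bin/sh
# route_dev: read `ip route get ADDR` output on stdin and print the device
# the kernel selected (the word following "dev").
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# local_route_dev ADDR: ask the kernel which device it would use for ADDR.
# Assumes iproute2 is available.
local_route_dev() {
    ip route get "$1" | route_dev
}
```

Running local_route_dev against 127.0.0.1, or against the address configured on eth0, should normally report lo rather than eth0, and "ip route show table local" lists all such routes at once. Comparing that output between a healthy and an affected machine would be one more thing to check.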

Cedrik · 12-09-2011, 12:55 PM

Isn't it a problem at the application level? I mean, is /etc/hosts OK?

Maybe there are some hints in /var/log/messages

T3RM1NVT0R · 12-09-2011, 02:58 PM

Hi Skaperen,

You answered your query yourself :-)

The test you did on the loopback adapter was perfect. So, basically, if no device in the 127.x.x.x range is up on the system, then the system will not be able to ping itself. The only reason I can think of for this is that the loopback adapter is responsible for presenting IP address information to the machine itself. So if lo is down, then the machine is unaware of the IPs on its other interfaces.

This is a test I did, changed slightly from yours. I brought lo down and then tried to ping the machine; it was unable to ping itself, but I was able to connect to it from outside. Then I copied the ifcfg-lo file to ifcfg-lo1, as follows:

Code:

cp /etc/sysconfig/network-scripts/ifcfg-lo /etc/sysconfig/network-scripts/ifcfg-lo1

and then I brought up ifcfg-lo1 instead of ifcfg-lo

Code:

ifup lo1

and then I was able to ping the machine from itself again.

You can use the above method as a workaround when you are not able to ping the machine from itself. Instead of touching the existing configuration, you can simply bring up another interface, lo1, from ifcfg-lo1.

However, we need to find out why lo exited. We can get this information from the /var/log/messages file. If you paste the output of the following:

Code:

cat /var/log/messages

Then we can work together to find out the reason why this happened.

Skaperen · 12-11-2011, 01:17 AM

Quote:

Originally Posted by Cedrik

Isn't it a problem at the application level? I mean, is /etc/hosts OK?

Maybe there are some hints in /var/log/messages

If it was that, it would be persistent across reboots. Instead, it just suddenly happens with no apparent state. /etc/hosts was not changed. It wouldn't be used for IP addresses, anyway. I saw nothing in /var/log/messages or /var/log/syslog.

Quote:

Originally Posted by T3RM1NVT0R

However, we need to find out why lo exited. We can get this information from the /var/log/messages file. If you paste the output of the following:

Code:

cat /var/log/messages

Then we can work together to find out the reason why this happened.

But "lo" did not go down in the actual incident. I was merely able to get the same thing to happen by taking "lo" down manually. But maybe the kernel data structures related to interface "lo" got corrupt somehow or in some other way "lo" entered an unworkable or unusable state?

The machine was rebooted just before 14:41:20 and that timestamp was the last of the kernel bootup messages. One of the developers logged in at 15:24 and noticed that he could not get connected to an emulator he had just started. He emailed me about it and I logged in at 15:42 as root to see what was happening. I could see the problem, but not any particular cause. The command "ifconfig -a" showed nothing out of the ordinary. I did some console flipping (like what astronomers used to do with a mirror and a pair of photos to spot moving objects) between the affected machine and a like machine that was not affected where I had also done "ifconfig -a". There was no difference except in the expected places (e.g. different IPs, usage stats, MAC, etc). I also did the same with "route -n" output. Then I rebooted at 16:01. The /var/log/messages content between 14:41:20 and 16:01 showed ONLY some messages about eth0 and eth1 entering promiscuous mode (because I ran tcpdump to check if it showed the missing packets) and some usual messages from the emulators (not running as root) about unrecognized ioctl() calls (they produce a lot of those). They should not be able to affect network interfaces unless there is a bug somewhere.

In two earlier incidents, one happening on one of these servers, and before that on my desktop at home, there was a substantial amount of time from kernel reboot to incident (around 40 days for the server case, and probably longer for my desktop at home). In the most recent case it happened shortly after the reboot and very little was going on. The developer that logged in had started the emulator and then noticed the issue. If the activity of the emulator caused it, it happened in a very short window this time (but I'll need to ask him how long it was from starting the emulator to when he noticed the problem, to narrow the time frame down further).

I know the kernel does a lot of juggling of packets in something called SK buffers and such, with many queues. Could some thread that was picking up these buffers from a queue for "lo" have died or gotten blocked?

T3RM1NVT0R · 12-11-2011, 07:49 AM

As you said, when you logged in there was nothing unusual. But there has to be something that caused this. You said that the machine was rebooted at 14:41, and that the developer's emulator, which I am sure requires the machine to communicate with itself, failed around 15:24. That means something went wrong within this time frame. Was the 14:41 reboot triggered by a cron job, or did you reboot it yourself? If you rebooted it, did you check afterwards whether you could ping all interfaces from the machine itself?

It would be great if you could paste/attach the logs from between 14:41 and 15:24. It would also be a good idea to copy ifcfg-lo to ifcfg-lo1 now, so that the next time this happens we can try bringing up lo1. If that resolves the issue, then there is nothing wrong with the loopback TCP stack and we can look in other directions. But if it does not, then there is something wrong in the way the kernel handles the TCP stack for loopback, causing it to exit.
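To cut the log down to just that window, something like the following sketch could be used. It assumes the traditional syslog timestamp format ("Mon D HH:MM:SS host ..."); the function name log_window is made up for illustration, and the date in the usage line is only a placeholder.

```shell
#!/bin/sh
# log_window MONTH DAY START END [FILE]
# Print syslog-style lines ("Mon D HH:MM:SS host ...") for the given day whose
# time lies between START and END inclusive. Reads stdin if FILE is omitted.
# HH:MM:SS values are fixed width, so plain string comparison orders them correctly.
log_window() {
    mon=$1 day=$2 start=$3 end=$4
    shift 4
    awk -v mon="$mon" -v day="$day" -v start="$start" -v end="$end" \
        '$1 == mon && $2 == day && $3 >= start && $3 <= end' "$@"
}
```

For example: log_window Dec 8 14:41:00 15:24:00 /var/log/messages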

Skaperen · 12-11-2011, 03:32 PM

When I did the reboot before 14:42 I merely verified the machine was up by logging in. I could get in, so as far as the intended quick check was concerned, the reboot succeeded. I didn't have cause to check if the machine could ping itself. After notification of the trouble, I logged in again. This time I checked and found that I could not make the contact the developer reported (e.g. telnet to the machine by its name). I then tried other ways (using the IP address, by "localhost", by 127.0.0.1, by ::1, and the same for the other interface). I tried connections, pings, and even traceroute (hoping maybe it would see the packets at least trying to go somewhere). None of the "to self" connections/pings worked. I tried pinging other hosts and that worked. I tried connecting from other hosts (my already being logged in indicated this worked in at least one case), and those worked. So I started looking around at "route -n" and "ifconfig -a" outputs. I even compared them to another machine and could see no important difference.

There was some urgency to get things working. So at that point I tried the reboot, knowing it had fixed it in past cases (though I didn't do as much looking around in the past).

If it happens again, I'm certainly going to look for more information and try to get it recorded. I posted about this to see if it had been heard of before, or if analysis by someone that understood the network infrastructure inside the kernel could suggest possible causes. And I will focus on the state of "lo" next time.
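A rough sketch of what I could run to record that state next time. It assumes iproute2 and sysfs are available; the function names iface_is_up and lo_snapshot are made up for illustration. The only hard fact relied on is that IFF_UP is bit 0x1 of the interface flags exposed in /sys/class/net/<iface>/flags.

```shell
#!/bin/sh
# iface_is_up HEXFLAGS: succeed if the IFF_UP bit (0x1) is set in a flags
# value as printed by /sys/class/net/<iface>/flags (e.g. 0x9).
iface_is_up() {
    [ $(( $1 & 0x1 )) -ne 0 ]
}

# lo_snapshot: dump loopback-related state for later comparison with a
# healthy machine.
lo_snapshot() {
    date
    flags=$(cat /sys/class/net/lo/flags)
    if iface_is_up "$flags"; then
        echo "lo flags=$flags (IFF_UP set)"
    else
        echo "lo flags=$flags (IFF_UP clear)"
    fi
    ip -s link show dev lo          # link state plus RX/TX packet and drop counters
    ip addr show dev lo             # 127.0.0.1 and ::1 should be listed
    ip route show table local       # local-address routes, normally via dev lo
    grep -E '^ *lo:' /proc/net/dev  # raw loopback counters
}
```

Calling lo_snapshot with its output redirected to a file, once before and once after a test ping to 127.0.0.1, would show whether the loopback counters move at all.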

For reasons of confidentiality, I would have to redact all the emulator messages in the logs. And that leaves nothing between when the earlier reboot finished and the latter reboot started, besides the promiscuous mode messages caused by my running tcpdump on the Ethernet interfaces. There's really nothing there to work from.