help debugging a socket leak

rozeboom · 05-06-2004, 10:44 AM

I'm trying to debug a program using lsof. I've been seeing a file-descriptor leak in /proc/$PID/fd from sockets that eventually keeps me from being able to create new network connections when I hit the resource limit.

lsof shows me that the leaked socket handles are identified as "TYPE=sock 0,0 ...can't identify protocol". I've run strace, and no call (network or otherwise) ever returns the file descriptors that match the bad ones in lsof.

My question is, where are these "sock" type sockets being created? Does anyone else know where they come from and how to get rid of them?

Thanks!

infamous41md · 05-06-2004, 01:32 PM

can u post the code?

rozeboom · 05-06-2004, 01:34 PM

I couldn't post the code of the program in question, but I can see if I can create an example.

infamous41md · 05-06-2004, 01:44 PM

the obvious answer is that you're not closing your sockets when you're done with them. if u need them all to be open, check out setrlimit() and i think u can change max # open descriptors.
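
For reference, a rough sketch of the setrlimit() route. RLIMIT_NOFILE is the limit on open descriptors; raise_fd_limit is just a made-up helper name, and it only lifts the soft limit up to the existing hard limit (going past the hard limit needs privileges):

```cpp
#include <sys/resource.h>
#include <cassert>
#include <cstdio>

// Hypothetical helper: raise the soft limit on open descriptors to the hard limit.
// On success, stores the new soft limit in *new_limit and returns true.
bool raise_fd_limit(rlim_t* new_limit) {
    assert(new_limit != nullptr);
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        std::perror("getrlimit");
        return false;
    }
    rl.rlim_cur = rl.rlim_max;  // soft limit may not exceed the hard limit
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        std::perror("setrlimit");
        return false;
    }
    *new_limit = rl.rlim_cur;
    return true;
}
```

Calling it once at startup would be enough, though with a real leak this only buys time.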

rozeboom · 05-06-2004, 01:49 PM

I've been down that road. I've managed to account for the opening and closing of all of the socket calls my program makes. Strace helps there. These socket handles which are being allocated do not seem to be coming from my own code, but rather internally within the system. With Strace I can see every socket I allocate and my program never allocates these. Since my program is not allocating them, I don't know what they are or how to close them down.

rozeboom · 05-06-2004, 01:57 PM

And I don't need more than a few dozen sockets open for any reason. These leaked "blank" sockets appear every time I make a new network connection. Over time, I run out of resources, so raising the limit would only delay the inevitable.
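
One way to watch a leak like this from inside the process is to count descriptors before and after each connection. A minimal sketch, assuming Linux's /proc/self/fd is available; count_open_fds is a made-up helper name, not something from the program in question:

```cpp
#include <dirent.h>
#include <cassert>
#include <cstring>

// Hypothetical helper: count this process's open descriptors by listing
// /proc/self/fd (Linux only). Returns -1 if the directory cannot be opened.
int count_open_fds() {
    DIR* dir = opendir("/proc/self/fd");
    if (!dir) return -1;
    int n = 0;
    while (dirent* e = readdir(dir)) {
        // Skip the "." and ".." entries; everything else is a descriptor number.
        if (std::strcmp(e->d_name, ".") != 0 && std::strcmp(e->d_name, "..") != 0) ++n;
    }
    closedir(dir);
    assert(n > 0);  // the descriptor opendir() used is always listed
    return n - 1;   // do not count the descriptor opendir() itself was using
}
```

If the count after a connect/disconnect cycle is higher than the count before it, something in that cycle created a descriptor that was never closed.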

infamous41md · 05-06-2004, 02:01 PM

well, that's certainly odd. from stracing u have no idea what part of the code they are coming from? are you using lots of libraries? if you are doing some sort of name resolution call before connecting (e.g. gethostbyname, etc.), then of course there will be a few sockets opened for dns resolution, but i certainly hope that is not the problem or others would have encountered it already. w/o seeing code i dont know what else to tell you. how long does it take for u to max out? what kinda program is this?

rozeboom · 05-06-2004, 02:28 PM

Here's the lsof output, if it helps:
COMMAND   PID  USER FD  TYPE DEVICE        SIZE    NODE NAME
DoubleTak 2454 root cwd DIR  3,3           4096 1188252 /opt/NSI
DoubleTak 2454 root rtd DIR  3,3           4096       2 /
DoubleTak 2454 root txt REG  3,3        8059307 1188473 /opt/NSI/DoubleTake
DoubleTak 2454 root mem REG  3,3         103044 1300532 /lib/ld-2.3.2.so
DoubleTak 2454 root mem REG  3,3          79744  682809 /lib/tls/libpthread-0.29.so
DoubleTak 2454 root mem REG  3,3          91604 1300547 /lib/libnsl-2.3.2.so
DoubleTak 2454 root mem REG  3,3          23668 1300541 /lib/libcrypt-2.3.2.so
DoubleTak 2454 root mem REG  3,3         366424 1300587 /lib/libacl.so.1.1.0
DoubleTak 2454 root mem REG  3,3          15084 1300543 /lib/libdl-2.3.2.so
DoubleTak 2454 root mem REG  3,3         710608  374097 /usr/lib/libstdc++.so.5.0.3
DoubleTak 2454 root mem REG  3,3         211948  682807 /lib/tls/libm-2.3.2.so
DoubleTak 2454 root mem REG  3,3          30324 1300589 /lib/libgcc_s-3.2.2-20030225.so.1
DoubleTak 2454 root mem REG  3,3          49287 1300585 /lib/libattr.so.1.1.0
DoubleTak 2454 root mem REG  3,3         728579 1301648 /lib/libRSResource.so
DoubleTak 2454 root mem REG  3,3        1531064  682805 /lib/tls/libc-2.3.2.so
DoubleTak 2454 root 0u  CHR  1,3                  66759 /dev/null
DoubleTak 2454 root 1u  CHR  5,1                  65323 /dev/console
DoubleTak 2454 root 2u  CHR  5,1                  65323 /dev/console
DoubleTak 2454 root 3u  CHR  5,1                  65323 /dev/console
DoubleTak 2454 root 4uW REG  3,3              0  229150 /tmp/Double-Take
DoubleTak 2454 root 5u  REG  3,3           5750 1188475 /opt/NSI/dtlog1.dtl
DoubleTak 2454 root 6u  IPv4 3707                   UDP *:1575
DoubleTak 2454 root 7u  sock 0,0                   3691 can't identify protocol
DoubleTak 2454 root 8u  IPv4 3696                   UDP *:1578
DoubleTak 2454 root 9u  IPv4 3697                   UDP 169.254.1.247:1575
DoubleTak 2454 root 10u IPv4 3698                   UDP 169.254.1.247:32769
DoubleTak 2454 root 11u IPv4 3701                   UDP *:1578
DoubleTak 2454 root 12u IPv4 3702                   UDP 10.0.21.154:1575
DoubleTak 2454 root 13u IPv4 3703                   UDP 10.0.21.154:32770
DoubleTak 2454 root 14u IPv4 3711                   TCP *:1578 (LISTEN)
DoubleTak 2454 root 15u sock 0,0                   3714 can't identify protocol
DoubleTak 2454 root 16u sock 0,0                   3717 can't identify protocol
DoubleTak 2454 root 17u unix 0xd7443280            3719 socket
DoubleTak 2454 root 18u sock 0,0                   3721 can't identify protocol

rozeboom · 05-06-2004, 02:32 PM

This is a backup program which transmits changes to a backup server. The problem occurs with each connection I create. In the example I posted, handles 15, 16, & 18 display the problem. The speed with which I run out of resources depends on how many times I reconnect... in a test environment that can be many times an hour.

infamous41md · 05-06-2004, 02:44 PM

well, im baffled. all the output tells me, as u prolly know, is that the problem lies in the area of code right after creating the TCP *:1578 (LISTEN) socket, since descriptors are always assigned from next lowest open #.
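
The lowest-free-number rule is easy to check with a few lines. A sketch only: the helper name is made up, and it assumes a single-threaded process so that nothing else opens or closes descriptors in between:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>

// Hypothetical demo: returns true if a descriptor number freed by close()
// is the one handed out by the next open().
bool reuses_lowest_fd() {
    int a = open("/dev/null", O_RDONLY);
    int b = open("/dev/null", O_RDONLY);
    if (a < 0 || b < 0) {
        if (a >= 0) close(a);
        if (b >= 0) close(b);
        return false;
    }
    assert(a < b);  // b was opened while a was still in use, so it got a higher number
    close(a);       // a is now the lowest free number
    int c = open("/dev/null", O_RDONLY);
    bool same = (c == a);
    if (c >= 0) close(c);
    close(b);
    return same;
}
```

This is why the order of descriptor numbers in lsof gives a rough hint about the order in which they were created, as long as nothing was closed in between.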

rozeboom · 05-06-2004, 02:57 PM

Yeah, and the code of this program is complex enough that it's like a needle in a haystack. Even strace would only point me at a specific system call, but I was hoping that I could identify the code based on the parameters being passed, etc...

I suspect, as you mentioned earlier, that there are some internal calls which use sockets that may not be getting cleaned up. Thanks for your response. Perhaps someone else will come along who has my answer. I'm going to keep trying to recreate this with simpler code or try to hunt it down in some library code, or something...

Toonces7 · 08-10-2004, 11:58 PM

Okay, I don't know if anyone's still reading this, but I had exactly the same problem rozeboom describes here and I was able to fix it.

My problem was an incorrect call to socket(). This code was written by someone else but I had to debug it. The problem was that this code failed to recognize that the function accept() creates and opens the new socket itself, without any need to call socket().

So this code was calling accept() AND calling socket(), when it should not have called socket() for an fd obtained from accept(). FDs obtained from accept() are already created and inherit their socket parameters from the listening socket that accept() was called on.

In my particular case, the code's custom socket class was calling socket() from its constructor. It was doing this for sockets that had already been created via accept(), so two FDs were created and one of them never got cleaned up. My fix was to make a second version of the constructor for my socket class, one that takes an FD as a parameter. That constructor does NOT call socket(). I use this constructor to create socket objects that originate from the accept() call.
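
Roughly, the two-constructor idea looks like the sketch below. This is not the original code: the Socket class, serve_one, and the AF_INET/SOCK_STREAM parameters are placeholders made up for illustration.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>

// Hypothetical wrapper class illustrating the fix.
class Socket {
public:
    // For sockets the program creates itself (outgoing connections, listeners).
    Socket() : fd_(::socket(AF_INET, SOCK_STREAM, 0)) {}

    // For descriptors returned by accept(): adopt the fd, do NOT call socket().
    explicit Socket(int accepted_fd) : fd_(accepted_fd) {
        assert(accepted_fd >= 0);
    }

    // Non-copyable, so the descriptor is closed exactly once.
    Socket(const Socket&) = delete;
    Socket& operator=(const Socket&) = delete;

    ~Socket() {
        if (fd_ >= 0) ::close(fd_);
    }

    int fd() const { return fd_; }

private:
    int fd_;
};

// Accept one connection and wrap it without creating an extra descriptor.
bool serve_one(int listen_fd) {
    int client_fd = ::accept(listen_fd, nullptr, nullptr);
    if (client_fd < 0) return false;
    Socket client(client_fd);  // adopts client_fd; closed when client goes out of scope
    // ... talk to the peer through client.fd() ...
    return true;
}
```

With the buggy version, the constructor would call socket() as well, and that extra descriptor is never bound or connected, which would fit the "can't identify protocol" entries in lsof.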

It seems to work. Hope this is of use to someone out there

-Aaron

rozeboom · 08-11-2004, 09:19 AM

That sounds like it might help... the code I'm working with is also a C++ class for handling sockets and it might very well do exactly what you describe. Thanks for your response!