LinuxQuestions.org


ta0kira 12-15-2008 11:14 AM

kernel hang on 'connect' to local socket
 
I have a program that serves as both a client and a server: a server instance runs as a daemon and places a local socket in a specific directory on the file system, and a client instance connects to the socket of a particular server instance for communication. A third mode lets the user list the servers he/she has access to. The program does this by scanning the socket directory and attempting to connect to each socket it finds. Sockets that aren't active are removed, and sockets that can't be connected to are ignored (that means another user owns the socket). This is a setuid program, but I've removed those parts for the example.

I've isolated the problem from a much larger program for this post. This part of the program has worked perfectly on my development machine, but since I installed the program on another computer it always hangs the kernel on the connect call. This only happens when the socket is one that can be connected to. I haven't actually tried connecting as a user that doesn't have access to a socket, but I don't want to hang the machine mid-post. I'll try it and post an edit. Here is the code:
Code:

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>


//returns -1 if the socket accepted a connection, -2 if it exists but
//couldn't be connected to, and 0 if it no longer exists
static int resolve_existing_entry(const char *nName)
{
    struct sockaddr_un new_address;
    socklen_t new_length = 0;

    int new_socket = socket(PF_LOCAL, SOCK_STREAM, 0);
    if (new_socket < 0) return -2;

    int current_state = fcntl(new_socket, F_GETFL);
    fcntl(new_socket, F_SETFL, current_state | O_NONBLOCK);

    //zero the address first so that sun_path is always null-terminated
    memset(&new_address, 0, sizeof new_address);
    new_address.sun_family = AF_LOCAL;
    strncpy(new_address.sun_path, nName, sizeof new_address.sun_path - 1);

    //SUN_LEN already includes the offset of sun_path
    new_length = SUN_LEN(&new_address);

    int connected = 0;

    //the line below causes a kernel hang only if connection is possible
    if (connect(new_socket, (struct sockaddr*) &new_address, new_length) < 0)
    {
        //only remove sockets that nothing is listening on; sockets that are
        //busy or that belong to another user (e.g. EACCES) are left alone
        if (errno == ECONNREFUSED) remove(nName);
    }
    else connected = 1;

    if (connected) shutdown(new_socket, SHUT_RDWR);
    close(new_socket);

    struct stat current_stats;
    return (stat(nName, &current_stats) >= 0)? ((connected)? -1 : -2) : 0;
}


//used as the scandir filter; the listing is printed as a side effect
static int show_table_entry(const struct dirent *eEntry)
{
    if (eEntry && eEntry->d_type == DT_SOCK)
    {
        int connected = resolve_existing_entry(eEntry->d_name);
        if (connected == -1) fprintf(stdout, "%s\n", eEntry->d_name);
        return 0;
    }

    else if (eEntry && eEntry->d_type == DT_DIR) return 0;

    else return 1;
}


int main()
{
    struct dirent **entries = NULL;

    //the filter does all of the work; scandir only keeps the other entries
    int total_matches = scandir(".", &entries, &show_table_entry, NULL);
    if (total_matches < 0)
    {
        perror("scandir");
        return 1;
    }

    //scandir allocates each entry and the array itself
    while (total_matches-- > 0) free(entries[total_matches]);
    free(entries);

    return 0;
}

It is very possible that the other end of the program is causing the hang, but I haven't had a chance to isolate that part of it. I'll try to isolate a part of that code to see. Until then, please tell me if you see anything unsafe or incorrect about my example code. Thank you.
ta0kira

edit:
It appears that the hang occurs only when the program has permission to connect; therefore, it's probably the server end. It doesn't appear to be a problem with the accept code, so I'm looking at a particular part where select is used.

ta0kira 12-15-2008 01:00 PM

It turns out the problem was with cross-thread synchronization on the server end. One thread runs select on the socket and another thread accepts; if there's nothing to accept, the accepting thread blocks on a pthread condition. I do this so that more than one thing can cause the accept thread to resume, which is why the select call isn't in that thread. When the select thread sees that the socket is readable, it broadcasts on the pthread condition, which causes the accept thread to resume and accept the connection. What apparently was happening on this machine was that the select thread got all the way back to select before the accept thread had called accept, so it went around another iteration. For some reason this caused a hang, but I can't work out why (eventually accept would have cleared the read availability). Adding a nanosleep just after the condition broadcast took care of it. I'll keep looking at it.
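
For anyone reading along, here is a minimal sketch of one way to close that window without depending on nanosleep timing: the select thread waits on the same condition until the accept thread reports that it has consumed the connection. This is not the server's actual code. The struct and the handoff_* names are hypothetical, and it assumes a single select thread and a single accept thread.

```c
#include <pthread.h>

/* Hypothetical shared state between the select thread and the accept thread. */
struct handoff
{
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             pending;  /* 1 while a connection is waiting for accept */
};

#define HANDOFF_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* select thread: call when select reports the listening socket readable */
static void handoff_announce(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    h->pending = 1;
    pthread_cond_broadcast(&h->changed);
    pthread_mutex_unlock(&h->lock);
}

/* select thread: call before going back to select, so that the same
   connection isn't reported a second time */
static void handoff_wait_consumed(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    while (h->pending) pthread_cond_wait(&h->changed, &h->lock);
    pthread_mutex_unlock(&h->lock);
}

/* accept thread: block until a connection has been announced */
static void handoff_wait_pending(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    while (!h->pending) pthread_cond_wait(&h->changed, &h->lock);
    pthread_mutex_unlock(&h->lock);
}

/* accept thread: call right after accept returns */
static void handoff_consumed(struct handoff *h)
{
    pthread_mutex_lock(&h->lock);
    h->pending = 0;
    pthread_cond_broadcast(&h->changed);
    pthread_mutex_unlock(&h->lock);
}
```

The select thread would call handoff_announce and then handoff_wait_consumed; the accept thread would call handoff_wait_pending, then accept, then handoff_consumed. Because both waits re-check the flag in a loop, spurious wake-ups and broadcasts from other sources are harmless.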
ta0kira

PS It appears to be something extremely subtle, probably sensitive to a single-processor machine. I'm sure it had to do with an implicit signed/unsigned conversion somewhere, but it's fixed now.

ta0kira 12-16-2008 03:47 AM

I actually found the problem in the most obscure place: a source file that's about 20 lines long (in a completely different program). It was a technique I use to break a blocked select call by placing the read end of a dummy pipe in the "read" fd_set. To break the select, I write a byte, wait about 1ms, then read it back. What seems to have happened was that something else read the byte first, so my read blocked (I don't understand the hang, though). I changed the read end to non-blocking and it works perfectly (as it did on my main machine). I'm posting this in case anyone else has this extremely obscure problem. That was a waste of 10 hours...
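
A minimal sketch of that dummy-pipe technique with a non-blocking read end, for illustration only. It is not the 20-line source file itself; the wake_pipe_* names are hypothetical, and it makes both ends non-blocking so that neither the write nor the read can stall.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor to non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create the dummy pipe: fds[0] goes in the "read" fd_set, fds[1] wakes it. */
static int wake_pipe_open(int fds[2])
{
    if (pipe(fds) < 0) return -1;
    if (set_nonblocking(fds[0]) < 0 || set_nonblocking(fds[1]) < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    return 0;
}

/* Break a blocked select by writing one byte to the pipe. */
static int wake_pipe_signal(int write_fd)
{
    char byte = 0;
    ssize_t written = write(write_fd, &byte, 1);
    /* EAGAIN means the pipe is full, so a wake-up is already pending */
    return (written == 1 || (written < 0 && errno == EAGAIN))? 0 : -1;
}

/* Discard whatever is in the pipe; returns the number of bytes discarded.
   It can't block: if something else already read the byte, read fails
   with EAGAIN and the loop ends. */
static int wake_pipe_drain(int read_fd)
{
    char buffer[64];
    int total = 0;
    ssize_t count;
    while ((count = read(read_fd, buffer, sizeof buffer)) > 0) total += (int) count;
    return total;
}

/* Returns 1 if the pipe becomes readable within timeout_ms, 0 on timeout,
   and -1 on error. */
static int wake_pipe_wait(int read_fd, int timeout_ms)
{
    fd_set readable;
    struct timeval timeout;

    FD_ZERO(&readable);
    FD_SET(read_fd, &readable);
    timeout.tv_sec  = timeout_ms / 1000;
    timeout.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(read_fd + 1, &readable, NULL, NULL, &timeout);
    if (ready < 0) return -1;
    return (ready > 0 && FD_ISSET(read_fd, &readable))? 1 : 0;
}
```

In a real program the pipe's read end would sit in the same fd_set as the other descriptors; wake_pipe_wait only watches the pipe so that the sketch stays short. Draining from the thread that runs select, right after select returns, also removes the need for the 1ms wait.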
ta0kira

PS I stopped the hang by setting ulimit -t 3 in the terminal so the kernel killed the process after 3 seconds of hogging the processor. That let me debug it.
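
The same safety net can be set from inside the program while debugging. This is a rough sketch using setrlimit; limit_cpu_seconds is a hypothetical helper name. Unlike ulimit -t in the shell, it only lowers the soft limit, so the process receives SIGXCPU (which terminates it by default) once it has used that much CPU time.

```c
#include <sys/resource.h>

/* Hypothetical helper: cap the CPU time this process may use, in seconds. */
static int limit_cpu_seconds(rlim_t seconds)
{
    struct rlimit limit;
    if (getrlimit(RLIMIT_CPU, &limit) < 0) return -1;

    /* the soft limit can't be set above the hard limit */
    if (limit.rlim_max != RLIM_INFINITY && seconds > limit.rlim_max)
        seconds = limit.rlim_max;

    limit.rlim_cur = seconds;
    return setrlimit(RLIMIT_CPU, &limit);
}
```

Calling limit_cpu_seconds(3) near the top of main would roughly match the ulimit -t 3 used above. The limit counts CPU time, not wall-clock time, so it only helps when the process is spinning rather than sleeping.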

