LinuxQuestions.org
12-15-2008, 11:14 AM   #1
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

kernel hang on 'connect' to local socket


I have a program that serves as both a client and a server: a server instance runs as a daemon and places a local socket in a specific directory on the file system, and a client instance connects to the socket of a particular server instance for communication. A third mode lets the user list the servers he/she has access to. The program does this by scanning the socket directory and attempting to connect to each socket present. Sockets that aren't active are removed, and sockets that can't be connected to are ignored (that means another user owns the socket). This is a setuid program, but I've removed those parts for the example.

I've isolated the problem from a much larger program for this post. This part of the program has worked perfectly on my development machine, but since I installed the program on another computer it always crashes the kernel on the connect call. This only happens when the socket is one that can be connected to. I actually haven't tried connecting as a user that doesn't have access to a socket, but I don't want to crash my kernel mid-post. I'll try it and post an edit. Here is the code:
Code:
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <errno.h>
#include <stddef.h>
#include <signal.h>


static int resolve_existing_entry(const char *nName)
{
    struct sockaddr_un new_address;
    size_t new_length = 0;

    int new_socket = socket(PF_LOCAL, SOCK_STREAM, 0);
    if (new_socket < 0) return -1;

    //non-blocking so that 'connect' returns immediately
    int current_state = fcntl(new_socket, F_GETFL);
    fcntl(new_socket, F_SETFL, current_state | O_NONBLOCK);

    //zero the address so that 'sun_path' is always null-terminated
    memset(&new_address, 0, sizeof new_address);
    new_address.sun_family = AF_LOCAL;
    strncpy(new_address.sun_path, nName, sizeof new_address.sun_path - 1);

    //'SUN_LEN' already includes the offset of 'sun_path'
    new_length = SUN_LEN(&new_address);

    int connected = 0;

    //the line below causes a kernel hang only if connection is possible
    if (connect(new_socket, (struct sockaddr*) &new_address, new_length) < 0)
    {
        //on any error other than a pending connection, try to remove the socket file
        if (errno != EINPROGRESS && errno != EALREADY) remove(nName);
    }
    else connected = 1;

    shutdown(new_socket, SHUT_RDWR);
    close(new_socket); //release the descriptor; 'shutdown' alone leaks it

    //-1: connected, -2: file remains but no connection, 0: file is gone
    struct stat current_stats;
    return (stat(nName, &current_stats) >= 0)? ((connected)? -1 : -2) : 0;
}


static int show_table_entry(const struct dirent *eEntry)
{
    if (eEntry && eEntry->d_type == DT_SOCK)
    {
        int connected = resolve_existing_entry(eEntry->d_name);
        if (connected == -1) fprintf(stdout, "%s\n", eEntry->d_name);
        return 0;
    }

    else if (eEntry && eEntry->d_type == DT_DIR) return 0;

    else return 1;
}


int main()
{
    struct dirent **entries = NULL, **current = NULL;

    int total_matches = scandir(".", &entries, &show_table_entry, NULL);

    return 0;
}
It is very possible that the other end of the program is causing the hang, but I haven't had a chance to isolate that part of it. I'll try to isolate a part of that code to see. Until then, please tell me if you see anything unsafe or incorrect about my example code. Thank you.
ta0kira

edit:
It appears that the hang occurs only when the program has permission to connect; therefore, it's probably the server end. The accept code doesn't appear to be the problem, so I'm looking at a particular part where select is used.

Last edited by ta0kira; 12-15-2008 at 12:06 PM.
 
12-15-2008, 01:00 PM   #2
ta0kira
It turns out the problem was with cross-thread synchronization on the server end. The setup is that one thread runs select on the socket and another thread accepts; if there's nothing to accept, the accepting thread blocks on a pthread condition. I do this so that more than one thing can cause the accept thread to resume, which is why the select call isn't in that thread. When the select thread sees that the socket is readable, it broadcasts on the pthread condition, causing the accept thread to resume and accept the connection.

What apparently was happening on this machine was that the select thread got all the way back to select before the accept thread executed accept, so it went around for another iteration. For some reason this caused a hang, but I can't work out why (eventually accept would have cleared the read availability). Adding a nanosleep just after the condition broadcast took care of it. I'll keep taking a look at it.
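
For anyone following along, here is a rough sketch of one way to keep a select thread from looping back to select before the accept thread has run, without relying on a sleep. It is not the code from the program described above: the struct and the gate_* function names are made up for illustration, and it assumes a single event source, whereas the real program lets several things wake the accept thread. The idea is a flag guarded by the mutex, so both threads wait on a condition with an explicit predicate.
Code:
```c
#include <pthread.h>

/* Hypothetical names throughout; none of this is from the original program. */
struct accept_gate
{
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             pending;  /* nonzero while a connection waits for accept() */
};

#define ACCEPT_GATE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* select thread: record that the listening socket is readable */
void gate_post(struct accept_gate *gate)
{
    pthread_mutex_lock(&gate->lock);
    gate->pending = 1;
    pthread_cond_broadcast(&gate->changed);
    pthread_mutex_unlock(&gate->lock);
}

/* select thread: block until the accept thread has handled the event */
void gate_wait_handled(struct accept_gate *gate)
{
    pthread_mutex_lock(&gate->lock);
    while (gate->pending) pthread_cond_wait(&gate->changed, &gate->lock);
    pthread_mutex_unlock(&gate->lock);
}

/* accept thread: block until there is something to accept */
void gate_take(struct accept_gate *gate)
{
    pthread_mutex_lock(&gate->lock);
    while (!gate->pending) pthread_cond_wait(&gate->changed, &gate->lock);
    pthread_mutex_unlock(&gate->lock);
}

/* accept thread: call after accept() returns, releasing the select thread */
void gate_finish(struct accept_gate *gate)
{
    pthread_mutex_lock(&gate->lock);
    gate->pending = 0;
    pthread_cond_broadcast(&gate->changed);
    pthread_mutex_unlock(&gate->lock);
}
```
The select thread would call gate_post and then gate_wait_handled after each readable event, and the accept thread would call gate_take, then accept, then gate_finish. Because the flag is checked under the mutex, a broadcast that arrives before the other thread starts waiting is not lost.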
ta0kira

PS It appears to be something extremely subtle, probably sensitive to a single-processor machine. I'm sure it had to do with an implicit signed/unsigned conversion somewhere, but it's fixed now.

Last edited by ta0kira; 12-16-2008 at 02:04 AM.
 
12-16-2008, 03:47 AM   #3
ta0kira
I actually found the problem in the most obscure place: a source file that's about 20 lines long (in a completely different program). It was a technique I use to break a blocked select call by placing the read end of a dummy pipe in the "read" fd_set. To break the select, I write a byte, wait about 1ms, then read it back. What seems to have happened is that something else read the byte first, so my read blocked (I don't understand the hang, though). I changed the read end to non-blocking and it works perfectly (as it did on my main machine). I'm posting this in case anyone else has this extremely obscure problem. That was a waste of 10 hours...
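
A minimal sketch of the pipe technique with a non-blocking read end, for anyone who wants to see it in code. This is not the 20-line source file mentioned above; the wake_pipe_* names are hypothetical. It assumes the read end of the pipe is placed in the "read" fd_set passed to select. Draining in a loop until read reports nothing left means it doesn't matter whether another reader already consumed the byte.
Code:
```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper names; not from the original program. */

/* Create the wake-up pipe with both ends non-blocking. Returns 0 or -1. */
int wake_pipe_open(int fds[2])
{
    if (pipe(fds) < 0) return -1;
    for (int i = 0; i < 2; i++)
    {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
        {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    return 0;
}

/* Wake a thread blocked in select() on the read end. Returns 0 or -1. */
int wake_pipe_signal(int write_fd)
{
    char byte = 0;
    ssize_t n;
    do n = write(write_fd, &byte, 1);
    while (n < 0 && errno == EINTR);
    /* EAGAIN means the pipe is full, so a wake-up is already pending. */
    return (n == 1 || (n < 0 && errno == EAGAIN)) ? 0 : -1;
}

/* Discard pending wake-up bytes; returns how many were read, never blocks. */
int wake_pipe_drain(int read_fd)
{
    char buffer[64];
    int total = 0;
    for (;;)
    {
        ssize_t n = read(read_fd, buffer, sizeof buffer);
        if (n > 0) { total += (int) n; continue; }
        if (n < 0 && errno == EINTR) continue;
        break; /* EAGAIN (empty), end of file, or a real error */
    }
    return total;
}
```
The thread that wants to break the select calls wake_pipe_signal on the write end; whichever thread returns from select calls wake_pipe_drain on the read end before going back into select.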
ta0kira

PS I stopped the hang by setting ulimit -t 3 in the terminal so the kernel killed the process after 3 seconds of hogging the processor. That let me debug it.
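
The same limit can be set from inside a program: ulimit -t adjusts the RLIMIT_CPU resource limit, which setrlimit also controls. The sketch below is only an illustration with a made-up function name (limit_cpu_seconds). It lowers the soft limit only, and once a process exceeds its soft CPU limit it is sent SIGXCPU, which terminates it by default.
Code:
```c
#include <sys/resource.h>

/* Hypothetical helper: cap this process's CPU time at 'seconds'. */
int limit_cpu_seconds(rlim_t seconds)
{
    struct rlimit limit;
    if (getrlimit(RLIMIT_CPU, &limit) < 0) return -1;

    /* Lower only the soft limit, and keep it within the existing hard limit. */
    if (limit.rlim_max != RLIM_INFINITY && seconds > limit.rlim_max)
        seconds = limit.rlim_max;
    limit.rlim_cur = seconds;

    return setrlimit(RLIMIT_CPU, &limit);
}
```
Calling limit_cpu_seconds(3) early in main would give roughly the same safety net as running ulimit -t 3 in the terminal first, except that the shell command normally sets the hard limit as well.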

Last edited by ta0kira; 12-16-2008 at 03:57 AM.
 
  


All times are GMT -5.
