I am trying to implement a distributed algorithm (minimum spanning tree) where each node is represented by a process running on a different machine. Each edge is represented by a client-server communication channel. Suppose host0 is connected to host1, host2 and host5. I create 1 server socket on each of host1, host2 and host5. Then I try to create 3 clients on host0 to connect to host1, host2 and host5. To do that I start, say, 3 threads on host0 to create the 3 sockets. Each client thread calls the following function to create its socket:

bsocket create_socket_and_connect(char* server_addr, int port)
{
	char log_msg[1000];
	sprintf(log_msg,"(%s)server: %s, port: %d\n",node_id, server_addr, port);
	write_to_log_file(debug_log_file, log_msg, &mutex_debug_log_file);

	bsocket sock;
	struct addrinfo hints, *servinfo, *p;
	int rv;
	int sockfd;

	memset(&hints, 0, sizeof hints);
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;

	// todo: make PORT user input or something like that
	char port_str[10];
	sprintf(port_str, "%d", port);
	if ((rv = getaddrinfo(server_addr, port_str, &hints, &servinfo)) != 0) {
		//fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rv));
		sprintf(log_msg, "(%s)client: addrinfo(%s, %d)\n",node_id,server_addr, port);
		write_to_log_file(debug_log_file, log_msg, &mutex_debug_log_file);
	}

	sprintf(log_msg, "(%s)(%s, %d) is ready to begin socketing\n",node_id, server_addr, port);
	write_to_log_file(debug_log_file, log_msg, &mutex_debug_log_file);

	// loop through all the results and connect to the first we can
	int is_connected = 0;
	//while (!is_connected)
	for(p = servinfo; p != NULL; p = p->ai_next) {
		if ((sockfd = socket(p->ai_family, p->ai_socktype,
				p->ai_protocol)) == -1) {
			char err_msg[100];
			sprintf(err_msg,"(%s)client: socket(%s, %d)",node_id, server_addr, port);
			write_to_log_file(debug_log_file, err_msg, &mutex_debug_log_file);
			random_sleep(5000, 60000);
		}

		if (connect(sockfd, p->ai_addr, p->ai_addrlen) == -1) {
			char err_msg[100];
			sprintf(err_msg,"(%s)client: connect(%s, %d)",node_id, server_addr, port);
			write_to_log_file(debug_log_file, err_msg, &mutex_debug_log_file);
			random_sleep(5000, 60000);
		}
		is_connected = 1;
	}
	sprintf(log_msg, "(%s) established socket: %d(%s %d)\n",node_id, sockfd, server_addr, port);
	write_to_log_file(debug_log_file, log_msg, &mutex_debug_log_file);
	sock.sockfd = sockfd;
	sock.servinfo = *p;
	return sock;
}

The output is this:

(host0)server: abc, port: 4002

(host0)(abc, 4002) is ready to begin socketing

(host0)client: connect(abc, 4002)

(host0)server: def, port: 4003

(host0)server: ghi, port: 4006

(host0)(def, 4003) is ready to begin socketing

(host0)client: connect(def, 4003)

(host0)(ghi, 4006) is ready to begin socketing

(host0)client: connect(ghi, 4006)

(host0) established socket: 11(abc 4002)

The problem is that only 1 socket is created successfully. No other sockets are created, and the other threads do not even log an error message, as if they were dead. If I turn this into a loop so that each thread keeps trying to acquire a socket until it succeeds, then no socket is created at all. I have added a random sleep of 5 to 60 seconds between iterations.

Are there any limitations on threads and sockets? Threads, unlike processes, share much of their state, such as file descriptors. Is it possible that only 1 socket descriptor can be created per process (no matter how many threads) and that all the threads are supposed to share it? Otherwise I have no plausible explanation for this. The nodes are connected to the same LAN. Is there also any likelihood that messages are getting lost? Even so, the other threads should show some error message while creating or connecting their sockets.

From what I see in your log, all sockets were created fine, yet none of them connected successfully (you print the client: connect message on connect failure). Now the question is why only one thread reaches line 58 (presumably the "established socket" log statement). Hard to tell; I would look closely for a deadlock in write_to_log_file.
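Since write_to_log_file is not shown, here is only a minimal sketch of what a deadlock-safe version might look like. It assumes the log is a FILE * and the mutex is a pthread_mutex_t, which the calls in the question suggest but do not confirm. The name write_to_log_file_sketch and the return-code convention are my own inventions. The point is simply that every path taken after the lock reaches the unlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for write_to_log_file; the real one is not shown. */
static int write_to_log_file_sketch(FILE *log, const char *msg,
                                    pthread_mutex_t *mutex)
{
    assert(log != NULL && msg != NULL && mutex != NULL);

    int rc = pthread_mutex_lock(mutex);
    if (rc != 0)
        return rc;              /* lock not taken, so nothing to release */

    int failed = (fputs(msg, log) == EOF);
    if (fflush(log) == EOF)     /* flush so the line survives a later crash */
        failed = 1;

    /* Single exit path: the mutex is released even if the write failed. */
    rc = pthread_mutex_unlock(mutex);
    if (rc != 0)
        return rc;
    return failed ? -1 : 0;
}
```

If the real function returns early on an I/O error without unlocking, the next thread to log would block forever, which would look much like threads that silently die.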

PS: It is always helpful to include strerror(errno) in the error message.
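To illustrate the PS, here is a rough sketch of a connect loop that records strerror(errno) for each failed call. The helper connect_first, its err buffer parameter and the message format are hypothetical, not taken from the original program. It also closes the descriptor after a failed connect and returns -1 when no address worked, so the caller never uses the result of a failed attempt.

```c
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns a connected descriptor, or -1 with the
 * reason for the last failure written into err. */
static int connect_first(const char *host, int port, char *err, size_t errlen)
{
    struct addrinfo hints, *servinfo, *p;
    char port_str[16];
    int sockfd = -1;
    int rv;

    assert(host != NULL && err != NULL && errlen > 0);

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    snprintf(port_str, sizeof port_str, "%d", port);

    rv = getaddrinfo(host, port_str, &hints, &servinfo);
    if (rv != 0) {
        /* getaddrinfo reports through its return value, not errno */
        snprintf(err, errlen, "getaddrinfo(%s, %d): %s",
                 host, port, gai_strerror(rv));
        return -1;
    }

    snprintf(err, errlen, "no usable address for %s:%d", host, port);
    for (p = servinfo; p != NULL; p = p->ai_next) {
        sockfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (sockfd == -1) {
            snprintf(err, errlen, "socket(%s, %d): %s",
                     host, port, strerror(errno));
            continue;               /* try the next address */
        }
        if (connect(sockfd, p->ai_addr, p->ai_addrlen) == 0)
            break;                  /* connected */

        int saved = errno;          /* close() may overwrite errno */
        close(sockfd);
        sockfd = -1;
        snprintf(err, errlen, "connect(%s, %d): %s",
                 host, port, strerror(saved));
    }

    freeaddrinfo(servinfo);
    return sockfd;                  /* -1 if every address failed */
}
```

A caller would log err whenever the result is -1. POSIX does not require strerror to be thread-safe, so in a threaded program strerror_r may be the safer choice.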
