I have a cluster with one master node and 8 worker nodes. Normally I use a PBS script to dispatch processes from the master to the worker nodes; here I am talking about a single worker node. Each node has 32 cores, so I tried to send 32 concurrent processes from the master to one particular worker node, but only some of them actually run while the rest do not.
Just to check, I am running a program that prints a number after spinning through a sufficiently large loop.
Code:
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */

int main(int argc, char *argv[])
{
    long i, j = 0;
    for (i = 0; i < 1000000000; i++)
    {
        j++;
    }
    /* atoi() returns int, so the matching conversion is %d, not %ld */
    printf("\n%d", atoi(argv[1]));
    return 0;
}
Now, this is compiled into an executable named test_pbs.
My PBS script is as follows:
Code:
#!/bin/sh
#PBS -N nfs
#PBS -l nodes=8:ppn=32
#PBS -o "<path>/stdout.log"
#PBS -e "<path>/stderr.log"
echo "Starting the sieving..."
ssh <node_url> "<path>/test_pbs 0" &
ssh <node_url> "<path>/test_pbs 1" &
ssh <node_url> "<path>/test_pbs 2" &
ssh <node_url> "<path>/test_pbs 3" &
ssh <node_url> "<path>/test_pbs 4" &
ssh <node_url> "<path>/test_pbs 5" &
ssh <node_url> "<path>/test_pbs 6" &
ssh <node_url> "<path>/test_pbs 7" &
ssh <node_url> "<path>/test_pbs 8" &
ssh <node_url> "<path>/test_pbs 9" &
ssh <node_url> "<path>/test_pbs 10" &
ssh <node_url> "<path>/test_pbs 11" &
ssh <node_url> "<path>/test_pbs 12" &
ssh <node_url> "<path>/test_pbs 13" &
ssh <node_url> "<path>/test_pbs 14" &
ssh <node_url> "<path>/test_pbs 15" &
ssh <node_url> "<path>/test_pbs 16" &
ssh <node_url> "<path>/test_pbs 17" &
ssh <node_url> "<path>/test_pbs 18" &
ssh <node_url> "<path>/test_pbs 19"
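As an aside, the twenty ssh lines above can be collapsed into a loop. A minimal sketch of the fan-out pattern, with echo standing in for the ssh call (in the real script each job would be ssh <node_url> "<path>/test_pbs $n"):

```shell
#!/bin/sh
# launch 20 concurrent background jobs, then wait for all of them
for n in $(seq 0 19); do
    echo "job $n" &
done
wait
```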
Now, it should print 0-19 in the stdout.log file. But only some of the numbers are printed, and for the rest I get the line ssh_exchange_identification: read: Connection reset by peer in the stderr.log file. The cluster runs CentOS, and I have also checked /etc/security/limits.conf:
Code:
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means for example that setting a limit for wildcard domain here
#can be overriden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overriden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
# - a user name
# - a group name, with @group syntax
# - the wildcard *, for default entry
# - the wildcard %, can be also used with %group syntax,
# for maxlogin limit
#
#<type> can have the two values:
# - "soft" for enforcing the soft limits
# - "hard" for enforcing hard limits
#
#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open file descriptors
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
#
#<domain> <type> <item> <value>
#
#* soft core 0
#* hard rss 10000
#@student hard nproc 20
#@faculty soft nproc 20
#@faculty hard nproc 50
#ftp hard nproc 0
#@student - maxlogins 4
# End of file
* soft memlock unlimited
* hard memlock unlimited
But there is no hard-coded limit on the number of processes that can be spawned or on the maximum number of logins. I have also tried opening >= 20 concurrent connections to the node by hand, and that succeeds.
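One thing I have not yet ruled out (this is an assumption on my part, not something verified above) is sshd's own connection throttling: OpenSSH's MaxStartups setting caps the number of concurrent unauthenticated connections, and connections beyond that cap are dropped with exactly this ssh_exchange_identification error. That would also explain why slower, manual connections succeed. A sketch of how to inspect it on the node:

```shell
# show any explicit MaxStartups setting in the node's sshd config
# (if absent, the compiled-in default applies; on older OpenSSH it is
# quite low, which 20 near-simultaneous ssh calls could exceed)
grep -i 'MaxStartups' /etc/ssh/sshd_config

# on newer OpenSSH, the effective value can be dumped directly (as root):
sshd -T | grep -i maxstartups
```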
So, my question is: where does the issue arise?