I have a cluster with one master node and 8 worker nodes. Normally I use a PBS script to dispatch processes from the master to the worker nodes; here I am talking about a single worker node. Each node has 32 cores, so I tried to send 32 concurrent processes from the master to one particular worker node, but only some of them actually run while the rest do not.
Just to check, I am running a program that prints a number after spinning through a sufficiently large loop.
Code:
#include <stdio.h>
#include <stdlib.h>   /* for atoi() */

int main(int argc, char *argv[])
{
    long i, j = 0;
    for (i = 0; i < 1000000000; i++)
    {
        j++;
    }
    /* atoi() returns int, so the matching conversion is %d, not %ld */
    printf("\n%d", atoi(argv[1]));
    return 0;
}
Now, this is compiled into an executable named test_pbs.
My PBS script is as follows:
Code:
#!/bin/sh
#PBS -N nfs
#PBS -l nodes=8:ppn=32
#PBS -o "<path>/stdout.log"
#PBS -e "<path>/stderr.log"
echo "Starting the sieving..."
ssh <node_url> "<path>/test_pbs 0" &
ssh <node_url> "<path>/test_pbs 1" &
ssh <node_url> "<path>/test_pbs 2" &
ssh <node_url> "<path>/test_pbs 3" &
ssh <node_url> "<path>/test_pbs 4" &
ssh <node_url> "<path>/test_pbs 5" &
ssh <node_url> "<path>/test_pbs 6" &
ssh <node_url> "<path>/test_pbs 7" &
ssh <node_url> "<path>/test_pbs 8" &
ssh <node_url> "<path>/test_pbs 9" &
ssh <node_url> "<path>/test_pbs 10" &
ssh <node_url> "<path>/test_pbs 11" &
ssh <node_url> "<path>/test_pbs 12" &
ssh <node_url> "<path>/test_pbs 13" &
ssh <node_url> "<path>/test_pbs 14" &
ssh <node_url> "<path>/test_pbs 15" &
ssh <node_url> "<path>/test_pbs 16" &
ssh <node_url> "<path>/test_pbs 17" &
ssh <node_url> "<path>/test_pbs 18" &
ssh <node_url> "<path>/test_pbs 19"
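As an aside, the twenty ssh lines above can be collapsed into a loop. A minimal sketch of the fan-out pattern, with echo standing in for the ssh call (in the real script each job would be ssh <node_url> "<path>/test_pbs $n"):

```shell
#!/bin/sh
# launch 20 concurrent background jobs, then wait for all of them
for n in $(seq 0 19); do
    echo "job $n" &
done
wait
```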
Now, it should print 0-19 in the stdout.log file. But only some of the numbers are printed, and for the rest I get the line ssh_exchange_identification: read: Connection reset by peer in the stderr.log file. The cluster runs CentOS, and I have also checked /etc/security/limits.conf:
Code:
# /etc/security/limits.conf
#
#This file sets the resource limits for the users logged in via PAM.
#It does not affect resource limits of the system services.
#
#Also note that configuration files in /etc/security/limits.d directory,
#which are read in alphabetical order, override the settings in this
#file in case the domain is the same or more specific.
#That means for example that setting a limit for wildcard domain here
#can be overriden with a wildcard setting in a config file in the
#subdirectory, but a user specific setting here can be overriden only
#with a user specific setting in the subdirectory.
#
#Each line describes a limit for a user in the form:
#
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
# - a user name
# - a group name, with @group syntax
# - the wildcard *, for default entry
# - the wildcard %, can be also used with %group syntax,
# for maxlogin limit
#
#<type> can have the two values:
# - "soft" for enforcing the soft limits
# - "hard" for enforcing hard limits
#
#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open file descriptors
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
#
#<domain> <type> <item> <value>
#
#* soft core 0
#* hard rss 10000
#@student hard nproc 20
#@faculty soft nproc 20
#@faculty hard nproc 50
#ftp hard nproc 0
#@student - maxlogins 4
# End of file
* soft memlock unlimited
* hard memlock unlimited
But there is no hard-coded limit on the number of processes that can be spawned or on the maximum number of logins. I have also tried opening >= 20 concurrent connections to the node by hand, and that succeeds.
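One thing I have not yet ruled out (this is an assumption on my part, not something verified above) is sshd's own connection throttling: OpenSSH's MaxStartups setting caps the number of concurrent unauthenticated connections, and connections beyond that cap are dropped with exactly this ssh_exchange_identification error. That would also explain why slower, manual connections succeed. A sketch of how to inspect it on the node:

```shell
# show any explicit MaxStartups setting in the node's sshd config
# (if absent, the compiled-in default applies; on older OpenSSH it is
# quite low, which 20 near-simultaneous ssh calls could exceed)
grep -i 'MaxStartups' /etc/ssh/sshd_config

# on newer OpenSSH, the effective value can be dumped directly (as root):
sshd -T | grep -i maxstartups
```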
So, my question is: where does the issue arise?