LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 07-12-2016, 04:35 PM   #1
sailschooner
LQ Newbie
 
Registered: Jun 2015
Posts: 28

Rep: Reputation: Disabled
openmpi haywire!


Hi, I have six machines in a Beowulf cluster. Each machine can ssh to the others happily enough, and NFS works well too (as far as I can tell, that is). The test programs I am using are the standard ones (Hello_World etc.). They compiled correctly and can be run with mpirun on any one machine. When I use -np 3, the system runs as if there is only one machine, as would be expected. Likewise, when I use -host with only one hostname it works OK, and so does -host together with -np # on one machine. As I add machine names to -host, though, things get awkward. In general, I can always connect two machines, and usually three. With more than that, the program stops. It usually complains about the earliest machine in the list (but not always; sometimes it's another), and the complaints vary. So:

mpirun ex1 : no problem
mpirun -host anymachine ex1 : no problem
mpirun -host anymachine -np 1 ex1 : no problem
mpirun -host master,slave1 -np 2 ex1 : no problem, and OK without the -np too.

mpirun -host master,slave1,slave2 -np 3 ex1 : often fails and complains about e.g. master even if I have built the sequence up exactly like this.

Try mpirun -host slave1,slave2,master -np 3 ex1 and slave1 becomes the problem.

Try mpirun -host slave1,slave2 -np 3 ex1 and it works ok again.

I cannot see any pattern other than that it tends to be the earlier machines in the list that fail, though the later ones occasionally have too. The errors offered as a cause vary as well (tonight's flavour is 'Host key verification').

Help gratefully received. Many thanks, Adrian
 
Old 07-16-2016, 04:19 AM   #2
Jjanel
Member
 
Registered: Jun 2016
Distribution: any&all, in VBox; Ol'UnixCLI; NO GUI resources
Posts: 999
Blog Entries: 12

Rep: Reputation: 363Reputation: 363Reputation: 363Reputation: 363
Hi. I noticed this didn't get any replies. I'm not familiar with any of this, but maybe I can offer some general suggestions. (Also, I'm kinda new here; I hope that bumping the reply count above zero doesn't distract others from looking at it!)

Reproducibility and maximum specific detail (the lowest level of failure) are key to resolving failures.

Maybe there's something like strace -f, or logs (maybe remote ones, on slaveN), that could capture the failure way down at the lowest command or even syscall level. Can you get any lower-level thing that mpirun does (like maybe just ssh) to fail?
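
One way to try the "just ssh" idea is sketched here. It is a minimal sketch, assuming passwordless ssh is the intended setup and that the host names are the ones from the first post; the function name check_ssh_pairs and the SSH override variable are made up for illustration. It logs in to each host and, from there, attempts a second non-interactive hop to every other host. That might expose a host pair whose key is missing from known_hosts even though ssh typed at a terminal looks fine. Whether mpirun really makes such node-to-node hops depends on the Open MPI version and settings, so treat this as a diagnostic guess, not a known cause.

```shell
#!/bin/sh
# Sketch only: check_ssh_pairs and the SSH override are illustrative names.

# check_ssh_pairs HOST...
# From every host, try a non-interactive ssh hop to every other host.
check_ssh_pairs() {
    ssh_cmd=${SSH:-ssh}
    # BatchMode=yes makes ssh fail instead of prompting for a password
    # or asking whether to accept an unknown host key.
    opts="-o BatchMode=yes -o ConnectTimeout=5"
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] && continue
            # First hop: this machine -> $src. Second hop: $src -> $dst.
            if $ssh_cmd $opts "$src" ssh $opts "$dst" true </dev/null >/dev/null 2>&1; then
                echo "OK   $src -> $dst"
            else
                echo "FAIL $src -> $dst"
            fi
        done
    done
}
```

For example, `check_ssh_pairs master slave1 slave2` prints one OK or FAIL line per ordered pair. Re-running a failing hop by hand without the output redirection should show ssh's own error text.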

Also, more details on the errors would help: maybe you could make a script that does something like 100 or 1000 mpiruns, each to a (maybe random!) sequence of 3-6 machines, and collects, sorts, and counts the various errors.
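
A rough sketch of such a script follows, under several assumptions: ex1 is the test program from the first post, shuf from GNU coreutils is installed, and mpirun exits with a non-zero status when a launch fails. The names pick_hosts and stress_mpirun, and the MPIRUN override variable, are invented for illustration.

```shell
#!/bin/sh
# Sketch only: pick_hosts, stress_mpirun and the MPIRUN override are
# illustrative names; ex1 is the test program from the first post.
# Requires shuf (GNU coreutils).

# Print a random selection of 3 to 6 of the given hosts, comma-separated.
pick_hosts() {
    n=$(shuf -i 3-6 -n 1)
    shuf -e -n "$n" "$@" | paste -sd, -
}

# stress_mpirun RUNS HOST...
# Prints one tab-separated line per run: status, host list, first error line.
stress_mpirun() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        hosts=$(pick_hosts "$@")
        # "2>&1 >/dev/null" captures stderr only and discards normal output.
        if err=$(${MPIRUN:-mpirun} -host "$hosts" ex1 2>&1 >/dev/null </dev/null); then
            printf 'OK\t%s\t\n' "$hosts"
        else
            # Keep the first non-blank line of the error text as the summary.
            msg=$(printf '%s\n' "$err" | sed -n '/./{p;q;}')
            printf 'FAIL\t%s\t%s\n' "$hosts" "$msg"
        fi
        i=$((i + 1))
    done
}
```

Calling stress_mpirun with 100 followed by all six host names, and redirecting the output to a file such as runs.tsv, gives one line per run. Then `cut -f1,3 runs.tsv | sort | uniq -c | sort -rn` counts how often each error appears, and the second column shows which host orderings were involved. If a run hangs instead of exiting, wrapping the mpirun call in the coreutils timeout command may help.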

The more 'clues', the better...

You can also "Report" your post, to get a Moderator to move it to a more specialized forum (like Server [Cloud?]) (note well: do *NOT* create a new DUPLICATE post!)

Good Luck / Best wishes.
 
Old 07-24-2016, 05:10 AM   #3
sailschooner
LQ Newbie
 
Registered: Jun 2015
Posts: 28

Original Poster
Rep: Reputation: Disabled
Thanks for the reply, Jjanel. I'm sorry I haven't replied sooner; my internet has been down for the best part of a week. In the meantime I tried reinstalling openmpi, but no joy. SSH and NFS seem solid enough in their own right, working properly etc. I've also tried timing the running of a program on a single machine, using two cores, all as master, all as separate machines, with and without parallel set. It takes more or less the same time each time (the number of runs in each condition wasn't great). I've looked at the error log and it seems to be an SSH connection problem, but I can't reproduce it just using SSH between machines. I'll find it eventually ...

Many thanks for your help anyway. I've got a couple more ideas, but if they don't work I'll do as you say and pass the problem to another part of the forum.

Adrian
 
  


All times are GMT -5. The time now is 05:48 AM.
