Linux - Newbie
This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Hi, I have six machines in a Beowulf cluster. Each machine can ssh to the others happily enough, and NFS works well too (as far as I can tell, that is). The test programs I am using are the standard ones (Hello_World etc). They compile correctly and can be run with mpirun on any one machine. When I use -np 3 on its own, the system runs as if there is only one machine, as would be expected. Likewise, when I use -host with only one hostname it works OK, and so does -host together with -np # on one machine. As I add machine names to -host, though, things get awkward. In general, I can always connect two machines, and usually three. With more than that the program stops. It usually complains about the earliest machine in the list (but not always; sometimes it's another), and the complaints vary. So:
mpirun ex1 : no problem
mpirun -host anymachine ex1 : no problem
mpirun -host anymachine -np 1 ex1 : no problem
mpirun -host master,slave1 -np 2 ex1 : no problem, and OK without the -np too.
mpirun -host master,slave1,slave2 -np 3 ex1 : often fails and complains about e.g. master even if I have built the sequence up exactly like this.
Try mpirun -host slave1,slave2,master -np 3 ex1 and slave1 becomes the problem.
Try mpirun -host slave1,slave2 -np 3 ex1 and it works ok again.
I cannot see any pattern other than that it tends to be the earlier machines in the list which fail, though the later ones occasionally have too. The errors offered as a cause vary as well (tonight's flavour is 'Host key verification').
Hi. I noticed this didn't get any replies. I'm not familiar with any of this, but maybe I can offer some general suggestions. (Also, I'm kinda new here; I hope that bumping the "reply count" above zero doesn't distract others from looking at it!)
Reproducibility and maximum specific detail (the lowest level at which the failure shows up) are key to resolving failures.
Maybe there's something like strace -f, or logs (maybe remote, on slaveN), that could capture the failure way down at the lowest command or even syscall level. Can you get any lower-level thing that mpirun does (like maybe just ssh) to fail on its own?
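One way to try the "can you get just ssh to fail" idea is to test every pair of machines with ssh in non-interactive mode. This is only a sketch: the host list is a guess based on the names in the first post, the function names check_pair and check_all are made up for illustration, and it assumes key-based logins are already set up. BatchMode=yes makes ssh fail instead of stopping to ask a question (such as whether to accept a new host key), which is closer to how ssh behaves when another program starts it than a login typed by hand.

```shell
#!/bin/sh
# Hypothetical host names, based on the examples in the first post.
HOSTS="master slave1 slave2"

# check_pair FROM TO: log in to FROM and, from there, try a
# non-interactive login to TO. Prints whatever ssh prints and
# returns ssh's exit status.
check_pair() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
        "ssh -o BatchMode=yes -o ConnectTimeout=5 $2 true" 2>&1
}

# check_all: try every ordered pair of distinct hosts and print
# one OK or FAIL line per pair.
check_all() {
    for from in $HOSTS; do
        for to in $HOSTS; do
            [ "$from" = "$to" ] && continue
            if out=$(check_pair "$from" "$to"); then
                echo "OK   $from -> $to"
            else
                echo "FAIL $from -> $to: $out"
            fi
        done
    done
}
```

After sourcing the file, run check_all on the machine you normally start mpirun from. Any FAIL line names a pair worth trying by hand. Pairs between slaves are included in case the launcher starts some remote daemons from nodes other than the first one; I am not sure whether it does in this setup. If every pair passes, running the failing mpirun line under strace -f -o mpirun.trace and searching the trace file for execve calls that mention ssh should show the exact ssh command being run.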
Also, more details on the errors would help: maybe you could make a script that does something like 100 or 1000 mpiruns, each to a (maybe random!) sequence of 3-6 machines, and collects, sorts, and counts the various errors.
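Such a counting script could look roughly like the following. Treat it as a sketch under assumptions: slave3 to slave5 are placeholder names (only master, slave1 and slave2 appear in the first post), ex1 is the test program from the first post, shuf from GNU coreutils is assumed to be installed, and it assumes mpirun exits non-zero when a launch fails. The function names pick_hosts, run_trials and summarize are made up for illustration.

```shell
#!/bin/sh
# Placeholder host names; replace with the real six machines.
HOSTS="master slave1 slave2 slave3 slave4 slave5"
RUNS=100

# pick_hosts N: print N distinct hosts in random order, comma-separated.
pick_hosts() {
    # $HOSTS is deliberately unquoted so it splits into one host per line.
    printf '%s\n' $HOSTS | shuf -n "$1" | paste -sd, -
}

# run_trials: run mpirun RUNS times on random lists of 3 to 6 hosts.
# Prints "OK" for a successful run, otherwise the first host in the
# list followed by the first line of mpirun's output.
run_trials() {
    i=0
    while [ "$i" -lt "$RUNS" ]; do
        n=$(shuf -i 3-6 -n 1)
        list=$(pick_hosts "$n")
        if out=$(mpirun -host "$list" -np "$n" ex1 2>&1); then
            echo "OK"
        else
            echo "FAIL first=${list%%,*} $(printf '%s\n' "$out" | head -n 1)"
        fi
        i=$((i + 1))
    done
}

# summarize: count identical result lines, most frequent first.
summarize() {
    sort | uniq -c | sort -rn
}
```

Running run_trials | summarize would then give a table of how often each first host and error message turns up. The first line of output may not be the most informative one, so keeping the full output of failed runs in a file is probably worth adding.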
The more 'clues', the better...
You can also "Report" your post, to get a Moderator to move it to a more specialized forum (like Server [Cloud?]) (note well: do *NOT* create a new DUPLICATE post!)
Thanks for the reply, Jjanel. I'm sorry I haven't replied sooner; my internet has been down for the best part of a week. In the meantime I tried reinstalling openmpi, but no joy. SSH and NFS seem solid enough in their own right, working properly etc. I've also tried timing the running of a program on a single machine, using two cores, all as master, all as separate machines, with and without parallel set. It takes the same time each time (more or less; the number of runs in each condition wasn't great). I've looked at the error log and it seems to be an SSH connection problem, but I can't reproduce it just using SSH between machines. I'll find it eventually ...
Many thanks for your help anyway. I've got a couple more ideas, but if they don't work I'll do as you say and pass the problem to another part of the forum.