Parallel mpich2-based software does not run
Hi,
I'm trying to run some software parallelised through mpich2. The daemon (mpd) is running. I then launch the software:

    nohup mpiexec -n 8 /path/to/software/softwarename.ex > out 2> err &

When I check with "ps aux" I can see 8 copies of the software running and 8 listings of mpd. However, according to "top", these processes are running at 0% CPU, and indeed there is no output: it just hangs. This happens on some computers but not on others, so I can rule out a problem with the software itself. Does anyone know what might cause this? Thanks in advance.
Update
It seems that if I specify "-n 1" I can get it running on one processor, but with any more than that nothing happens. So maybe it's something to do with mpich2?
Another update
Running "strace" on it produces many lines of the following output:

    select(7, [4 5 6], [], [], {1, 0}) = 0 (Timeout)
    select(7, [4 5 6], [], [], {1, 0}) = 0 (Timeout)
    select(7, [4 5 6], [], [], {1, 0}) = 0 (Timeout)
    select(7, [4 5 6], [], [], {1, 0}) = 0 (Timeout)
    select(7, [4 5 6], [], [], {1, 0}) = 0 (Timeout)
    select(7, [4 5 6], [], [], {1, 0}) = 0 (Timeout)
    etc.

Does anyone know what's wrong? Until I fix this I can barely do any calculations, so I'm quite desperate to get it working. I've tried restarting the computer, and also re-installing mpich2 using shm instead of nemesis (since it's simply running on a multi-core computer), but the situation hasn't changed. Thank you.
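A note on reading the strace lines in this post: each "select(7, [4 5 6], [], [], {1, 0}) = 0 (Timeout)" is a call that waits up to one second for file descriptors 4, 5 and 6 to become readable, and returns 0 because none of them did. So the traced process is idle and waiting for a message that never arrives, which fits the 0% shown by "top". Below is a minimal Python sketch of that same wait-and-time-out pattern. The names poll_once and make_idle_pairs are made up for illustration, and the local socket pairs are only a stand-in for whatever mpich2 is actually waiting on; this does not show what mpich2 does internally.

```python
import select
import socket


def poll_once(socks, timeout=1.0):
    """Wait up to `timeout` seconds; return the sockets that have data to read."""
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


def make_idle_pairs(count=3):
    """Create `count` connected socket pairs; nothing has been written to them."""
    return [socket.socketpair() for _ in range(count)]


if __name__ == "__main__":
    pairs = make_idle_pairs()
    listeners = [a for a, _ in pairs]

    # Nobody writes to the peers, so every call times out,
    # like the repeated "= 0 (Timeout)" lines in the strace output.
    for _ in range(2):
        ready = poll_once(listeners, timeout=1.0)
        print("select ->", len(ready), "(Timeout)" if not ready else "(data)")

    # As soon as one peer sends something, select returns without waiting.
    pairs[0][1].sendall(b"x")
    ready = poll_once(listeners, timeout=1.0)
    print("select ->", len(ready), "(Timeout)" if not ready else "(data)")

    for a, b in pairs:
        a.close()
        b.close()
```

If the loop never leaves the timeout branch, the useful question is which peer was supposed to write and why it didn't, rather than why the waiting process is using no CPU.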
ssh?
If I remember correctly, mpich uses ssh to communicate with each node. Maybe ssh isn't enabled on some of your computers.
Thanks a lot for your suggestion, but in this case, ssh is working for the computer.
Additionally, since I'm running on a single multi-core machine, all copies of the software are spawned on the same computer, so ssh is not involved in this case.
A kind-of solution
For future reference: I worked around the problem by removing mpich2 and installing openmpi instead. Any software that was originally compiled for mpich2 has to be recompiled, but once that's done it works fine, and apparently without needing a daemon like the mpd that mpich2 required.