ulimit -s 40960 vs ulimit -s 10240
I am writing because I have been running mpirun (Open MPI)
on my 12-core workstation quite happily since the day I set up the system a few months ago.
Yesterday, when I tried to run a big job under mpirun, the job crashed
rather quickly. The error message was something like
mpirun process exited blah blah with signal 11 (Segmentation fault).
Interestingly (or annoyingly), a job that required less memory ran okay.
Since I had never had this problem before, I thought it was a hardware
failure. I called my IT guy to explain the problem, and he was kind
enough to suggest putting the line
ulimit -s 40960
in my .bashrc.
And it works!
But I have no clue why mpirun misbehaved all of a sudden, or why that
ulimit setting solves the problem completely. I would like to learn
from this incident.
Does anyone have any ideas to share? Thanks a lot!
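For reference, here is a rough sketch of how the stack limit that line changes can be inspected and raised by hand. It assumes bash, where `ulimit -s` reports the limit in KiB (so 40960 is 40 MiB). The helper name `raise_stack_limit` is made up for illustration; it is not part of Open MPI or bash.

```shell
# Show the current soft and hard stack limits (KiB, or "unlimited").
echo "soft stack limit: $(ulimit -S -s)"
echo "hard stack limit: $(ulimit -H -s)"

# Hypothetical helper: raise the soft stack limit to the requested
# value (in KiB) only if the hard limit allows it.
raise_stack_limit() {
    want=$1
    hard=$(ulimit -H -s)
    if [ "$hard" = "unlimited" ] || [ "$hard" -ge "$want" ]; then
        ulimit -S -s "$want"
    else
        echo "hard limit ($hard KiB) is below requested $want KiB" >&2
        return 1
    fi
}
```

Calling `raise_stack_limit 40960` in the shell that launches mpirun should have the same effect as the .bashrc line for that session only, which makes it easy to compare a run with and without the larger stack.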
Okay, my happiness was short-lived.
I still hit the
mpirun noticed that process rank 3 with PID 11591 on node xxx-node exited on signal 11 (Segmentation fault).
problem when I tried a big job.
I strongly suspect that this has to do with a huge job that crashed the day before
when the disk space ran out. Could it be that the crashed job dumped something into
commonly used space and it was not cleared in time for new jobs to use?
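One way to test the disk-space suspicion is to look at how full the relevant filesystems are. This is only a sketch: the helper name `fs_usage_pct` is made up, and checking /tmp and the home directory is an assumption about where a crashed job might have left files.

```shell
# Hypothetical helper: print the usage percentage (without the % sign)
# of the filesystem that holds the given path, using POSIX df output.
fs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Assumed locations; adjust to wherever the jobs actually write.
for dir in /tmp "$HOME"; do
    echo "$dir: $(fs_usage_pct "$dir")% used"
done
```

If one of these is at or near 100%, leftover files from the crashed job would be the first thing to look for there.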