I am running on a cluster, and a job on one of the nodes is hung and causing problems. But if I
rsh over to the node, I can't see or kill the job.
The offending job is on node2. Running from the master node, I can see it using rsh like this:
Code:
master>rsh node2 ps
PID TTY TIME CMD
15413 ? 1-03:09:36 MYPROG
20770 ? 00:00:00 ps
master>
Process 15413 will not quit! I have no idea what it is or why it started.
But if I rsh over to node2 and try ps, then the job isn't there:
Code:
master>rsh node2
Last login: Thu Jun 6 21:59:36 EDT 2013
node2>ps
PID TTY TIME CMD
20856 pts/0 00:00:00 bash
20867 pts/0 00:00:00 ps
node2>
The job is not listed, so there is no way for me to kill it.
Next, I go back to master, and the job appears again:
Code:
node2>logout
rlogin: connection closed.
master>rsh node2 ps
PID TTY TIME CMD
15413 ? 1-03:09:36 MYPROG
20897 ? 00:00:00 ps
master>
So I figured I would try to kill the job using rsh, but that didn't work either:
Code:
master>rsh node2 "kill 15413"
master>rsh node2 ps
PID TTY TIME CMD
15413 ? 1-03:09:36 MYPROG
20946 ? 00:00:00 ps
master>rsh node2 "kill -9 -1"
master>rsh node2 ps
PID TTY TIME CMD
15413 ? 1-03:09:36 MYPROG
20980 ? 00:00:00 ps
How come ps can see this job when I use rsh, but nothing else can? More importantly, how do I kill this job?