Hi all,
Apologies in advance if this is not the right board to post this on, but I thought I'd give it a shot here first. I'm having a very weird problem on a RedHat EL 4 Update 5 x86 machine.
Random processes are just being killed without any explanation. There's nothing in any of the logs to explain why. I've googled this a bit and not really found anything.
For example, I was vi'ing a script, and the process just got killed and I was dropped back to my bash prompt. Now, every time I try to vi the file again, I get this...
Code:
E325: ATTENTION
Found a swap file by the name ".VCSbltftp.sh.swp"
owned by: root dated: Thu Jul 17 21:51:34 2008
file name: /usr/local/bin/VCS/VCSbltftp.sh
modified: YES
user name: root host name: hostname.mydomain.com
process ID: 31854
While opening file "VCSbltftp.sh"
dated: Thu Jul 17 21:28:24 2008
(1) Another program may be editing the same file.
If this is the case, be careful not to end up with two
different instances of the same file when making changes.
Quit, or continue with caution.
(2) An edit session for this file crashed.
If this is the case, use ":recover" or "vim -r VCSbltftp.sh"
to recover the changes (see ":help recovery").
If you did this already, delete the swap file ".VCSbltftp.sh.swp"
to avoid this message.
Swap file ".VCSbltftp.sh.swp" already exists!
[O]pen Read-Only, (E)dit anyway, (R)ecover, (Q)uit, (A)bort, (D)elete it:
All perfectly normal... but I've found I can reproduce the problem: if I just leave this prompt unanswered, within a few seconds it prints the word "Killed" and I'm dropped back to bash.
So, I ran vi and then an strace on its PID, which comes up with this...
Code:
# strace -p 9509
Process 9509 attached - interrupt to quit
select(1, [0], NULL, [0], NULL) = ? ERESTARTNOHAND (To be restarted)
+++ killed by SIGKILL +++
Process 9509 detached
...and this on another attempt with some different options (still vi'ing the same file)
Code:
# strace -c -v -p 25900
Process 25900 attached - interrupt to quit
Process 25900 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 49.35    0.000800         267         3           write
 12.77    0.000207          30         7         1 open
  7.83    0.000127          14         9         1 stat64
  5.18    0.000084          12         7           close
  4.75    0.000077          26         3           read
  2.90    0.000047          47         1           connect
  2.71    0.000044           9         5           fstat64
  2.41    0.000039          39         1           unlink
  1.73    0.000028           9         3           ioctl
  1.54    0.000025          13         2           mmap2
  1.23    0.000020          20         1           send
  1.17    0.000019          19         1           socket
  1.05    0.000017          17         1           munmap
  0.99    0.000016           8         2           fcntl64
  0.93    0.000015          15         1           pread64
  0.86    0.000014          14         1         1 kill
  0.86    0.000014          14         1           recvmsg
  0.62    0.000010           5         2           poll
  0.31    0.000005           5         1           brk
  0.25    0.000004           4         1           access
  0.25    0.000004           2         2         1 select
  0.19    0.000003           3         1           uname
  0.12    0.000002           2         1           getuid32
------ ----------- ----------- --------- --------- ----------------
100.00    0.001621                    57         4 total
Nothing obvious. I've read a few posts about the OOM killer killing processes that hog resources; however, in my case I doubt this is the problem. As you can see, I have plenty of free memory and swap:
Code:
# free -m
             total       used       free     shared    buffers     cached
Mem:          8114       1374       6739          0        157        505
-/+ buffers/cache:        711       7403
Swap:         8136          0       8136
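For anyone wanting to rule the OOM killer in or out on their own box, this is roughly how the kernel log can be searched for its messages. It's only a sketch: the check_oom function name is made up for illustration, and /var/log/messages is assumed to be where syslog writes kernel messages (the usual default on Red Hat, but it may differ on your setup).

```shell
#!/bin/sh
# Hypothetical helper: look for OOM-killer traces in a log file.
# Assumption: kernel messages end up in /var/log/messages by default.
check_oom() {
    logfile="${1:-/var/log/messages}"
    # Case-insensitive match on the two phrases the kernel typically logs.
    grep -i -E 'out of memory|oom-killer' "$logfile"
}
```

Calling check_oom with no argument scans the default log; passing a path (for example a rotated log) scans that file instead. The same pattern can be applied to the kernel ring buffer with dmesg | grep -i -E 'out of memory|oom-killer'. No output means no matching lines were found in that source.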
BTW this is a 2 x Quad Core AMD Opteron machine (HP ProLiant DL385 G5). Kernel version is 2.6.9-55.ELsmp.
I've tried disabling SELinux too, but that's made no difference.
Does anyone have any ideas? This is driving me mad.
Cheers!