Well, my attempt to troubleshoot the problem has hit a brick wall. Or a pile of them, in fact. But first, here is how I tried to debug this maddening issue:
- Created a script that touches 5000 files, to let me see the frequency of kills:
Code:
root@phoenix:~/cptest# ./cptest.sh
./cptest.sh: line 7: 7936 Killed touch file.i$i
./cptest.sh: line 7: 8657 Killed touch file.i$i
./cptest.sh: line 7: 8798 Killed touch file.i$i
./cptest.sh: line 7: 10004 Killed touch file.i$i
./cptest.sh: line 7: 12061 Killed touch file.i$i
./cptest.sh: line 7: 12484 Killed touch file.i$i
- Used egrep to scan all files in /var/log for instances of 'Killed': nothing.
- Tried to enable process accounting to get proper logging of all kills
- accton /var/log/pacct: No kernel support
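For anyone who wants to reproduce the kill-frequency test, here is a minimal sketch of what such a script could look like. It is not the original cptest.sh, which was not posted; the function name count_kills and the scratch directory from mktemp are assumptions made for illustration. It relies on the shell convention that a child terminated by a signal returns 128 plus the signal number, so a SIGKILL shows up as exit status 137.

```shell
#!/bin/sh
# Hypothetical stand-in for cptest.sh; the real script was not posted.
# count_kills N: touch N files in a scratch directory and report how many
# of the touch processes died from SIGKILL.
count_kills() {
    ck_total=$1
    ck_dir=$(mktemp -d) || return 1
    ck_killed=0
    ck_i=1
    while [ "$ck_i" -le "$ck_total" ]; do
        touch "$ck_dir/file.$ck_i"
        ck_status=$?
        # The shell reports 128 + signal number for a signalled child;
        # SIGKILL is signal 9, hence 137.
        if [ "$ck_status" -eq 137 ]; then
            ck_killed=$((ck_killed + 1))
        fi
        ck_i=$((ck_i + 1))
    done
    rm -rf "$ck_dir"
    echo "killed=$ck_killed of $ck_total"
}

# Default to the 5000 iterations used in the test above.
count_kills "${1:-5000}"
```

Counting the 137 exit statuses gives a single number per run, which is easier to compare between runs than scrolling through the shell's own 'Killed' messages.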
Not wanting to compile the kernel just yet, I tried something else. SystemTap has a test case that allows tracing the source of SIGKILL signals.
- Compiled & installed systemtap
- Needs elfutils
- compiled & installed elfutils
- Systemtap could not find kernel-debuginfo
Well then, looks like I'll need to bite the bullet and compile a custom kernel to enable debugging after all.
- Current source 2.6.29.2 is different from running kernel.
- Downloaded 2.6.38.2 source
- Configured kernel, started compile.
- Error: no 64bit gcc. System is a 64bit kernel running in a 32bit software environment.
- Downloaded 64bit slackware iso
- Installed 64bit gcc compiler package
- Tried to compile kernel: error: libmpfr.so.1: missing library
- Reinstalled 32bit gcc package
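The 64bit kernel / 32bit userland mismatch behind the gcc error can be checked before starting a build. A small sketch, assuming uname and getconf are both available (as they are on glibc-based systems); the variable names are only illustrative:

```shell
#!/bin/sh
# Compare what the kernel reports with the word size of the userland.
kernel_arch=$(uname -m)          # e.g. x86_64 for a 64bit kernel
user_bits=$(getconf LONG_BIT)    # 32 or 64, from the installed C library

echo "kernel machine type: $kernel_arch"
echo "userland word size:  ${user_bits}-bit"

case "$kernel_arch:$user_bits" in
    x86_64:32) echo "64bit kernel with 32bit userland" ;;
    *)         echo "no x86_64/32bit mismatch detected" ;;
esac
```

If the first case matches, the stock gcc will most likely only produce 32bit binaries, which is the situation described above.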
Rather than compile gcc from source and deal with tedious package location/cohabitation issues,
I went to see what solutions are out there for a mixed 32/64bit environment for my distribution.
- Downloaded slackware 13.0 multilib 32/64bit packages
- Started install of all packages: upgradepkg --reinstall --install-new *.t?z
... Lots of packages start installing as expected.
... I see some ominous 'Killed' messages, as expected for such an intense installation: a dead chmod here, a decapitated rmdir there, etc.
... Until finally the install process grinds to a halt with this sequence of errors:
Code:
*** Pre-installing package glibc-2.9_multilib-x86_64-5alien_slack13.0...
install/doinst.sh: line 221: /usr/bin/rm: No such file or directory
/sbin/installpkg: line 550: /usr/bin/cp: No such file or directory
/sbin/installpkg: line 551: /usr/bin/chmod: No such file or directory
/sbin/upgradepkg: /sbin/removepkg: /bin/sh: bad interpreter: No such file or directory
Cannot install glibc-zoneinfo-2011i_2011n_multilib-noarch-2alien.txz: invalid package extension
Normally this would not worry me: just a typical job failing because an earlier command failed. But something catches my eye. /usr/bin/rm missing? cp and chmod too? That's when I start to worry. It looks like something a little too critical got killed. My guess is that some part of the install script assumed a certain command would succeed, such as a cd into a directory it knows exists. The cd must have failed while the script was still in /usr/bin, and the rm -rf * that followed most certainly didn't fail.
Which left me with this:
root@phoenix:~/13.0# ls
-bash: /bin/ls: No such file or directory
root@phoenix:~/13.0# whereis ls
-bash: /usr/bin/whereis: No such file or directory
Logged out of my second ssh connection and tried to log in again:
/bin/bash: No such file or directory
Connection to xxx closed.
So sadly, I'll never be able to trace the source of those mysterious SIGKILLs, and will need to rebuild the system from scratch again.
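For what it's worth, the failure mode I guessed at above (a cd that is assumed to succeed, followed by rm -rf *) is easy to guard against. The sketch below is not taken from the glibc doinst.sh or from installpkg; clean_dir and the sandbox layout are made up for illustration, and everything runs inside a throwaway directory from mktemp.

```shell
#!/bin/sh
# clean_dir DIR: empty DIR, but only if the cd into it really succeeded.
# The unguarded form would be:   cd "$1"; rm -rf ./*
# which deletes from whatever directory the shell is still in when cd fails.
clean_dir() {
    (
        cd "$1" || exit 1    # stop here instead of deleting in the wrong place
        rm -rf ./*
    )
}

# Demonstration inside a throwaway sandbox.
sandbox=$(mktemp -d) || exit 1
mkdir "$sandbox/work"
touch "$sandbox/keep.txt" "$sandbox/work/junk.txt"
cd "$sandbox" || exit 1

clean_dir "$sandbox/work"    # cd succeeds, junk.txt is removed

if clean_dir "$sandbox/missing" 2>/dev/null; then
    echo "unexpected: cleaned a directory that does not exist"
else
    echo "cd failed, nothing was deleted"
fi

ls "$sandbox"
```

Running the cd and the rm in a subshell also means the caller's working directory is never changed, so a later command cannot end up in the wrong place either. The sandbox is left behind on purpose so the result can be inspected; remove it by hand afterwards.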