Processes show 'Killed' at random during compile jobs under any user
Hello all,
* The plea: Sadly, this issue exhausted my google-fu, bashed it over the head, then unceremoniously set it on fire and left it to rot. :( I really hope there's someone out there who can point me in the right direction; this has been driving me nuts! I'll be sure to report the eventual solution here, to save any future poor soul from going through the same thing.
* Problem: At random, any single recently started process will simply show as 'Killed'. It can be anything from a lone cp command to about 5-6 different processes during a compile. There is no pattern to it that I can see, and it means retrying a compile anywhere from 1 to 6 times until either no kills occurred or they only hit insignificant tests during ./configure. It seems to happen to roughly every 100th to 400th process. The random kills don't immediately dump you to the command line, but they often cause compile jobs to fail with bizarre problems: access denied, missing files, gcc internal errors, undeclared functions, missing includes and headers, etc.
* Attempted solutions:
I have included some system info at the end, including uname and ulimit output.
* Random examples of the issue manifesting when compiling UnrealIRCd: Code:
Attempt #1: Slackware 13.0, 64bit
Type: Dedicated server at OVH, no VPS or virtual apps installed.
RAM: 4 GB
HDD: 2x 750 GB RAID1
CPU: Intel E8400, 3 GHz, 2 cores
uname -a: Code:
|
A couple of things to look at:
1. It could be the OOM killer: https://en.wikipedia.org/wiki/Out_of_memory https://lwn.net/Articles/317814/
2. If a process has been killed unceremoniously in the past, you can end up with corrupt files that then cause other programs using them to die as well.
|
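One way to check the OOM-killer theory is to search the kernel log for its messages. A minimal sketch, assuming the kernel logs lines containing "invoked oom-killer" or "Out of memory" when it kills a process (the exact wording varies between kernel versions); oom_lines is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# oom_lines: print the lines from stdin that look like OOM-killer activity.
# Reads log text on stdin so it works with dmesg output or a saved log file.
# Assumption: the kernel messages contain one of the two phrases below.
oom_lines() {
    # grep exits non-zero when nothing matches; treat that as "no hits".
    grep -i -E 'invoked oom-killer|out of memory' || true
}

# Typical use on a live system (dmesg may need root on some setups):
#   dmesg | oom_lines
```

If this prints nothing right after a process shows 'Killed', the OOM killer is probably not the culprit, though a kernel log that has already wrapped could hide older entries.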
Thanks for the response chrism,
That's a good call. I checked /proc/sys/vm/overcommit_memory, but it has a value of 0, which I assume means no overcommit of memory is allowed, rendering OOM a non-issue.
You may be quite right about corruption. I really have no idea which silent and delicate commands may have failed, so I might see some effects of this in the future, whenever I do figure this out (and don't just pick the nuclear option of wiping and rebuilding the system from scratch).
Edit: The system also had around 2-3 GB of memory free at the time any of these tasks ran.
|
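For reference, the kernel's vm sysctl documentation describes overcommit_memory = 0 as heuristic overcommit (the default), not as overcommit being disabled; only mode 2 turns on strict accounting. So a value of 0 does not by itself rule out the OOM killer, although 2-3 GB of free memory does make it less likely. A small sketch that decodes the value; describe_overcommit is a hypothetical helper name:

```shell
#!/bin/sh
# describe_overcommit: translate a vm.overcommit_memory value into words.
# The three modes are the ones listed in the kernel's vm sysctl documentation.
describe_overcommit() {
    case "$1" in
        0) echo "heuristic overcommit (default)" ;;
        1) echo "always overcommit" ;;
        2) echo "strict accounting, no overcommit" ;;
        *) echo "unknown mode: $1" ;;
    esac
}

# Typical use on a Linux system:
#   describe_overcommit "$(cat /proc/sys/vm/overcommit_memory)"
```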
Well, my work to troubleshoot the problem has hit a brick wall. Or a pile of them, in fact. But first, here is my attempt to debug this maddening issue:
- Created a script that touches 5000 files, to let me see the frequency of kills: Code:
root@phoenix:~/cptest# ./cptest.sh
- Tried to enable process accounting to get proper logging of all kills
- accton /var/log/pacct: No kernel support
Not wanting to compile the kernel just yet, I tried something else. SystemTap has a test case which allows tracing the source of SIGKILL calls.
- Compiled & installed systemtap
- Needs elfutils
- Compiled & installed elfutils
- Systemtap could not find kernel-debuginfo
Well then, looks like I'll need to bite the bullet and compile a custom kernel to enable debugging after all :(
- Current source 2.6.29.2 is different from running kernel.
- Downloaded 2.6.38.2 source
- Configured kernel, started compile.
- Error: no 64bit gcc. System is a 64bit kernel running in a 32bit software environment.
- Downloaded 64bit slackware iso
- Installed 64bit gcc compiler package
- Try to compile kernel: error: libmpft.so.1: missing library
- Reinstalled 32bit gcc package
Rather than compile gcc from source and deal with tedious package location/cohabitation issues, I went to see what solutions are out there for a mixed 32/64bit environment for my distribution.
- Downloaded slackware 13.0 multilib 32/64bit packages
- Started install of all packages: upgradepkg --reinstall --install-new *.t?z
... Lots of packages start installing as expected.
... I see some ominous 'Killed' messages, as expected for such an intense installation. A dead chmod here, a decapitated rmdir there, etc.
... Until finally the install process grinds to a halt with this sequence of errors: Code:
*** Pre-installing package glibc-2.9_multilib-x86_64-5alien_slack13.0...
Which left me with this:
root@phoenix:~/13.0# ls
-bash: /bin/ls: No such file or directory
root@phoenix:~/13.0# whereis ls
-bash: /usr/bin/whereis: No such file or directory
Logged out of my second ssh connection, tried to log in:
/bin/bash: No such file or directory
Connection to xxx closed.
So sadly, I'll never be able to trace the source of those mysterious SIGKILLs, and will need to rebuild the system from scratch again.
|
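The kill-frequency test mentioned earlier in this post (a script that runs touch thousands of times) could be sketched roughly as follows. This is not the original cptest.sh, just an illustration; count_killed is a hypothetical helper, and it assumes a shell such as bash or dash that reports a child terminated by SIGKILL as exit status 137 (128 + 9):

```shell
#!/bin/sh
# count_killed N CMD [ARGS...]
# Runs CMD N times and prints how many runs ended with exit status 137.
# Assumption: the shell maps "killed by SIGKILL" to 128 + 9 = 137,
# as bash and dash do; other shells may use a different encoding.
count_killed() {
    n=$1
    shift
    i=0
    killed=0
    while [ "$i" -lt "$n" ]; do
        status=0
        # Capture the exit status without tripping "set -e" in callers.
        "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 137 ]; then
            killed=$((killed + 1))
        fi
        i=$((i + 1))
    done
    echo "$killed"
}

# Example (the path is a placeholder and its directory must exist):
#   count_killed 5000 touch /tmp/cptest/probe
```

Dividing the number of runs by the printed count gives a rough figure to compare against the "every 100th to 400th process" estimate from the first post.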