
Draeguin 01-09-2012 10:56 AM

Processes show 'Killed' at random during compile jobs under any user
 
Hello all,

* The plea:

Sadly, this issue exhausted my google-fu, bashed it over the head, then unceremoniously set it on fire and left it to rot. :( I really hope someone out there can point me in the right direction; this has been driving me nuts! I'll be sure to report the eventual solution here, to spare any future poor soul from going through the same thing.


* Problem:

At random, any single recently started process will simply show as 'Killed'. The victim can be anything from a lone cp command to 5-6 different processes during a compile. There is no pattern I can see, and it means retrying a compile anywhere from one to six times until either no kills occurred, or they only hit insignificant tests during ./configure. It seems to strike roughly every 100th-400th process.

The random kills don't immediately dump you to the command line, but they often cause compile jobs to fail with bizarre symptoms: access denied, missing files, gcc internal errors, undeclared functions, missing includes and headers, and so on.


* Attempted solutions:
  • Ran memtest and a CPU test: no errors.
  • dmesg, /var/log/messages and /var/log/debug show nothing whenever a process is selected for the guillotine.
  • Fiddled with ulimits, making them unlimited or very roomy (see the snippet after this list).
  • Tried compiling as root as well as the normal user.
  • Checked every log under /var/log for anything relevant (syslog, messages, debug, secure); nothing is written to them when a kill occurs.
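
For reference, 'very roomy' in practice meant settings like the ones in the ulimit -a dumps at the end of this post, applied in the build shell, e.g.:

Code:

ulimit -c unlimited   # core file size
ulimit -s 128000      # stack size (kbytes)
ulimit -u 31300       # max user processes
ulimit -v unlimited   # virtual memory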

I have included some system info at the end, including uname and ulimit output.

* Random examples of the issue manifesting when compiling UnrealIRCd:
Code:

Attempt #1:
gcc: Internal error: Killed (program collect2)

#2
configure: creating ./config.status
./configure: line 16906:  9341 Killed                  cat >>$CONFIG_STATUS  <<_ACEOF

#3:
../libtool: line 5803:  3337 Done                    $echo "X$obj"
      3338 Killed                  | $Xsed -e 's%^.*/%%'

#4:
checking whether we are using the GNU C compiler... ./configure: line 3200:  5878 Killed                  cat confdefs.h >>conftest.$ac_ext

#5:
checking for struct in6_addr... ./configure: line 19057: 14120 Killed                  rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext

Neostats #1:
checking for strcmpi... no
checking deeper for strcmpi... ./configure: line 13589: 19381 Killed                  grep -v '^ *+' conftest.er1 >conftest.err

#2:
checking types of arguments and return type for send... int,const void *,size_t,int,int
./configure: line 17008: 21550 Killed

#3:
/bin/sh: line 9:  5741 Killed                  make $local_target
make[1]: *** [all-recursive] Error 1[...]

System info:

Slackware 13.0, 64-bit
Type: Dedicated server at OVH; no VPS or virtualization layer involved.
RAM: 4 GB
HDD: 2x 750 GB, RAID1
CPU: Intel E8400, 3 GHz, dual core.

uname -a:
Code:

Linux phoenix 2.6.38.2-grsec-xxxx-grs-ipv6-64 #2 SMP Thu Aug 25 16:40:22 UTC 2011 x86_64 Intel(R) Core(TM)2 Duo CPU    E8400  @ 3.00GHz GenuineIntel GNU/Linux

ulimit -a (as user):
Code:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 31300
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 128000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31300
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

ulimit -a (as root):
Code:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 31300
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 128000
cpu time               (seconds, -t) unlimited
max user processes              (-u) 31300
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


chrism01 01-09-2012 07:49 PM

A couple of things to look at:

1. It could be the OOM killer: https://en.wikipedia.org/wiki/Out_of_memory and https://lwn.net/Articles/317814/

2. If a process has been killed unceremoniously in the past, you can end up with corrupt files that then cause other things using them to die as well.
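
For no. 1, the OOM killer normally leaves tracks in the kernel log, so a quick check (the exact message wording varies a bit between kernel versions) is:

Code:

dmesg | egrep -i 'oom-killer|out of memory|killed process'
grep -i oom /var/log/messages

If those turn up nothing, OOM gets much less likely.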

Draeguin 01-10-2012 05:22 AM

Thanks for the response, chrism.

That's a good call. I checked /proc/sys/vm/overcommit_memory and it has a value of 0, which turns out to be the kernel's default heuristic-overcommit mode rather than a ban on overcommit (2 is the setting that disables it), so that alone doesn't rule the OOM killer out. But given the free memory on this box and the silent logs, OOM looks like a non-issue.
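
For reference, what I checked (mode 0 is the kernel default; 2 is the strict no-overcommit setting):

Code:

cat /proc/sys/vm/overcommit_memory   # prints 0 here
free -m                              # roughly 2-3 GB free at the time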

You may well be right about corruption. I really have no idea what potentially silent and delicate commands may have failed already, so I may still see fallout from this once I do figure the kills out (assuming I don't just take the nuclear option of wiping and rebuilding the system from scratch).

Edit: The system also had around 2-3 GB of memory free at the time any of these tasks took place.

Draeguin 01-11-2012 06:26 AM

Well, my work to troubleshoot the problem has hit a brick wall. Or a pile of them, in fact. But first, here is how I tried to debug this maddening issue:

- Created a script that touches 5000 files, to let me see the frequency of kills (a sketch of the script follows this list):
Code:

root@phoenix:~/cptest# ./cptest.sh
./cptest.sh: line 7:  7936 Killed                  touch file.i$i
./cptest.sh: line 7:  8657 Killed                  touch file.i$i
./cptest.sh: line 7:  8798 Killed                  touch file.i$i
./cptest.sh: line 7: 10004 Killed                  touch file.i$i
./cptest.sh: line 7: 12061 Killed                  touch file.i$i
./cptest.sh: line 7: 12484 Killed                  touch file.i$i

- Used egrep to scan every file under /var/log for instances of 'Killed': nothing.
- Tried to enable process accounting to get proper logging of all kills.
- accton /var/log/pacct: no kernel support.
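
The script itself was trivial; reconstructed from memory it was roughly this (the touch sits on line 7, matching the output above):

Code:

#!/bin/sh
# cptest.sh: spawn thousands of short-lived processes so that
# random kills show up in the shell's job-control messages.
i=0
while [ $i -lt 5000 ]; do
    i=$((i+1))
    touch file.i$i
done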

Not wanting to compile a kernel just yet, I tried something else. SystemTap ships a test case that can trace the source of SIGKILL calls.
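
The probe I had in mind was along these lines (from memory; signal.send is a standard tapset probe exposing the sender and target, and it needs kernel debuginfo to resolve):

Code:

stap -e 'probe signal.send {
    if (sig_name == "SIGKILL")
        printf("%s(%d) sent SIGKILL to %s(%d)\n",
               execname(), pid(), pid_name, sig_pid)
}'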

- Compiled & installed SystemTap
- It needs elfutils
- Compiled & installed elfutils
- SystemTap could not find kernel-debuginfo

Well then, looks like I'll need to bite the bullet and compile a custom kernel to enable debugging after all :(

- The installed kernel source (2.6.29.2) doesn't match the running kernel.
- Downloaded the 2.6.38.2 source.
- Configured the kernel and started the compile (rough recipe sketched after this list).
- Error: no 64-bit gcc. The system runs a 64-bit kernel on top of a 32-bit userland.
- Downloaded the 64-bit Slackware ISO.
- Installed the 64-bit gcc compiler package.
- Tried to compile the kernel: error: libmpfr.so.1: missing library.
- Reinstalled the 32-bit gcc package.
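
For the record, the intended build recipe was the standard one plus debug info for SystemTap (a sketch; it assumes the running kernel exposes /proc/config.gz and that lilo is the boot loader, as is usual on Slackware):

Code:

cd /usr/src/linux-2.6.38.2
zcat /proc/config.gz > .config    # start from the running kernel's config
make oldconfig
make menuconfig                   # enable CONFIG_DEBUG_INFO under 'Kernel hacking'
make bzImage modules
make modules_install
cp arch/x86/boot/bzImage /boot/vmlinuz-2.6.38.2-debug
# add a matching entry to /etc/lilo.conf, then run lilo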

Rather than compile gcc from source and deal with tedious package location/cohabitation issues, I went to see what solutions are out there for a mixed 32/64-bit environment on my distribution.

- Downloaded the Slackware 13.0 multilib 32/64-bit packages.
- Started installing all of them: upgradepkg --reinstall --install-new *.t?z
... Lots of packages start installing as expected.
... I see some ominous 'Killed' messages, by now expected for such a process-heavy installation: a dead chmod here, a decapitated rmdir there, etc.
... Until finally the install grinds to a halt with this sequence of errors:

Code:

*** Pre-installing package glibc-2.9_multilib-x86_64-5alien_slack13.0...
install/doinst.sh: line 221: /usr/bin/rm: No such file or directory
/sbin/installpkg: line 550: /usr/bin/cp: No such file or directory
/sbin/installpkg: line 551: /usr/bin/chmod: No such file or directory
/sbin/upgradepkg: /sbin/removepkg: /bin/sh: bad interpreter: No such file or directory
Cannot install glibc-zoneinfo-2011i_2011n_multilib-noarch-2alien.txz:  invalid package extension

Normally this wouldn't worry me; it's just a typical job failing because an earlier command failed. But something catches my eye: /usr/bin/rm missing? cp and chmod too? That's when I start to worry. It looks like something a little too critical got killed, and I guess some part of the install script assumed a certain command would succeed, such as a cd into a directory it knows exists. The cd must have been killed while the script's working directory was /usr/bin, and the rm -rf * that followed most certainly didn't fail.
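
For anyone puzzling over how one killed command can empty /usr/bin: the hazardous pattern looks something like this (hypothetical fragment, not the actual doinst.sh):

Code:

cd "$TMP/scratch"   # if this cd is killed or fails, the cwd stays wherever it was...
rm -rf *            # ...and this wipes that directory instead

# the defensive spelling aborts rather than destroys:
cd "$TMP/scratch" && rm -rf ./*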

Which left me with this:

Code:

root@phoenix:~/13.0# ls
-bash: /bin/ls: No such file or directory
root@phoenix:~/13.0# whereis ls
-bash: /usr/bin/whereis: No such file or directory

Logged out of my second SSH connection and tried to log back in:

Code:

/bin/bash: No such file or directory
Connection to xxx closed.

So sadly, I'll never be able to trace the source of those mysterious SIGKILLs, and will need to rebuild the system from scratch again.

