LinuxQuestions.org - stopped processes - finding out who/what killed them?

Dear friends,

i can be regarded as a newcomer to linux, started learning and using linux within the last 2-3 months, because of some troubles i had during my M.Sc. studies at the university.

my problem is as follows:

i'm working with a simulation program written in the fortran language. depending on the size of the system being simulated, the program needs to run for several days (typically 3-7 days). i usually start 10 simulations at one time. each program generates 60 files which are about 270 KB in size, and some smaller files which take a total space of about 20 MB.

i go to the university and check the programs (processes) almost every day, and find that the processes are not running anymore, that they have been stopped/killed. i checked with the "last" command to make sure that there has not been a system reboot. it's very, very annoying not being able to make any progress, and if i can't find out and fix what is happening, it is going to cost me a whole semester :s

when i use the "top" command, i can see that all the 10 processes share the available CPU and each of the processes is using only 0.1 % of memory.

so: what/who is continuously killing my processes and how can i find out what happened? (by the way: i'm not the system root)

your help will be much appreciated.

thanks in advance,

alp

additional info: (after the reply of pan64)

i'm logging in to the server (where i run my simulations) from a nearby pc running windows xp, using "putty", and "SSHClient" for file transfer. i thought it might have something to do with the system or hardware (CPU/RAM). so here is what i found out:

CPU info (using the command cat /proc/cpuinfo)

the system has 24 processors and all have in common the following properties.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
cpu MHz : 2661.000
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida pni monitor ds_cpl vmx smx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 5337.46
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

Memory info: (using the command cat /proc/meminfo)
MemTotal: 16426712 kB
MemFree: 14492536 kB
Buffers: 199936 kB
Cached: 1409904 kB
SwapCached: 0 kB
Active: 1075904 kB
Inactive: 670136 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16426712 kB
LowFree: 14492536 kB
SwapTotal: 18481144 kB
SwapFree: 18481144 kB
Dirty: 1228 kB
Writeback: 0 kB
AnonPages: 136296 kB
Mapped: 28424 kB
Slab: 136472 kB
PageTables: 8952 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 26694500 kB
Committed_AS: 230004 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 272460 kB
VmallocChunk: 34359465863 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

OS : Red Hat Enterprise Linux Server release 5.3 (Tikanga)

and finally a snapshot of the top command (file output of the top command): i'm the user "alpaslan"

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7514 alpaslan 25 0 18012 5596 804 R 100.3 0.0 1013:45 mx
7586 alpaslan 25 0 18012 5580 788 R 100.3 0.0 989:40.53 mx
7658 alpaslan 25 0 18012 5576 788 R 100.3 0.0 1013:32 mx
7694 alpaslan 25 0 18012 5588 804 R 100.3 0.0 998:22.96 mx
7766 alpaslan 25 0 18012 5588 804 R 100.3 0.0 1002:55 mx
7802 alpaslan 25 0 18012 5580 788 R 100.3 0.0 996:23.61 mx
7838 alpaslan 25 0 18012 5580 788 R 100.3 0.0 992:09.72 mx
8640 user2 25 0 14644 2236 804 R 100.3 0.0 829:19.04 ny8
8875 user2 25 0 15512 3124 808 R 100.3 0.0 832:40.75 ny16
8920 user2 25 0 15800 3412 800 R 100.3 0.0 808:26.84 ny17
9196 user2 25 0 15536 3124 812 R 100.3 0.0 806:27.29 b3
7730 alpaslan 25 0 18012 5596 800 R 98.4 0.0 1016:21 mx
8595 user2 25 0 14512 2116 816 R 98.4 0.0 819:54.54 ny7
8687 user2 25 0 14776 2384 808 R 98.4 0.0 824:16.45 ny9
8828 user2 25 0 15220 2816 808 R 98.4 0.0 800:44.19 ny15
8967 user2 25 0 16092 3708 804 R 98.4 0.0 798:22.93 ny18
9012 user2 25 0 16384 4008 804 R 98.4 0.0 808:55.70 ny19
9243 user2 25 0 16052 3660 804 R 98.4 0.0 818:57.09 b4
9290 user2 25 0 16564 4180 804 R 98.4 0.0 802:05.27 b5
9151 user2 25 0 15020 2620 812 R 94.6 0.0 827:16.84 b2
9059 user2 25 0 13992 1604 836 R 53.0 0.0 820:52.76 b
8781 user2 25 0 14928 2528 812 R 51.1 0.0 820:49.53 ny14
9106 user2 25 0 14504 2108 824 R 51.1 0.0 823:42.60 b1
7550 alpaslan 25 0 18012 5580 788 R 49.2 0.0 984:19.36 mx
7622 alpaslan 25 0 18012 5588 804 R 49.2 0.0 983:44.28 mx
8736 user2 25 0 14636 2244 820 R 47.3 0.0 821:16.90 ny13
8550 user2 25 0 14384 1996 820 R 45.4 0.0 803:58.95 ny6
9335 user2 25 0 17080 4688 800 R 45.4 0.0 784:37.43 b6
12682 alpaslan 15 0 12864 1184 708 R 5.7 0.0 0:00.04 top
1 root 15 0 10344 688 572 S 0.0 0.0 0:02.79 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.05 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/1
.
.
.
.
6015 root 15 0 159m 2540 1948 S 0.0 0.0 0:00.01 gdm-binary
6109 root 18 0 188m 2356 1664 S 0.0 0.0 0:00.00 gdm-binary
6111 root 18 0 245m 4124 3404 S 0.0 0.0 0:00.00 gdm-rh-security
6114 root 15 0 98604 5968 3976 S 0.0 0.0 0:03.60 Xorg
6137 gdm 15 0 216m 16m 6944 S 0.0 0.1 0:00.19 gdmgreeter
12089 root 16 0 88072 3564 2848 S 0.0 0.0 0:00.02 sshd
12091 alpaslan 15 0 88216 1736 1012 S 0.0 0.0 0:00.20 sshd
12092 alpaslan 15 0 66044 1608 1188 S 0.0 0.0 0:00.07 bash
12120 alpaslan 15 0 12868 1288 808 T 0.0 0.0 0:00.03 top
12121 root 18 0 88072 3224 2512 S 0.0 0.0 0:00.01 sshd
12123 alpaslan 15 0 88216 1764 1012 S 0.0 0.0 0:00.73 sshd
12124 alpaslan 15 0 54000 2088 1528 S 0.0 0.0 0:03.01 sftp-server
12638 alpaslan 15 0 12868 1300 812 T 0.0 0.0 0:00.03 top
12639 alpaslan 15 0 12868 1296 812 T 0.0 0.0 0:00.08 top
12675 alpaslan 15 0 12868 1304 816 T 0.0 0.0 0:00.09 top
12676 alpaslan 15 0 12868 1300 812 T 0.0 0.0 0:00.04 top
12677 alpaslan 16 0 63512 796 664 T 0.0 0.0 0:00.00 less

I'm not familiar with fortran, but in general there are a few possibilities:
1. the process ends normally (that does not mean it successfully completed its job; maybe it exited because of some error). you need to add traces to the sources to find out what caused it.
2. the process runs into a fatal error (segmentation fault or similar) and stops. there can be some error output related to it; the parent shell should collect such messages.
3. the process gets a signal from outside. In this case you will need a signal handler to log the event, but some signals cannot be caught at all.
4. the host crashed because of a kernel panic or power outage (or similar), but in this case you will see the reboot.

you may start the process with strace and look at the log, but in this case the process will be slow, really slow.

Anyway, you can only get a result by watching that process or by searching in log files.
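
The first three possibilities can often be told apart from the exit status that the parent shell sees: 0 for a normal end, a small non-zero number for an error exit, and (by shell convention) 128 plus the signal number when a signal ended the process. Below is a minimal sketch of a wrapper that records this, assuming bash; the function names describe_status and run_and_log and the log path are made up for illustration.

```shell
#!/bin/bash
# Turn a shell exit status into a short description.
# By shell convention, a status above 128 means "ended by signal (status - 128)".
describe_status() {
    local status=$1 signum name
    if [ "$status" -eq 0 ]; then
        echo "exited normally (status 0)"
    elif [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # kill -l N prints the signal name without the SIG prefix
        name=$(kill -l "$signum" 2>/dev/null) || name="?"
        echo "killed by signal $signum (SIG$name)"
    else
        echo "exited with error status $status"
    fi
}

# Run a command, wait for it, and append a timestamped line on how it ended.
# Usage: run_and_log LOGFILE COMMAND [ARGS...]
run_and_log() {
    local logfile=$1 status
    shift
    "$@"
    status=$?
    echo "$(date '+%F %T') $*: $(describe_status "$status")" >> "$logfile"
    return "$status"
}

# Example (hypothetical log path), e.g. as the last line of a script started with nohup:
#   run_and_log "$HOME/mx.exit.log" ./mx
```

This tells you which signal ended the program, not who sent it. If the wrapper itself is killed with SIGKILL, nothing gets logged.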

thank you very much for your reply pan64,

what the program does is: it is simulating a lattice, consisting of 50x50x50 points in space. it's constructing that lattice, using a 3-dimensional array, and then making changes on it by going through the whole lattice, all this many thousand times. in doing this it's producing 60 files as output, each of these 60 files having a file size of 275 KB.

i'm running the same program at home, on an openSUSE distribution (12.1) which i installed as a 2nd OS besides windows, and it's running "perfectly". at home i never encountered any problems.

so i can surely say that #1 (process ending normally) is out of the question, as is #2 (running into a fatal error). sometimes, depending on the change of a variable in the program, it doesn't produce all of the 60 files, only 55 or 50 instead; but even in this case the program/process continues to run without exiting, it only stops producing the output files.
#4 happens very rarely (system reboot - after power shortage) but in most cases there was no system-reboot.

so it's either an external signal, or (from what i've found on surfing google) an "out of memory killer"(?) - in other words, overcommitting memory(?) , don't know if that's the correct description of it.

thank you again pan64,

cheers,

alp
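
On the "out of memory killer" idea: when the kernel's OOM killer ends a process, it writes a line about it to the kernel log, so that log is the place to look. Below is a small sketch, assuming the usual wording of those messages (it has varied between kernel versions); the helper name find_oom_lines is made up. For what it's worth, the meminfo output above shows about 14 GB free and the top snapshot shows each mx at roughly 5.5 MB resident, so an OOM kill would be surprising here.

```shell
#!/bin/bash
# Filter kernel-log text on stdin for lines that typically accompany an OOM kill.
# The patterns are an assumption about the message wording, not an exhaustive list.
find_oom_lines() {
    grep -i -E 'out of memory|oom-killer|killed process'
}

# Typical use; reading the kernel log may need root on some systems:
#   dmesg | find_oom_lines
```

If nothing matches and you can't read the log yourself, the sys admin can run the same search on the system log files.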

I don't know if this is the issue. Do you start the program through ssh? If so, when your session ends, your process will be terminated. If this is the case, you can check the nohup command.

Regards,

You are right, that can be the case, but then ssh will send a signal to that process and that signal can be caught. What I suggest is to catch the signal and try to find out where it is coming from.
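
Catching the signal inside the fortran program needs changes to its source. A rough stand-in, sketched below under the assumption that the program is started from a bash wrapper script, is to let the wrapper log the signals it receives itself. This only sees signals delivered to the wrapper (for example a hangup sent to the whole session); a signal sent directly to mx does not pass through it, and a shell trap cannot tell who the sender was. The name log_signals_to and the log path are made up for illustration.

```shell
#!/bin/bash
# Install logging traps for the catchable signals most likely to matter here.
# SIGKILL and SIGSTOP can never be trapped, and a signal that was already
# ignored when the shell started (e.g. HUP under nohup) cannot be trapped either.
# Usage: log_signals_to LOGFILE
log_signals_to() {
    local logfile=$1 sig
    for sig in HUP INT QUIT TERM USR1 USR2 XCPU; do
        trap "echo \"\$(date '+%F %T') wrapper received SIG$sig\" >> '$logfile'" "$sig"
    done
}

# Example (hypothetical log path):
#   log_signals_to "$HOME/mx.signals.log"
#   ./mx &       # start the simulation as a background child of the wrapper
#   wait "$!"    # wait is interrupted by a trapped signal, so the trap runs at once
```

Running the program in the background and using wait matters: while the shell waits on a foreground command, it delays the trap until that command has finished.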

Very likely it's running up against the default limit setting for cpu time per job.
See /etc/security/limits.conf, in which case see your SysAdmin

thanks for your replies pan64, The_eXXe and chrism01,

i'm running those programs through "putty". the commands i use are:

"ifort main.f -o mx" for compiling the fortran code and getting an executable with the name "mx", which is the same you can see from the top command snapshot i posted.

and for running the program i use the command : "nohup ./mx&" so that it continues running after i close the terminal window (putty).

@chrism01: i will have a look at that file as soon as possible. hopefully i don't have to be root in order to look into that file.

besides that; is there a way for me to find out, by looking into a log file -or any other way- what killed the process? i mean, is there a file which keeps track of who or what killed a process?
and: if such a way/file exists, how do i get there? (what are the commands i need to use or the files i need to look into?)

thanks again :)

alp

No, there is no such feature in general, but on some systems it is implemented. The easy way is to redirect stdout and stderr into files; maybe you will see something in them:

Code:

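# note: $$ is the PID of the shell that starts the job, not of mx, so runs started from the same shell share these file names
nohup ./mx >/my/home/log/mx.$$.stdout 2>/my/home/log/mx.$$.stderr &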

and also you can try

Code:

nohup strace -o /my/home/log/mx.$$.strace ./mx >/my/home/log/mx.$$.stdout 2>/my/home/log/mx.$$.stderr &

and you will see how the program stopped.
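
The interesting part of such a strace log is usually at the very end. Below is a small sketch for pulling out just the signal and termination records, assuming the log was written with strace -o and without -f, so that lines carry no PID prefix. The exact format depends on the strace version: newer versions print the sender's si_pid and si_uid in the signal line, older ones may not. The function name strace_signal_lines is made up.

```shell
#!/bin/bash
# Print the lines strace writes for delivered signals ("--- SIG... ---")
# and for process termination ("+++ ... +++").
# Usage: strace_signal_lines LOGFILE
strace_signal_lines() {
    grep -E '^(---|\+\+\+) ' "$1"
}

# Example, using the log file name from the command above:
#   strace_signal_lines /my/home/log/mx.$$.strace
#   tail -n 5 /my/home/log/mx.$$.strace    # context: the last system calls before the end
```

If only signals are of interest, strace can be started with -e trace=none, which keeps the log small; whether that helps with the slowdown depends on the strace version.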

If a process was killed because a quota was exceeded then you might get an error message. For some limits that PAM may enforce, such as execution time, you may need to find the answer in the kernel log. If you don't have root access, you may need to ask a sys admin if there were any such messages. There are also file system quotas, and Linux policies that may cause your program to fail.

There are hard limits and soft limits. A soft limit is one you may be able to increase yourself, up to but not exceeding the hard limit. Enter "ulimit -a" to see what your limits are (plain "ulimit" only reports the file size limit). Enter "help ulimit" for help information.
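
For the CPU time limit in particular, the soft and hard values can be printed separately. This is a minimal sketch, assuming bash or a similar shell whose ulimit builtin accepts -S, -H and -t; the function name show_cpu_limits is made up.

```shell
#!/bin/bash
# Show the soft and hard CPU-time limits (in seconds) of the current shell.
# Processes started from this shell inherit them.
show_cpu_limits() {
    echo "cpu time soft: $(ulimit -S -t)"
    echo "cpu time hard: $(ulimit -H -t)"
}

show_cpu_limits
# Full list of limits for comparison:
ulimit -a
```

If both values read "unlimited", the CPU time limit is not what ends the jobs. Otherwise, a process that passes the soft limit is sent SIGXCPU, and one that reaches the hard limit is killed with SIGKILL.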

File system quotas may also be configured with soft and hard limits. You can exceed the soft limit temporarily, for a grace period, after which the soft limit is enforced. It's possible for a program to create temporary files, in /tmp perhaps. The program exceeds the quota, which is then enforced 12 or 24 hours later. When the program aborts, the files may be gone.

You might consider using screen instead of nohup. This will allow you to log in later and reattach to the console. If error messages were printed, you can see them. (This won't work if the screen session itself was terminated.)

hi there,

as i was running the same program - which is causing me so much trouble at the university - at home too, i got the idea to take a look into the "nohup.out" file which is being produced (probably because i run the nohup command?).

as it was a process which continued running without producing any output files, i had to kill it myself. (depending on a variable, it either produces all the 60 output files and terminates itself, or it produces less than 60 files and i have to terminate it myself.)

after i had killed the process, the nohup.out file contained the output it would normally print on screen (if i had not used the nohup command and had run it in the foreground instead), and in the last few lines some output which says "SIGTERM" plus an error(?) code in front of it.

is it possible to find out from that error code whether linux (or the kernel) killed the process, or it was killed by a CPU/memory limitation, or by a person (the root/system administrator)?

thanks in advance,

alp

Hi,
SIGTERM means your process got a signal to terminate. It can come either from you (the owner of the process) or from root, but from no one else. "You" means any of your processes, maybe the parent process, or another process. If you could catch the signal, you would have the chance to handle it (SIGTERM can also be ignored).
It would be better to show us that message instead of just writing "the last few lines some output which says "SIGTERM" plus an error(?) code in front of it".

hi pan64,

as i said, it was just the nohup.out file of the process i ran at home, on my own linux and i killed it myself.

today i intend to go to the university and check the nohup.out files of the processes i'm running there, and this time i'll post the lines as they are :)

many thanks,

alp