stopped processes - finding out who/what killed them?
i can be regarded as a newcomer to linux, started learning and using linux within the last 2-3 months, because of some troubles i had during my M.Sc. studies at the university.
my problem is as follows:
i'm working with a simulation program written in fortran. depending on the size of the system being simulated, the program needs to run for several days (typically 3-7 days). i usually start 10 simulations at one time. each program generates 60 files which are about 270 KB in size, and some smaller files which take a total space of about 20 MB.
i go to the university and check the programs (processes) almost every day, and i find that the processes are not running anymore: they have been stopped/killed. i checked with the "last" command to make sure that there has not been a system reboot. it's very annoying not being able to make any progress, and if i can't find out and fix what is happening, it is going to cost me a whole semester :s
when i use the "top" command, i can see that all the 10 processes share the available CPU and each of the processes is using only 0.1 % of memory.
so: what/who is continuously killing my processes and how can i find out what happened? (by the way: i'm not the system root)
your help will be much appreciated.
thanks in advance,
additional info: (after the reply of pan64)
i'm logging in to the server (where i run my simulations) from a nearby pc running windows xp, using "putty", and "SSHClient" for file transfer. i thought the problem might have something to do with the system or hardware (CPU/RAM). so here is what i found out:
CPU info (using the command cat /proc/cpuinfo)
the system has 24 processors and all have in common the following properties.
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
stepping : 2
cpu MHz : 2661.000
cache size : 12288 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida pni monitor ds_cpl vmx smx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 5337.46
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
I'm not familiar with fortran, but in general there can be a few possibilities:
1. the process ends normally (that does not mean it successfully completed its job; maybe it exited because of some error). you need to add traces to the sources to find out what caused it.
2. the process runs into a fatal error (segmentation fault or similar) and stops. there can be some error output related to it; the parent shell should collect such messages.
3. the process got a signal from outside. In this case you will need a signal handler to log the event, but some signals (SIGKILL and SIGSTOP) cannot be caught at all.
4. the host crashed because of a kernel panic or power outage (or similar), but in this case you will see the reboot.
you may start the process with strace and look at the log, but in this case the process will be slow, really slow.
Anyway, you can only get a result by watching that process or by searching in log files.
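One cheap way to "watch the process" without strace is to start it from a small wrapper that waits for it and records how it ended. The following is only a sketch in Python, assuming Python 3 is available on the server; the names describe_exit, run_and_log and watch.log are made up for illustration. It tells apart case 1/2 (the program exited with some status) from case 3 (it was killed by a signal).

```python
import signal
import subprocess
import time


def describe_exit(returncode):
    """Describe a subprocess return code in words.

    subprocess reports death by signal N as the negative value -N.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def run_and_log(cmd, logfile):
    """Run cmd, wait for it, and append start/end records to logfile."""
    with open(logfile, "a") as log:
        log.write(f"{time.ctime()} started: {' '.join(cmd)}\n")
        log.flush()  # make sure the start record survives a later crash
        returncode = subprocess.run(cmd).returncode
        log.write(f"{time.ctime()} {describe_exit(returncode)}\n")
    return returncode


# Hypothetical usage for the program in this thread:
#   run_and_log(["./mx"], "watch.log")
```

The wrapper has to stay alive for the whole run, so it would itself be started with nohup or inside screen. If the wrapper is killed together with the program (for example when every process of the user is terminated), the end record will be missing; the start record and its timestamp are still in the log in that case.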
what the program does is: it is simulating a lattice, consisting of 50x50x50 points in space. it's constructing that lattice, using a 3-dimensional array, and then making changes on it by going through the whole lattice, all this many thousand times. in doing this it's producing 60 files as output, each of these 60 files having a file size of 275 KB.
i'm running the same program at home, on an openSUSE distribution (12.1) which i installed as a 2nd OS besides windows, and it's running "perfectly". at home i never encountered any problems.
so i can surely say that #1 (process ending normally) is out of the question, like #2 (running into a fatal error). sometimes, depending on the change of a variable in the program, it doesn't produce all of the 60 files, only 55 or 50; but even in this case the program/process continues to run without exiting, it only stops producing the output files.
#4 happens very rarely (system reboot after a power outage), but in most cases there was no system reboot.
so it's either an external signal, or (from what i've found by searching google) an "out of memory killer"(?) - in other words, overcommitting memory(?), don't know if that's the correct description of it.
thanks for your replies pan64, The_eXXe and chrism01,
i'm running those programs through "putty" the commands i use are:
"ifort main.f -o mx" for compiling the fortran code and getting an executable with the name "mx", which is the same you can see from the top command snapshot i posted.
and for running the program i use the command : "nohup ./mx&" so that it continues running after i close the terminal window (putty).
@chrism01: i will have a look at that file as soon as possible. hopefully i don't have to be root in order to look into that file.
besides that; is there a way for me to find out, by looking into a log file -or any other way- what killed the process? i mean, is there a file which keeps track of who or what killed a process?
and: if such a way/file exists, how do i get there? (what are the commands i need to use or the files i need to look into?)
If a process was killed because a quota was exceeded then you might get an error message. For some limits that PAM may enforce, such as execution time, you may need to find the answer in the kernel log. If you don't have root access, you may need to ask a sys admin if there were any such messages. There are also file system quotas, and Linux policies that may cause your program to fail.
There are hard limits and soft limits. A soft limit is one you may be able to increase yourself, up to but not exceeding the hard limit. Enter "ulimit -a" to see what your limits are. Enter "help ulimit" for help information.
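The same soft and hard limits can also be read from inside a process, which is handy for writing them into a log at the start of a long run. This is a small sketch using Python's standard resource module; the selection of limits and the helper names fmt and current_limits are my own choice for illustration, not anything the server requires. Limits are inherited by child processes, so it should be run from the same login session that starts the simulation.

```python
import resource

# Limits most likely to stop a long-running job.
LIMITS = {
    "cpu time (seconds)": resource.RLIMIT_CPU,
    "file size (bytes)": resource.RLIMIT_FSIZE,
    "address space (bytes)": resource.RLIMIT_AS,
    "open files": resource.RLIMIT_NOFILE,
}


def fmt(value):
    """Show the special 'no limit' value as the word 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {description: (soft, hard)} for this process, as strings."""
    return {
        label: tuple(fmt(v) for v in resource.getrlimit(which))
        for label, which in LIMITS.items()
    }


for label, (soft, hard) in current_limits().items():
    print(f"{label:24s} soft={soft:>12s} hard={hard}")
```

If the CPU time limit is not "unlimited", it is worth comparing it with the 3-7 days the simulation needs: a process that passes its soft CPU limit is sent SIGXCPU, and one that reaches the hard limit is sent SIGKILL.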
File system quotas may also be configured with soft and hard limits. You can exceed the soft limit temporarily, for a grace period, after which the soft limit is enforced. It's possible for a program to create temporary files, in /tmp, perhaps. The program exceeds the quota, which is enforced 12 or 24 hours later. When the program aborts, the files may be gone.
You might consider using screen instead of nohup. This will allow you to log in later and reattach to the console. If error messages were printed, you can see them. (This won't work if the screen session itself was terminated.)
Last edited by jschiwal; 03-29-2012 at 10:37 AM.
Reason: nohup -> bogus autocorrect fail
as i was running the same program - which is causing me so much trouble at the university - at home too, i got the idea to take a look into the "nohup.out" file which is being produced (probably because i run the nohup command?).
as it was a process which continued running without producing any output files, i had to kill it myself. (depending on a variable it either produces all the 60 output files and terminates itself, or it produces less than 60 files and i have to terminate it myself.)
after i killed the process, the nohup.out file contained the output it would normally print on screen (if i had run it in the foreground instead of with nohup), and in the last few lines some output which says "SIGTERM" plus an error(?) code in front of it.
is it possible to find out from that error code whether linux (the kernel) killed the process, or it was killed because of a CPU/memory limitation, or by a person (root/the system administrator)?
SIGTERM means your process got a signal to terminate. It can be sent either by you (the owner of the process) or by root, but by no one else. "You" means any of your processes: maybe the parent process, or another process. If you could catch the signal, you would have a chance to handle it (SIGTERM can also be ignored).
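To illustrate what "catching" a signal means, here is a minimal sketch in Python. It is only a demonstration of the mechanism: the simulation itself is a Fortran program and would need whatever signal facility its compiler provides, which I do not know. The names received and log_signal are made up for this example.

```python
import os
import signal
import time

received = []  # names of the signals seen by the handler


def log_signal(signum, frame):
    """Record the signal instead of letting it terminate the process."""
    received.append(signal.Signals(signum).name)


# SIGTERM can be caught (or ignored with signal.SIG_IGN).
signal.signal(signal.SIGTERM, log_signal)

# SIGKILL cannot: the attempt to install a handler for it is rejected.
try:
    signal.signal(signal.SIGKILL, log_signal)
    sigkill_catchable = True
except (OSError, ValueError):
    sigkill_catchable = False

# Send SIGTERM to ourselves to show the handler running.
os.kill(os.getpid(), signal.SIGTERM)
deadline = time.time() + 1.0
while not received and time.time() < deadline:
    time.sleep(0.01)

print(received, sigkill_catchable)
```

A plain handler like this one only learns which signal arrived, not who sent it. In C, a handler installed through sigaction with SA_SIGINFO also receives the sender's PID in si_pid, which is the closest a process can get to answering "who killed me" on its own.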
It would be better to show us that message itself instead of just describing it as "some output which says "SIGTERM" plus an error(?) code in front of it".