Originally Posted by bharatbsharma
I would like to give a few more details which can help pinpoint the root cause.
If you have a system running in nearly the same condition as it was when it failed, you can use tools such as top to get an idea of the basic nature of the failure.
Obviously there is some circular reasoning there. You don't know what mattered to the failure so you don't really know what "near the same condition" means in my instructions above. Sometimes you just need to guess and look and think about what you see.
In top, you might try typing Fo to bring processes with very high VIRT to the top. Is your tcl process one of them? Does its VIRT keep growing as it runs?
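If watching top by hand gets tedious, the same "does VIRT keep growing" check can be scripted. Here is a minimal Python sketch, assuming a Linux system where /proc/<pid>/status has a VmSize line (which should correspond to what top shows as VIRT). The function names, sample count and interval are just my own illustrative choices, nothing from your setup.

```python
import re
import time


def parse_vmsize_kb(status_text):
    """Return VmSize in kB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmSize:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vmsize_kb(pid):
    """Read the current virtual size of a process (Linux only, needs /proc)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmsize_kb(f.read())


def is_steadily_growing(samples):
    """True if every sample is strictly larger than the one before it."""
    return len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:]))


def sample_vmsize(pid, count=5, interval_s=2.0):
    """Collect `count` VmSize samples, `interval_s` seconds apart.

    Assumes an ordinary user process; kernel threads have no VmSize line.
    """
    samples = []
    for _ in range(count):
        samples.append(read_vmsize_kb(pid))
        time.sleep(interval_s)
    return samples
```

You would call something like is_steadily_growing(sample_vmsize(pid)) with the pid of the tcl process. A few growing samples prove nothing by themselves, but growth that never levels off over a long run is a strong hint of a leak.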
In top you will also see how much swap space you have and how much of it is free. If you have a few GB of swap space free, or a large fraction of your Mem shown as "cached", a system-wide commit limit problem is unlikely and your problem is more likely a memory leak inside the tcl code. If you have very little swap space free, you might want to increase swap space anyway for safety, and it might also be the fix to your problem.
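The swap and cached figures can also be pulled out of /proc/meminfo if you want to log them while the tests run. This is only a rough sketch, assuming a Linux /proc/meminfo with the usual MemTotal, Cached and SwapFree fields; the function names are made up for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return free swap in kB and the fraction of total memory shown as cached."""
    mem_total = info.get("MemTotal", 0)
    cached = info.get("Cached", 0)
    return {
        "swap_free_kb": info.get("SwapFree", 0),
        "cached_fraction": cached / mem_total if mem_total else 0.0,
    }


def read_summary(path="/proc/meminfo"):
    """Read the live values (Linux only)."""
    with open(path) as f:
        return summarize(parse_meminfo(f.read()))
```

Calling read_summary() every so often during a test run and noting the values just before a failure would tell you whether free swap was actually running out at that point.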
I call a proc in my tcl/expect framework. Every time this proc is called I spawn a bash. That is, if I execute 1000 test cases this proc will be called 1000 times. ...
Is the error mentioned above due to not managing "spawn bash" properly?
I barely know anything about TCL, certainly not enough to help you.
If the problem is a memory leak in the TCL stuff, rather than a system wide commit limit issue, you'll need help from someone who knows more about TCL.
Originally Posted by Valery Reznic
On the other hand, how many of the shells you spawned run simultaneously?
Did one have a chance to finish its job before another started?
Or maybe that's the answer.
Maybe you are simply starting too many of whatever those bash instances run at once. Memory could be exhausted for that reason, or you might exhaust some other kernel resource that couldn't even be covered by adding swap space but might still appear to be memory when it fails.
Maybe when it works, you're just lucky that the TCL code spawning things doesn't get enough CPU time (competing against all the things it already spawned) to start too many more before some early ones finish. Maybe it fails if the TCL code happens to get a slightly bigger share of CPU time.
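To illustrate what "not starting too many at once" could look like, here is a minimal sketch in Python rather than Tcl (I can't write the Tcl version). It caps the number of children running at the same time and waits for each one to finish so it is properly reaped. The names run_bounded, max_running and noop_commands are hypothetical and only for illustration; your framework would need its own equivalent.

```python
import subprocess
import sys


def run_bounded(commands, max_running=4):
    """Run each command (an argv list) with at most `max_running` alive at once.

    Returns (exit_codes, peak), where exit_codes is in the order of `commands`
    and peak is the largest number of children that were running together.
    """
    running = {}  # index into `commands` -> Popen object
    exit_codes = [None] * len(commands)
    peak = 0
    for i, argv in enumerate(commands):
        # At the limit: wait for the oldest child before starting another.
        while len(running) >= max_running:
            oldest = next(iter(running))
            exit_codes[oldest] = running.pop(oldest).wait()
        running[i] = subprocess.Popen(argv)
        peak = max(peak, len(running))
    # Reap whatever is still running at the end.
    for i, proc in running.items():
        exit_codes[i] = proc.wait()
    return exit_codes, peak


def noop_commands(n):
    """Hypothetical stand-in for real test cases: n children that exit at once."""
    return [[sys.executable, "-c", "pass"] for _ in range(n)]
```

For example, run_bounded(noop_commands(1000), max_running=8) starts 1000 children in total but never has more than 8 alive together. Waiting for the oldest child rather than whichever finishes first is a simplification to keep the sketch short.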