Why does the program terminate?

manojg · 06-06-2004, 04:21 PM

Hi,

I was running a Fortran program on a Red Hat Linux system. It was supposed to take about 6.5 hours, but it terminated early with an incomplete set of data.

I thought it was due to a CPU time limit, so I checked with the command "ulimit -t". The time limit is "unlimited" because it is my own computer.

Then I ran another Fortran program. It was supposed to take about 7.0 hours. It ran for the full time and produced complete data.

So I am puzzled. I could not fix this problem. Could you please help me fix it? I am using a Pentium IV.

Thank you very much.

Manoj Gupta

jailbait · 06-06-2004, 04:47 PM

"Could you please help me to fix this."

I don't know why your program terminated early.

You should consider putting checkpoints in your program, say a checkpoint every half hour where you write your intermediate results to a disk file. Then add the ability for the program to resume running from any checkpoint. That way, if the program fails during a long run, you can resume from the last checkpoint and lose at most half an hour of work (about 15 minutes on average).
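A minimal sketch of the checkpoint idea, written in free-form Fortran. Everything specific here is an assumption made for illustration: the file name checkpoint.dat, the unit number 10, the constants nsteps and every, and the running sum that stands in for the real computation. It also checkpoints every fixed number of iterations rather than every half hour, which is simpler to show.

```fortran
program checkpoint_sketch
  implicit none
  integer, parameter :: nsteps = 2000000   ! assumed total amount of work
  integer, parameter :: every  = 100000    ! assumed checkpoint interval
  integer :: k, kstart, ksaved, ios
  double precision :: total, tsaved
  logical :: found

  kstart = 1
  total  = 0.0d0

  ! Resume from the checkpoint file if an earlier run left one behind
  inquire(file='checkpoint.dat', exist=found)
  if (found) then
     open(unit=10, file='checkpoint.dat', status='old', action='read')
     read(10, *, iostat=ios) ksaved, tsaved
     close(10)
     if (ios == 0) then
        kstart = ksaved + 1
        total  = tsaved
     end if
  end if

  do k = kstart, nsteps
     total = total + 1.0d0 / dble(k)       ! stand-in for the real work
     if (mod(k, every) == 0) then
        ! Overwrite the checkpoint with the latest completed step
        open(unit=10, file='checkpoint.dat', status='replace', action='write')
        write(10, *) k, total
        close(10)
     end if
  end do

  print *, 'finished, total =', total
end program checkpoint_sketch
```

If the run is interrupted, starting the program again picks up after the last step recorded in the file. Delete the file to force a fresh run.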

___________________________________
Be prepared. Create a LifeBoat CD.
http://users.rcn.com/srstites/LifeBo...home.page.html

Steve Stites

manojg · 06-07-2004, 12:52 PM

Hi Steve ,

Thank you for your suggestion. This can help save time.

Actually, I am curious to know why the program terminates. Let me describe the program a little.

In the program there are two do loops, like:

do 10 i = 1.1, 2.0, 0.1
do 20 j = 1, 200000

So it should produce 200000 points for each value of i (1.1, 1.2, ..., 2.0). But it produced points for i = 1.1, ..., 1.9 only, so there were 1800000 points in total instead of 2000000.

I ran the same program on three different computers. On all of them it produced the same number of points (1800000).

In the same program, I then just changed the range of the do loop, like:

do 10 i = 0.0, 1.0, 0.1
do 20 j = 1, 200000

In this case, it produced all points (2200000), although this took more time.
So I am puzzled why the same program behaves differently.

I appreciate your help.

Manoj

jailbait · 06-07-2004, 02:14 PM

"I am puzzled why the same program behaves differently."

I do not think that you have enough information to solve the problem. You need to run some tests to find the error. Obviously you do not want to run for hours just for a test. So I suggest that you try this test:

do 10 i = 1.9, 2.0, 0.1
do 20 j = 1, 200000

You also need to get some meaningful error messages. You can do this by placing some diagnostic messages in the program and recompiling it. Give some thought to where you place the diagnostic messages so that you are not flooded with millions of messages.
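As a rough sketch, the short test with diagnostics could look like the following free-form Fortran program. The names x and npts are invented for illustration, the loop variable is assumed to be real-typed, and the inner loop body is only a stand-in for the real computation. Real loop variables were removed from the language in Fortran 95, so a newer compiler may warn about the outer loop.

```fortran
program short_test
  implicit none
  real :: x            ! assumed: the outer loop variable is real-typed
  integer :: j, npts   ! npts counts how many points would be produced

  npts = 0
  do x = 1.9, 2.0, 0.1
     ! One diagnostic line per outer pass, not one per point
     print *, 'outer pass starts, x =', x
     do j = 1, 200000
        npts = npts + 1      ! stand-in for the real computation
     end do
  end do
  print *, 'total points =', npts
end program short_test
```

Printing once per outer pass gives at most a handful of lines, and the final count shows directly whether the pass for 2.0 was executed.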

---------------------
Steve Stites

Dark_Helmet · 06-07-2004, 02:37 PM

Quote:

Originally posted by manojg
<snip>
In the program there are two do loops. like:

do 10 i = 1.1, 2.0, 0.1
do 20 j = 1, 200000

So, it should produce 200000 points ...
</snip>

<snip>
In the same program, I just change the range of the do loop like:

do 10 i = 0.0, 1.0, 0.1
do 20 j = 1, 200000

In this case, it produced all points ...
</snip>

I'm no expert in FORTRAN (I used it many, many moons ago), but my instinct says that these two loops are not equivalent.

I'm assuming that do loops are of the form: variable = start, stop, increment

If that is the case, your start and stop conditions are not identical. In the first example (the one that produces 1800000 points), your start and stop are 1.1 and 2.0 respectively (meaning a difference of 2.0 - 1.1 = 0.9; an increment of 0.1 means the nested do-loop will be executed 0.9 / 0.1 = 9 times).

For your second example, start and stop are 0.0 and 1.0 respectively (meaning a difference of 1.0 - 0.0 = 1.0; an increment of 0.1 means the nested do-loop would be executed 1.0 / 0.1 = 10 times).

So, having analyzed that, I assume your nested do-loop generates the "data points" (one data point for each time through the loop). So the first example would generate 200000 * 9 = 1800000 data points. The second example would generate 200000 * 10 = 2000000 data points.

Again, my understanding might be slightly skewed from not using FORTRAN in a while, but my gut tells me your problem lies in having different do-loop ranges.

jailbait · 06-07-2004, 03:36 PM

Dark_Helmet

A Fortran do loop executes both the start and stop points, so the number of loop iterations is ((stop - start)/increment) + 1 when (stop - start) is evenly divisible by increment. When (stop - start) is not evenly divisible by increment, the quotient is truncated to an integer before the 1 is added.
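The Fortran standard defines the iteration count of a do loop as MAX(INT((m2 - m1 + m3)/m3), 0), where m1, m2 and m3 are the start, stop and increment. The sketch below evaluates that expression for the two ranges discussed in this thread. The helper name ntrips is invented for illustration. Because 1.1 and 0.1 have no exact binary representation, the quotient is only approximate before it is truncated, so the printed counts are worth checking on the machine in question rather than assuming 10 and 11.

```fortran
program trip_count
  implicit none

  print *, 'range 1.1, 2.0, 0.1 ->', ntrips(1.1, 2.0, 0.1)
  print *, 'range 0.0, 1.0, 0.1 ->', ntrips(0.0, 1.0, 0.1)

contains

  ! Hypothetical helper: iteration count for "do v = first, last, step"
  integer function ntrips(first, last, step)
    real, intent(in) :: first, last, step
    ! Truncate toward zero, and never return a negative count
    ntrips = max(int((last - first + step) / step), 0)
  end function ntrips

end program trip_count
```

The result can depend on the compiler and on the precision used for intermediate values, so no expected output is given here.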

------------------------------------------
Steve Stites

Dark_Helmet · 06-07-2004, 04:08 PM

My apologies. I should think twice before getting into something I'm not familiar with.

Admittedly, I skimmed the post and it seemed to me the perceived problem was the number of data points compared to the do-loop ranges. When I saw the ranges were different, I thought that was the obvious core of the problem (just a simple coding mistake that I fall prey to sometimes).

That being the case, my only suggestion would be to look through the block of code and double-check your references to the "i" variable. Assuming that both examples used the exact same code and the first fails while the second works... that tells me the do-loop range for "i" (the only thing that changed) is causing your problem (or perhaps another variable that takes its value from i).
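One possible way to test that suspicion is to take the real-valued range out of the loop control: drive the outer loop with an integer counter and compute the real value from it. This is only a sketch in free-form Fortran. The names n, x and npts are invented for illustration, and the inner loop body stands in for whatever the real program does with the value.

```fortran
program integer_outer_loop
  implicit none
  integer :: n, j, npts
  real :: x

  npts = 0
  do n = 11, 20                 ! integer loop control: exactly 10 passes
     x = 0.1 * real(n)          ! approximately 1.1, 1.2, ..., 2.0
     do j = 1, 200000
        npts = npts + 1         ! stand-in for the work that uses x
     end do
  end do
  print *, 'last x =', x
  print *, 'total points =', npts   ! 10 * 200000 = 2000000
end program integer_outer_loop
```

If this version produces all 2000000 points while the original stops at 1800000, the difference lies in how the real-valued range is turned into an iteration count.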

Other than that, I'll quietly excuse myself from the discussion.