Strange value of the double type variable: -nan(0x8000000000000)

915086731 · 05-29-2014, 10:50 PM

I am confused by the value of "currdisk->currangle" after an addition. Initially the value of "currdisk->currangle" is 0.77500000000000013, but after the addition it changes to "-nan(0x8000000000000)". Can anyone explain? Thanks! The following is the gdb session.

Code:

3338          currdisk->currangle += (simtime - seg->time) / currdisk->rotatetime;
(gdb) p currdisk->currangle
$28 = 0.77500000000000013
(gdb) p (simtime - seg->time) / currdisk->rotatetime
$29 = 0.00833333333333325
(gdb) p (simtime - seg->time) 
$30 = 0.092592592592591672
(gdb) p currdisk->rotatetime
$31 = 11.111111111111111
(gdb) n

(gdb) p currdisk->currangle 
$32 = -nan(0x8000000000000)
(gdb) p/x (char[8])currdisk->currangle 
$52 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf8, 0xff}
(gdb)

Then I changed

Code:

currdisk->currangle +=  (simtime - seg->time) / currdisk->rotatetime ;

to

Code:

 double tmp1 = (simtime - seg->time) / currdisk->rotatetime; 
currdisk->currangle += tmp1;

After this change, the value of currdisk->currangle is normal. Can anyone explain this confusing behavior?

pan64 · 05-30-2014, 01:29 AM

can you tell us the types of these variables (int, float, double, whatever?)

915086731 · 05-30-2014, 02:21 AM

All of them are of type double.
Here is the assembly code:

Code:

       tmp1 = (simtime - seg->time) / currdisk->rotatetime;
0x0808fdf4  <disk_buffer_sector_done+559>:  fldl   0x80b2208
0x0808fdfa  <disk_buffer_sector_done+565>:  mov    -0x38(%ebp),%eax
0x0808fdfd  <disk_buffer_sector_done+568>:  fldl   (%eax)
0x0808fdff  <disk_buffer_sector_done+570>:  fsubrp %st,%st(1)
0x0808fe01  <disk_buffer_sector_done+572>:  mov    0x8(%ebp),%eax
0x0808fe04  <disk_buffer_sector_done+575>:  fldl   0xc4(%eax)
0x0808fe0a  <disk_buffer_sector_done+581>:  fdivrp %st,%st(1)
0x0808fe0c  <disk_buffer_sector_done+583>:  fstpl  -0x28(%ebp)
      currdisk->currangle += tmp1;
0x0808fe0f  <disk_buffer_sector_done+586>:  mov    0x8(%ebp),%eax
0x0808fe12  <disk_buffer_sector_done+589>:  fldl   0x284(%eax)
0x0808fe18  <disk_buffer_sector_done+595>:  faddl  -0x28(%ebp)
0x0808fe1b  <disk_buffer_sector_done+598>:  mov    0x8(%ebp),%eax
0x0808fe1e  <disk_buffer_sector_done+601>:  fstpl  0x284(%eax)

metaschima · 05-30-2014, 11:20 AM

Try adding a '1.0 *' or '(double)' cast at the beginning of the calculation.

johnsfine · 05-30-2014, 12:46 PM

I think the problem must lie outside the information you posted.

My best guess is that gdb is showing you incorrect values. The compiler stores information to tell the debugger where local variables are stored at various points in the code. The debugger often misunderstands that info and/or the compiler stored it wrong. So currdisk in the first example you posted may not be where the compiler thinks it is.

You also made this harder by showing the disassembly for the version which seems to work, rather than for the version which seems to fail. Also context is required for understanding the behavior: a moderate amount before the point of failure and a little after.

Quote:

Originally Posted by metaschima

Try adding a '1.0 *' or '(double)' cast at the beginning of the calculation.

Random changes around a confusing issue just create more confusion. In the unlikely event you have some real justification for that suggestion, please explain.

metaschima · 05-30-2014, 01:38 PM

If the variables are actually not doubles but integers then it would all make sense. I don't see why these changes are random.

johnsfine · 05-30-2014, 03:25 PM

Quote:

Originally Posted by metaschima

If the variables are actually not doubles

Meaning you didn't see the post 9 hours before your post or didn't believe it?

Quote:

but integers then it would all make sense.

No it wouldn't. If you imagine different values (not different types) for those variables, you can get to a NaN. But the values are in the post, so you can see that even if they were integers, you won't get a NaN, but you can more clearly see they aren't integers.

Quote:

I don't see why these changes are random.

The OP's original code change should not change the result, but he thinks it did. You propose a different code change that also should not change the result. That is not a reasonable step in any systematic search for the cause.

To understand the situation, we probably need to disbelieve something in the original post. Maybe the OP is not accurately telling us what happened. More likely gdb did not accurately tell the OP what happened. But what to disbelieve must be filtered through some common sense and experience:

Code:

3338          currdisk->currangle += (simtime - seg->time) / currdisk->rotatetime;
(gdb) p currdisk->currangle
$28 = 0.77500000000000013
(gdb) p (simtime - seg->time) / currdisk->rotatetime
$29 = 0.00833333333333325
(gdb) p (simtime - seg->time) 
$30 = 0.092592592592591672
(gdb) p currdisk->rotatetime
$31 = 11.111111111111111

None of that looks like what we should consider disbelieving. The gdb output shows those variables are not ints (or at least enough of them are not ints that the suggested cast would make no difference). The gdb output also shows the values are reasonable.

Code:

(gdb) n

There is something I would suspect (given the starting assumption that something must be distrusted). Did gdb really execute all of, and only, the line of code that the post implies was executed at that point? gdb isn't perfect at that. We don't know what mode things were in. Maybe gdb proceeded to much later or (less likely) only part way.

Code:

(gdb) p currdisk->currangle 
$32 = -nan(0x8000000000000)
(gdb) p/x (char[8])currdisk->currangle 
$52 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf8, 0xff}
(gdb)

There is another thing I don't trust. Does gdb still know the location of currdisk? If gdb is wrong about the location of currdisk, it is wrong about the value of currdisk and is showing garbage for currdisk->currangle. I don't trust that we know the true value of currdisk->currangle there.

metaschima · 05-30-2014, 03:42 PM

Then why not use printf instead of relying on gdb, that should exclude gdb from being the cause.

915086731 · 05-30-2014, 10:59 PM

Quote:

Originally Posted by metaschima

Then why not use printf instead of relying on gdb, that should exclude gdb from being the cause.

printf also shows NaN.

915086731 · 05-30-2014, 11:29 PM

Code:

tmp1 = (simtime - seg->time) / currdisk->rotatetime;
currdisk->currangle += tmp1;

The code above can also produce NaN on a later call.

So I stepped into the assembly.

Code:

tmp1 = (simtime - seg->time) / currdisk->rotatetime;
0x0808fdf4  <disk_buffer_sector_done+559>:  fldl   0x80b2208  //address of simtime
0x0808fdfa  <disk_buffer_sector_done+565>:  mov    -0x38(%ebp),%eax
0x0808fdfd  <disk_buffer_sector_done+568>:  fldl   (%eax)

Here, the content of %eax is 0x80bdfeb, which is the address of seg->time.
After the "fldl (%eax)" above executes, the content of register st0 is 0x8000000000000000, which represents NaN, so the NaN propagates to the following instructions. This is the key issue.

Code:

0x0808fdff  <disk_buffer_sector_done+570>:  fsubrp %st,%st(1)
0x0808fe01  <disk_buffer_sector_done+572>:  mov    0x8(%ebp),%eax
0x0808fe04  <disk_buffer_sector_done+575>:  fldl   0xc4(%eax)
0x0808fe0a  <disk_buffer_sector_done+581>:  fdivrp %st,%st(1)
0x0808fe0c  <disk_buffer_sector_done+583>:  fstpl  -0x28(%ebp)

pan64 · 05-31-2014, 02:02 AM

It looks like "everything is OK and correct except the result", so I would like to see that everything. I could not reproduce it. You could try to prepare a small but complete program so that we can check it (and while preparing it you may find the cause yourself).

johnsfine · 05-31-2014, 05:13 AM

I would rerun with a data breakpoint at the address of seg->time so you see each time it changes and can see where it changes to NaN.

If I take into account the claim that the problem was temporarily fixed by a code change that should have had no effect, then it sounds like a memory clobber: an unrelated section of code storing something into the location of seg->time when it was supposed to be storing somewhere else.

But the info in the first post doesn't seem to be consistent with the info in post #10, so I still think there is a big gap either between reality and what is reported by gdb or between what is reported by gdb and what was copied to this thread.

915086731 · 05-31-2014, 09:05 AM

Quote:

Originally Posted by johnsfine

But the info in the first post doesn't seem to be consistent with the info in post #10, so I still think there is a big gap either between reality and what is reported by gdb or between what is reported by gdb and what was copied to this thread.

The project reads requests and processes them.
In post #1, I introduced a temporary variable tmp1, which made the NaN go away. In post #10, I had added more requests to the input, and the NaN occurred again. That means the tmp1 variable no longer avoids the issue once the set of requests changes.

pan64 · 05-31-2014, 09:16 AM

I think (but I cannot say I'm sure about it) that you mixed up your variables: you used the same name twice, or you used two different structs for the same thing, or there is a problem with their scope, or maybe an alignment problem or an out-of-bounds array subscript. You may try valgrind; it can find that kind of issue. Is this a multi-threaded app?

915086731 · 05-31-2014, 09:22 AM

It's a single-threaded project.
seg->time is not changed at all, so I can't find any evidence that the memory of seg->time is being corrupted.