regression testing FORTRAN code SuSE vs RH 6.4 getting very different answers

rlsmithga · 05-21-2013, 05:31 PM

I have some legacy code I need to port to RH6.4 from SuSE 10 Patch 3.
I have three machines: Suse AMD (SAMD), RH6.4 Intel (RHI) and RH6.4 AMD (RHA). The two AMD machines use the same chipset.

On SAMD I get an answer of -0.108E-00
On RHI and RHA I get an answer of -0.106E+02 ~ two orders of magnitude off!!

The SAMD answer is the correct answer. I presume libm is the difference between SuSE and RH6.4. The code was compiled on RHI and executed on all three machines. I get the same results with gfortran, Intel 13.1.146 and PGI 13.4; in every case the results on SuSE differ from those generated on RH. I compiled with -O0 and the difference still remains. Is there a bug in libm on RH? ldd does not provide much useful information in this case. I'm looking for suggestions as to how to determine why I'm getting such a huge difference and how I can resolve it.

rigor · 05-21-2013, 06:16 PM

Some additional information would help us to help you.

If the Fortran code is not proprietary and not extensive, please post it.

If you can't post it, please categorize what it does.

These are all x86 machines, yes?

The Fortran code is merely making arithmetic calculations, yes?

If yes, are there a lot of calculations involved, or just a few?

John VV · 05-21-2013, 06:47 PM

Offhand, I would guess that it is the difference between the rather old gcc on SLED/SLES 10 and the current gcc 4.6 on RHEL 6.4.

By "legacy code", do you mean f77 or f95?

rlsmithga · 05-23-2013, 05:56 PM

Thank you for the reply. I am unable to post the code due to its proprietary nature. I can tell you the source is written in Fortran 90. I get the same results with gcc, Intel 13.1.146 and PGI 13.4. The only difference I'm able to detect is the OS. RH on AMD and RH on Intel yield the same answers, but they do not match the SuSE answers on AMD. I have no SuSE on Intel, however. I believe I've isolated the difference to a call to matmul, but because I only have an hour or so per day to work on the problem, I cannot yet say for certain that matmul is indeed the problem. Stay tuned and I'll post my progress.
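
One way to test a suspicion like this is to compare the intrinsic against an explicit triple loop on the same inputs, on each machine. The following is only a minimal sketch: the module and function names (matmul_check, naive_matmul, max_abs_diff) are hypothetical and not taken from the code under discussion, and double precision operands are assumed.

```fortran
module matmul_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Reference product computed with plain loops, independent of MATMUL.
  function naive_matmul(a, b) result(c)
    real(dp), intent(in) :: a(:,:), b(:,:)
    real(dp) :: c(size(a, 1), size(b, 2))
    integer :: i, j, k
    c = 0.0_dp
    do j = 1, size(b, 2)
      do k = 1, size(a, 2)
        do i = 1, size(a, 1)
          c(i, j) = c(i, j) + a(i, k) * b(k, j)
        end do
      end do
    end do
  end function naive_matmul

  ! Largest element-wise difference between MATMUL and the reference.
  function max_abs_diff(a, b) result(d)
    real(dp), intent(in) :: a(:,:), b(:,:)
    real(dp) :: d
    d = maxval(abs(matmul(a, b) - naive_matmul(a, b)))
  end function max_abs_diff
end module matmul_check
```

If max_abs_diff is tiny on both machines while the operands themselves already differ, the discrepancy was introduced upstream of the multiplication rather than by matmul.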

suicidaleggroll · 05-23-2013, 06:10 PM

How big is the code? For issues like this I often dive in and print out a few key variables in specific locations in the code, then use those to trace down where the two machines begin to differ in their calculations.
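
As a rough sketch of what I mean (the names trace_util, array_summary and trace_array are made up for illustration, and I'm assuming double precision 2-D arrays), a small helper that prints a full-precision summary line lets you diff the output from the two machines:

```fortran
module trace_util
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Reduce an array to (min, max, sum) so two runs can be compared quickly.
  pure function array_summary(a) result(s)
    real(dp), intent(in) :: a(:,:)
    real(dp) :: s(3)
    s = [minval(a), maxval(a), sum(a)]
  end function array_summary

  ! Print the summary on one labelled line. ES24.16 writes 17 significant
  ! digits, so small differences are not hidden by output rounding.
  subroutine trace_array(label, a)
    character(len=*), intent(in) :: label
    real(dp), intent(in) :: a(:,:)
    real(dp) :: s(3)
    s = array_summary(a)
    write(*, '(a, 3(1x, a, es24.16))') label, 'min=', s(1), 'max=', s(2), 'sum=', s(3)
  end subroutine trace_array
end module trace_util
```

Call trace_array with a distinct label before and after each suspect routine, redirect the output to a file on each machine, and diff the two files. The first line that differs tells you where to look.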

You may find that the problem is due to a bug in the code, perhaps using a 32-bit float where more precision is required, and you're getting a roundoff or floating point approximation error that's then propagating into the final result.
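
To illustrate the kind of precision problem I have in mind, here is a minimal sketch with hypothetical names (precision_demo, sum_single, sum_double) that accumulates the same value in 32-bit and 64-bit reals. It has nothing to do with your actual code; it only shows how roundoff in a long accumulation can grow.

```fortran
module precision_demo
  implicit none
  integer, parameter :: sp = kind(1.0)    ! default (32-bit) real
  integer, parameter :: dp = kind(1.0d0)  ! double precision real
contains
  ! Add 0.1 to a single precision accumulator n times.
  function sum_single(n) result(total)
    integer, intent(in) :: n
    real(sp) :: total
    integer :: i
    total = 0.0_sp
    do i = 1, n
      total = total + 0.1_sp
    end do
  end function sum_single

  ! Same loop with a double precision accumulator.
  function sum_double(n) result(total)
    integer, intent(in) :: n
    real(dp) :: total
    integer :: i
    total = 0.0_dp
    do i = 1, n
      total = total + 0.1_dp
    end do
  end function sum_double
end module precision_demo
```

With n = 10000000 the exact answer is 1.0e6. Printing both results side by side shows how much further the single precision total drifts from it, and an error like that can then propagate into everything computed downstream.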

I've also run into issues where poor programming practices cause different compilers or machines to treat the code differently, yielding drastically different results.

Unfortunately, being in a science-dominated field, I run into this MUCH more often than I should when dealing with code from 3rd parties. They're often written by physicists with little or no programming background who are just hacking their way through a language they barely understand until they get an answer that looks right-ish. The end result is a code that barely works on their machine, and then explodes as soon as you change the compiler or architecture.

rlsmithga · 06-08-2013, 05:05 PM

suicidaleggroll, thank you for the tip about the poor programming skills of certain professionals. After nearly 30 years of working at Sandia National Labs and Los Alamos, poor programming should have been my first thought. Also, I thought I had array bounds checking enabled in my makefile, but I did not. Long story short, after writing out the contents of several variables, I confirmed the difference in values was the result of a call to matmul. I replaced that call with BLAS' gemm. After this change the code generated a segfault. I reviewed the makefile and discovered I had not enabled '-check all' (Intel v13 compiler). With it enabled, I found a line where the code was trying to read the 0th element of an array. The index was generated via some convoluted sequence of math, so the code has been returned to the developer for correction.
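
For anyone who hits the same thing, here is a minimal sketch of the failure mode. The names (bounds_demo, computed_index, index_in_bounds) and the index formula are hypothetical stand-ins, not the proprietary code. Without run-time checking, a read of element 0 of an array whose lower bound is 1 silently returns whatever sits in the adjacent memory, which is why the result can vary from one OS to another.

```fortran
module bounds_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Hypothetical stand-in for an index derived from arithmetic.
  ! INT truncates toward zero, so the result is 0 whenever x*n < 1.
  function computed_index(x, n) result(idx)
    real(dp), intent(in) :: x
    integer, intent(in) :: n
    integer :: idx
    idx = int(x * real(n, dp))
  end function computed_index

  ! True when idx is a valid subscript for a. Inside this function the
  ! assumed-shape dummy a(:) always has lower bound 1.
  pure logical function index_in_bounds(a, idx)
    real(dp), intent(in) :: a(:)
    integer, intent(in) :: idx
    index_in_bounds = idx >= lbound(a, 1) .and. idx <= ubound(a, 1)
  end function index_in_bounds
end module bounds_demo
```

An explicit guard like index_in_bounds is one option, but the compiler's run-time checks catch the same error without touching the source: -fcheck=bounds (or -fcheck=all) with gfortran, and -check bounds (or -check all) with the Intel compiler.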

Again, thank you for the tip.