regression testing FORTRAN code SuSE vs RH 6.4 getting very different answers
I have some legacy code I need to port to RH6.4 from SuSE 10 Patch 3.
I have three machines: SuSE on AMD (SAMD), RH6.4 on Intel (RHI) and RH6.4 on AMD (RHA). The two AMD machines use the same chipset.
On SAMD I get an answer of -0.108E-00
On RHI and RHA I get an answer of -0.106E+02 ~ two orders of magnitude off!!
The SAMD answer is the correct answer. I presume libm is the difference between SuSE and RH6.4. The code was compiled on RHI and executed on all three machines. I get the same results with gfortran, Intel 13.1.146 and PGI 13.4; in every case the results on SuSE differ from those generated on RH. I compiled with -O0 and the difference still remains. Is there a bug in libm on RH? ldd does not provide much useful information in this case. I'm looking for suggestions on how to determine why I'm getting such a huge difference and how I can resolve it.
Thank you for the reply. I am unable to post the code due to its proprietary nature. I can tell you the source is written in Fortran 90. I get the same results with gcc, Intel 13.1.146 and PGI 13.4. The only difference I'm able to detect is the OS. RH on AMD and RH on Intel yield the same answers, but they do not match the SuSE answers on AMD. I have no SuSE on Intel, however. I believe I've isolated the difference to a call to matmul, but because I only have an hour or so per day to work on the problem, I cannot say for certain that matmul is indeed the problem. Stay tuned and I'll post my progress.
How big is the code? For issues like this I often dive in and print out a few key variables in specific locations in the code, then use those to trace down where the two machines begin to differ in their calculations.
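One way to automate that comparison is sketched below. It is only an illustration and assumes a hypothetical dump format: the Fortran code writes one "label value" line at each checkpoint, and the same run is captured on both machines. The function names, labels and sample numbers are all made up for the example; Python is used here only as a convenient host-side tool.

```python
import math

def parse_dump(lines):
    """Parse 'label value' lines into a list of (label, float) pairs."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label, text = parts
        # Fortran may print double-precision exponents with 'D' (1.0D+00);
        # Python's float() only accepts 'E'.
        records.append((label, float(text.replace("D", "E").replace("d", "e"))))
    return records

def first_divergence(ref_lines, test_lines, rel_tol=1e-6):
    """Return (label, ref_value, test_value) for the first mismatch, or None."""
    pairs = zip(parse_dump(ref_lines), parse_dump(test_lines))
    for (label_a, a), (label_b, b) in pairs:
        if label_a != label_b:
            raise ValueError(f"dumps out of step: {label_a!r} vs {label_b!r}")
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12):
            return label_a, a, b
    return None

# Invented sample data standing in for the output of two machines.
machine_a = ["step1 1.000000E+00", "step2 -3.250000E-01", "step3 2.500000E+00"]
machine_b = ["step1 1.000000E+00", "step2 -4.410000E+01", "step3 9.750000E+01"]

print(first_divergence(machine_a, machine_b))
```

The first label reported is where to add finer-grained prints on the next pass, narrowing the search until a single statement or call is isolated.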
You may find that the problem is due to a bug in the code, perhaps using a 32-bit float where more precision is required, and you're getting a roundoff or floating point approximation error that's then propagating into the final result.
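As a rough illustration of how that kind of roundoff accumulates, the sketch below repeatedly adds 0.1 while rounding the running total to 32 bits after every addition, and compares that with ordinary 64-bit accumulation. It is not taken from the code under discussion; the helper names are invented, and single precision is emulated with Python's `struct` module rather than a real Fortran `REAL(4)` variable.

```python
import struct

def to_f32(x):
    """Round a 64-bit Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def repeated_add(step, count, single_precision):
    """Add `step` to a running total `count` times.

    With single_precision=True, the step and the running total are
    rounded to 32 bits after every addition, emulating a 32-bit sum.
    """
    if single_precision:
        step = to_f32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_f32(total)
    return total

single = repeated_add(0.1, 1_000_000, single_precision=True)
double = repeated_add(0.1, 1_000_000, single_precision=False)

print("32-bit accumulation:", single)
print("64-bit accumulation:", double)
print("difference:", single - double)
```

The exact answer is 100000. The 64-bit sum lands very close to it, while the 32-bit sum drifts visibly, because once the total is large each added 0.1 is rounded to the nearest value the 32-bit total can represent. An error of that kind fed into later calculations can grow much larger.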
I've also run into issues where poor programming practices cause different compilers or machines to treat the code differently, yielding drastically different results.
Unfortunately, being in a science-dominated field, I run into this MUCH more often than I should when dealing with code from 3rd parties. They're often written by physicists with little or no programming background who are just hacking their way through a language they barely understand until they get an answer that looks right-ish. The end result is a code that barely works on their machine, and then explodes as soon as you change the compiler or architecture.
Last edited by suicidaleggroll; 05-23-2013 at 06:13 PM.
suicidaleggroll, thank you for the tip about the poor programming skills of certain professionals. After nearly 30 years of working at Sandia National Labs and Los Alamos, poor programming should have been my first thought. Also, I thought I had array bounds checking enabled in my make file, but I did not. Long story short, after writing out the contents of several variables, I confirmed the difference in values was the result of a call to matmul. I replaced that call with BLAS' gemm. After this change the code generated a segfault. I reviewed the make file, discovered I had not enabled 'check all' (Intel v 13 compiler), and with it enabled found a line where the code was trying to read the 0th element of an array. The index was generated via some convoluted sequence of math, so the code has been returned to the developer for correction.