[SOLVED] lock in libpthread occurs only on one Arch installation only with gcc-fortran
So, I have an unusual problem that I first thought was a gfortran compiler bug. As far as I can tell, however, it is specific to one machine's Arch install, and I can't reproduce it on another machine with Debian or Manjaro with the same kernel and compiler (else I'd report it on GCC bugzilla).
So I'm posting it in General; it's a puzzle!
Using either 5.7.x, 5.8.x, or 5.9.x kernels on Arch and GNU Fortran (GCC) 10.2.0, we have a program calling a function as part of a write statement, where the function also has a write statement.
Code:
! bugs.f90
PROGRAM bugs
  USE badwrite
  x = AC(0)
  write(*,*) 'x: ', x      ! this works
  write(*,*) '0: ', AC(0)  ! this does not
  STOP
END

! badwrite.f90
MODULE badwrite
CONTAINS
  function AC(m2) result(c)
    INTEGER, INTENT(IN) :: m2
    write(*,*) m2  ! killer statement with lapack or other linked library
    c = m2 + 3
  end function AC
END MODULE badwrite
compiled with
gfortran -c -llapack badwrite.f90
gfortran -llapack badwrite.f90 bugs.f90
should result in
0
x: 3.00000000
0: 0
3.00000000
However with the -llapack library (or the blas library, and possibly other external libraries)
the result is
0
x: 3.00000000
(program hangs here)
Adding the -ggdb flag, running in gdb and interrupting with ^C results in
Code:
Starting program: /home/me/build/bugs.lib90/a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0
x: 3.00000000
^C
Program received signal SIGINT, Interrupt.
0x00007ffff6baddb0 in __lll_lock_wait () from /usr/lib/libpthread.so.0
Apparently the two writes are deadlocked?
Conditions:
(1) Only happens with an external library linked in.
(2) Removing the write statement in the function AC also removes the hang.
(3) Separating the writes, as in the first statement with x=AC(0) then write(*,*) x, removes the deadlock/hang
(4) Another machine running the same kernel/gfortran version under Manjaro does not have the hang
(5) The problem does not occur with pgfortran (aka nvfortran) 20.7-0 LLVM on the machine in question.
(6) Changing kernels on the same machine does not solve the problem.
(7) Reinstalling packages with pacman, rebooting does not solve the problem.
In conclusion, it does seem to be a gcc-fortran bug with a race condition, but what triggers it on this machine is beyond me. I'd rather not reinstall the whole system, which is otherwise working perfectly.
Any ideas? I'm going to boot a live version of Manjaro on this hardware to see if it's a weird CPU bug.
I wouldn't say it is a CPU bug. Also I wouldn't try to reinstall the system - without knowing the reason.
You need to get (and post) the full stack trace, not only the current line. It would also be nice to see the other threads (if there are any).
strace also may help to gather info....
Thank you very much for replying. I was not hopeful, thinking I'd just have someone post that I had a broken system and should reinstall!
I did try booting the live version, and as expected by you and me, the bug vanished.. so not a hardware issue.
Could you tell me how to get the full stack trace, or point me to some relatively easy to understand instructions? I tried compiling with -fbacktrace, but have no idea what to do with it. I also looked at https://gcc.gnu.org/onlinedocs/gcc-4...ugging-Options, which I found overwhelming.
I don't know what you mean by other threads. As you might surmise, I'm a bit of a Neanderthal coder (no insult meant to our ancestors). I found this bug/problem by putting "write" statements in my code as a means of debugging some code. So my level of debugging sophistication is limited.
I also have no idea how to use strace. I ran the fortran executable in one terminal, found its PID in another terminal and ran strace -p {the process PID} with the result
will save the strace output of the full execution. You can Ctrl^C as usual and check the result.
here you can find some info about backtrace (sometimes called stack trace): https://senarvi.github.io/stack-trace-with-gdb/
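For a process that is already hung, one way to collect the stack of every thread is to attach gdb to it non-interactively. This is only a minimal sketch, assuming gdb is installed and you are allowed to attach to the process (some systems restrict attaching to root). The function name dump_backtraces and the GDB override variable are made up for this example.

```shell
#!/bin/sh
# dump_backtraces PID
# Attach to a running process, print the backtrace of every thread, then detach.
# GDB may be set to point at a different gdb binary (hypothetical convenience knob).
dump_backtraces() {
    ${GDB:-gdb} -p "$1" -batch -ex "thread apply all bt"
}
```

In another terminal, find the PID of the hung ./a.out (for example with pgrep a.out) and pass it to dump_backtraces. Inside an interactive gdb session, typing bt and then thread apply all bt after the ^C should give the same information.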
by the way if you could run it from the live version:
I would compare the libraries used (see ldd)
Here's the result of strace -o stack.txt -f ./a.out
I'm not sure why the AMDAPPSDK is involved, I do not believe it is installed anymore, though it was at some point in the past. [EDIT: just an old environment statement still in .bashrc under LD_LIBRARY_PATH]
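A leftover LD_LIBRARY_PATH entry like this one is easy to miss. The sketch below walks a colon-separated path list and reports which directories still exist. The function name check_lib_path is invented for illustration, and it only checks for the directory, not for what libraries are inside it.

```shell
#!/bin/sh
# check_lib_path PATHLIST
# Print every directory in a colon-separated list and say whether it exists.
# The body runs in a subshell so the IFS change does not leak to the caller.
check_lib_path() (
    IFS=:
    for dir in $1; do
        [ -n "$dir" ] || continue      # skip empty entries such as "a::b"
        if [ -d "$dir" ]; then
            printf '%-8s %s\n' ok "$dir"
        else
            printf '%-8s %s\n' missing "$dir"
        fi
    done
)

# Inspect the current environment (prints nothing if the variable is unset).
check_lib_path "${LD_LIBRARY_PATH:-}"
```

To find where the variable is set, something like grep -n LD_LIBRARY_PATH ~/.bashrc is usually enough.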
I checked ldd a.out on the other computer running Manjaro, and got the same result without libcblas or the last three entries, all of which are probably related to libatlas installed on this machine but not the other one.
Last edited by mostlyharmless; 11-28-2020 at 12:53 PM.
Reason: found something out
#0 0x00007ffff6baedb0 in __lll_lock_wait () from /usr/lib/libpthread.so.0
#1 0x00007ffff6ba7743 in pthread_mutex_lock () from /usr/lib/libpthread.so.0
#2 0x00007ffff7852940 in __gthread_mutex_lock (__mutex=0x55555555b950) at ../libgcc/gthr-default.h:749
#3 get_gfc_unit (n=6, do_create=1) at /build/gcc/src/gcc/libgfortran/io/unit.c:395
#4 0x00007ffff785105d in data_transfer_init (dtp=0x7fffffffdef0, read_flag=0) at /build/gcc/src/gcc/libgfortran/io/transfer.c:2851
#5 0x0000555555555215 in __badwrite_MOD_ac ()
#6 0x0000555555555391 in MAIN__ ()
#7 0x0000555555555412 in main ()
with the additional lines below showing only one thread, as expected.
Code:
Program received signal SIGINT, Interrupt.
0x00007ffff6baedb0 in __lll_lock_wait () from /usr/lib/libpthread.so.0
Thread 1 (Thread 0x7ffff6b9b2c0 (LWP 68386) "a.out"):
So, a lot more information, thanks again for your help. Very interesting stuff in these traces, but to me, still pretty opaque.
Last edited by mostlyharmless; 11-28-2020 at 12:38 PM.
Reason: additional info
ok, thanks
It would be nice to see what this strace looks like when it works. I do not really understand why futex was involved.
The other thing you can check is the libraries:
whether the libs used by a.out are identical in the two cases (the output of ldd). As a first step, compare their sizes with ls -l /usr/lib/whatever
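One way to make that comparison less error-prone is to save the ldd output from each system to a file and diff only the library names. A rough sketch, assuming the usual ldd output format where the library name is the first field on each line; the function names ldd_names and ldd_diff are invented here:

```shell
#!/bin/sh
# Print the library names (first field) from ldd output read on stdin, sorted.
ldd_names() {
    awk '{ print $1 }' | sort -u
}

# ldd_diff FILE1 FILE2
# Show names that appear in only one of two saved ldd listings.
# Unindented lines are only in FILE1; tab-indented lines are only in FILE2.
ldd_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    ldd_names < "$1" > "$a"
    ldd_names < "$2" > "$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

Run ldd ./a.out > arch.txt on the affected install and ldd ./a.out > manjaro.txt on the working one, then ldd_diff arch.txt manjaro.txt. This only compares names, so a library with the same name but different contents still needs the ls -l check.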
you are right. As you can see there is no futex call in that strace log. That would be a good starting point to find that bug (not for you, just in general).
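To check several strace logs for futex calls at once, a small helper like the following may do. It is a sketch only; the name futex_summary and the log file names in the usage note are placeholders.

```shell
#!/bin/sh
# futex_summary LOG...
# For each strace log, print how many lines contain a futex() call.
futex_summary() {
    for log in "$@"; do
        n=$(grep -c 'futex(' "$log")
        printf '%s: %s futex calls\n' "$log" "$n"
    done
}
```

For example, record strace -o hang.txt -f ./a.out on the install that hangs (interrupt it with Ctrl-C) and strace -o ok.txt -f ./a.out on the one that works, then run futex_summary hang.txt ok.txt.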
Ultimately the reason given for this bug was that the nested "write (*,*)" statements made by the direct function call to AC are considered "recursive" calls to the write(*,*) function. Apparently the Fortran standard does not allow recursive I/O, so the behaviour of the program is undefined... I would make the distinction between a nested call and a recursive call, but apparently the writers of the gfortran compiler do not interpret the Fortran standard that way.
While that is a technically valid answer, it's not a particularly graceful way for the gfortran compiler to "handle" (or "not handle") the situation, particularly since it only occurs with linkage to a particular library! It just seems sloppy. This lack of grace seems to me to be a development that coincides temporally with the fork/rewrite of gfortran back in the early 2000s. I had a similar non-excuse from bugzilla for the -fbounds-check option not working in some circumstances. Apparently if your code has errors (like an out of bounds error) the compiler is free to hang the process or do whatever -- even if the explicit purpose of -fbounds-check is to check for out of bounds errors in your code. In fact, one responder on bugzilla said that if given incorrect code the compiler was free to reformat your disk. Perhaps it was said somewhat in jest, but that seems to reflect the underlying attitude accurately. I can't know for sure, but it seems to me that the writers of g77 would never have had such a notion, or at least would not have proffered it as a defense.
As an aside and a bit of a rant or whine: the writers of g77 were old, like me; the discourtesy of the culture is wearing on me.