I have a Gigabyte X79-UD3 with an Intel i7-3930K and two GTX 680s, running Debian amd64 at the Linux prompt (no X server), used for number crunching. For a few days now I have been having problems carrying out long computations which previously gave no trouble. I am wondering whether this is a hardware problem, so I would like to understand which physical card is which of the two GPUs.
First, here is how the computation is launched with respect to the GPUs (from the log file of the computational code, where gig64 is the machine name):
Pe 1 physical rank 1 will use CUDA device of pe 2
Pe 4 physical rank 4 binding to CUDA device 1 on gig64: 'GeForce GTX 680' Mem: 2047MB Rev: 3.0
Pe 3 physical rank 3 will use CUDA device of pe 4
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX 680' Mem: 2047MB Rev: 3.0
Did not find +devices i,j,k,... argument, using all
Pe 0 physical rank 0 will use CUDA device of pe 2
Info: 8.22656 MB of memory in use based on /proc/self/stat
Second, the error message, arising after some 50,000 to 600,000 steps of normal computation:
FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (gig64 device 0): unspecified launch failure.
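For context, as far as I understand it, "unspecified launch failure" is the text the CUDA runtime attaches to cudaErrorLaunchFailure, and it only surfaces at the first synchronisation point after an asynchronous kernel launch. A minimal sketch of that pattern (the kernel here is a hypothetical stand-in, not NAMD's actual code):

Code:
// Sketch: how "unspecified launch failure" typically surfaces.
// dummyKernel is a hypothetical stand-in for the real workload.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // trivial work
}

int main(void) {
    const int n = 1 << 20;
    float *d_data = NULL;
    cudaMalloc(&d_data, n * sizeof(float));

    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Launches are asynchronous; a hardware fault during the kernel
    // shows up here as cudaErrorLaunchFailure, whose error string is
    // "unspecified launch failure".
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}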
Third, the information I was able to gather about the hardware:
nvidia-smi is cryptic about Kepler, as far as I can tell. Linux itself does not tell me anything about the GPUs, as far as I could understand.
My specific question is how to link the launch failure to information about the GPUs. I expect there are other commands that would reveal more. In short, I am trying to work out which physical GPU produced the Pe 2 error, so that I can replace the card if there is any indication that it is damaged.
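To make the question concrete, this is the kind of mapping I am after (a sketch I put together from the CUDA runtime documentation; cudaGetDeviceProperties and cudaDeviceGetPCIBusId are standard runtime calls, the usage is my own):

Code:
// Sketch: print, for each CUDA device index, the PCI bus ID of the
// slot it sits in, so the index in the error message can be matched
// to a physical slot (and cross-checked against lspci).
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        char busId[32];
        cudaGetDeviceProperties(&prop, dev);
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        // busId has the usual domain:bus:device.function form.
        printf("CUDA device %d: %s  PCI %s\n", dev, prop.name, busId);
    }
    return 0;
}

If I have understood correctly, the same bus ID appears in the "nvidia-smi -q" output, so the two can be cross-checked.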
I must finally add that the system under investigation is particularly sensitive. Probably the sensitive step is the integration of Newton's equations of motion. In fact, if I limit the integration timestep to 0.01 femtoseconds, the hardware behaves better than with a 0.1 femtosecond timestep.
In any event, being able to physically identify the cards with respect to the software used is a must.
Thanks a lot for your patience in reading this long post. Maybe the clarification will be of use to others for other purposes.
Incidentally, if the computation is launched from a GNOME terminal window, memory usage escalates during the computation, taking up nearly all 16 GB. This has always occurred, and it does not occur when launching from the Linux prompt, with no X server.
May I suggest:
1. Extract one card, noting the slot.
2. Get a readout of whether slot 0 or slot 2 has vanished.
3. Crunch some numbers and see whether errors come (a test sketch follows this list).
4. Insert the second card (I would use a different slot, if possible).
5. Repeat step 3.
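To make step 3 concrete, something along these lines would do as a crude burn-in (a sketch only; the kernel is arbitrary arithmetic, not your real workload - run it once per card and stop it with Ctrl-C when satisfied):

Code:
// Sketch: crude burn-in loop for one CUDA device, selected on the
// command line. Watch for "unspecified launch failure".
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void burnKernel(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];
        for (int k = 0; k < 1000; ++k)      // sustained arithmetic load
            x = x * 1.0000001f + 0.0000001f;
        a[i] = x;
    }
}

int main(int argc, char **argv) {
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   // device index to test
    cudaSetDevice(dev);

    const int n = 1 << 22;
    float *d_a = NULL;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));

    for (long iter = 0; ; ++iter) {
        burnKernel<<<(n + 255) / 256, 256>>>(d_a, n);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d failed at iteration %ld: %s\n",
                    dev, iter, cudaGetErrorString(err));
            return 1;
        }
        if (iter % 10000 == 0)
            printf("device %d: iteration %ld OK\n", dev, iter);
    }
}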
While I thank you for your kind answer, the standard way of checking cards one at a time would be very time consuming. The error may arise only after some 10^6 steps. Also, it is the combination of cards that is of interest.
From my side, I was sure that the error message indicates it comes from GPU 0. I left that unsaid, in the hope of getting independent confirmation here from GPU hardware experts.
So, you have the errors from card 0. That tells you which one the system is having trouble with. You need to validate that it works without error on the other card, imho - otherwise you haven't cleared everything else. You have memory, heat, access times, etc. to clear as suspects. Nvidia are inclined to write what they like for Linux these days. If something threw an error after 10^6 maths calculations, would Nvidia notice or care?
You say the system is "sensitive." Sensitive to what and in what way?
You seem to think that an expert can pick out a flaw without breaking a sweat. Experience sometimes allows that, but you're off the beaten track here. I'm fairly hardware 'expert', but never infallible. There's a lot of background knowledge, exact detail, familiarity with the hardware and software, and inspiration and perspiration in the formula. What are your CPU and GPU temperatures like under load?
Just after your last mail, on 14 April, the error disappeared, both GPUs working perfectly. Now, on a different simulation of the same type, the error has suddenly appeared again, exactly as before. It repeats with previously successful files, shortly after the launch (i.e., after about 30,000 steps of the calculation, which can be termed "nearly immediately" in computations of this type).
I have not yet fully carried out what you suggested. However, I have extracted the cards and noted what is printed on them.
The one on PCIEX16-1 reads, on the left (on a superimposed label):
S/N 602-V282-015B1204050555
and, on the right (on a superimposed label):
EAN 4 711072 257415
UPC-A 8 16909 09568 5
The other one, on PCIEX16-2 reads, on the left (directly on the board, no superimposed label):
and, on the right (on a superimposed label):
S/N 912-V801-1233B1204006883 (I confirm, S/N is given twice with different numbers)
EAN 4 719072 260156
UPC-A 8 16909 09645 3
Clearly, there is no relationship with the IDs derived from "nvidia-smi -L", which I reported in the original mail.
Then, I swapped the cards ("nvidia-smi -L" reports the same numbers as given before, but inverted).
Launching the simulation, the same type of error (unspecified launch failure on one GTX) occurred again, "nearly immediately" as above. However, now the error is on Pe 4 (gig64, device 1). [Here Pe 2, as before, or Pe 4, as now, are the blocks of the system distributed for computation according to the parallelization.]
It seems clear to me that the GTX that was previously in PCIEX16-1, and is now in PCIEX16-2 (S/N 602-V282-015B1204050555), is the one giving the error, and that the mainboard is not responsible for the malfunctions. Do you agree?
You may wonder why I have not carried out the computation on a single GTX. I'll do that, following your suggestion, but I first wanted to see what occurs when the hardware is complete.
I thank you very much for the previous suggestions and for any suggestion you may want to offer. Should you believe that I must in any case first check the simulation on single cards, please simply say so. I'll do it.
That will give you so much info you'll be sick of it. There is other stuff in /proc/dri and /sys/bus/pci_express, but watch yourself there because a lot of it is write-only :-/. You can adjust the log level in syslog if necessary for more info at boot-up. See if you can link one card in any way to one slot, as you will hardly tell the two of them apart when both are in. In fact, if you put in _any_ other card, that becomes easy for you:
pcie-0 radeon card
pcie-1 nvidia 680 - you suddenly know which is which.
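And if you want to pin a test run to a slot from software: once you know a slot's PCI bus ID, the CUDA runtime can look the device up directly (a sketch; cudaDeviceGetByPCIBusId is a standard runtime call, and the bus ID below is only a placeholder - substitute the one lspci reports for the slot you want):

Code:
// Sketch: select a CUDA device by the PCI bus ID of its slot, so a
// test is tied to a physical slot regardless of enumeration order.
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    // Placeholder bus ID - replace with your slot's actual value.
    const char *busId = "0000:02:00.0";
    int dev = -1;
    cudaError_t err = cudaDeviceGetByPCIBusId(&dev, busId);
    if (err != cudaSuccess) {
        fprintf(stderr, "lookup failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaSetDevice(dev);
    printf("PCI %s is CUDA device %d\n", busId, dev);
    return 0;
}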
The biggest thing to check is the soldering on the boards.
The brand matters little, imho, because there's an Nvidia GPU slapped onto a board somewhere in China and given to you. Look closely for cracks in the solder, grey patches, or obviously unwise board layout. Soldering is difficult in the vicinity of something that will absorb a lot of heat, a thermal 'black hole'.
Grey and dull (not shiny) solder speaks of overheating. Too little solder is a heat problem also. All joints should be convex, and must have solder. Concave joints mean too little heat was applied. Black is a sign of arcing. Overheated solder becomes resistive, dissipates power and eventually ruptures - the well-known dry joint.
It is possible your reseating of the boards adjusted something physically. Try re-seating the faulty one. It could be a physical problem . . . which would probably be soldering.
You're over my head with the computational stuff there, because I haven't suffered that way yet. You would have saved time by buying another card and trying substitution blindly.
Re-checked by moving the GPU along the slots. The error messages always point to the GPU identified before. Sent for replacement under guarantee.
I thank you for all the advice in an area which is blurred by Nvidia secrets. The statistical dynamics code I currently use (NAMD) is not GNU, but it is free and the source code is provided. However, it is bound to CUDA and, inter alia, we are still waiting for consumer mainboards that can exploit PCIe 3.0 with Nvidia GPUs.
The alternative European statistical dynamics code (GROMACS), which is GNU and in fact available directly in Debian, is making much progress with GPUs (not yet on Kepler), based on OpenCL. Other than allowing less expensive hardware, it gives Linux insight into the hardware and offers a much more diversified code base (scientists using it normally provide the code they develop). It will be - I hope - the choice for my next CPU-GPU hardware.