Identifying which GPU card is causing trouble

chiendarret · 04-06-2013, 12:26 PM

Hello:
I have a Gigabyte X79-UD3 board with an Intel i7-3930K and two GTX 680 cards, running Debian amd64 from the Linux prompt (no X server) for number crunching. For the past few days I have had problems carrying out long computations that gave no trouble before. I wonder whether this is a hardware problem, so I would like to understand which of the two GPU cards is which.

First, how the computation is launched, with respect to the GPUs (log file from the computational code, where gig64 is the machine name):
Pe 1 physical rank 1 will use CUDA device of pe 2
Pe 4 physical rank 4 binding to CUDA device 1 on gig64: 'GeForce GTX 680' Mem: 2047MB Rev: 3.0
Pe 3 physical rank 3 will use CUDA device of pe 4
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX 680' Mem: 2047MB Rev: 3.0
Did not find +devices i,j,k,... argument, using all
Pe 0 physical rank 0 will use CUDA device of pe 2
Info: 8.22656 MB of memory in use based on /proc/self/stat
***********************
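
Editorial sketch: the log above already says which Pe drives which CUDA device, either directly ("binding to CUDA device N") or through another Pe ("will use CUDA device of pe M"). A minimal way to flatten that into one table, assuming the log lines keep exactly the wording shown above (`pe_to_device` is a hypothetical helper name, not part of the computational code):

```shell
#!/bin/sh
# pe_to_device: read a startup log on stdin and print one line per Pe
# with the CUDA device index it ends up using.
pe_to_device() {
  awk '
    # e.g. "Pe 4 physical rank 4 binding to CUDA device 1 on gig64: ..."
    /binding to CUDA device/ {
      for (i = 1; i < NF; i++) if ($i == "device") dev[$2] = $(i + 1)
      order[++n] = $2
    }
    # e.g. "Pe 1 physical rank 1 will use CUDA device of pe 2"
    /will use CUDA device of pe/ {
      owner[$2] = $NF
      order[++n] = $2
    }
    END {
      # Resolve indirect lines only at the end, because a Pe can be
      # mentioned before the Pe that owns its device.
      for (k = 1; k <= n; k++) {
        pe = order[k]
        d = (pe in dev) ? dev[pe] : dev[owner[pe]]
        printf "Pe %s -> CUDA device %s\n", pe, d
      }
    }
  ' | sort -k2,2n
}
```

Usage would be something like `pe_to_device < run.log`, where `run.log` is a placeholder file name. For the log above this puts Pe 0, 1 and 2 on CUDA device 0 and Pe 3, 4 and 5 on CUDA device 1.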

Second, the error message, arising after some 50,000 to 600,000 steps of normal computation:
FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (gig64 device 0): unspecified launch failure.
*******************
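
Editorial sketch: when many runs are collected, the failing Pe and device can be pulled out of each log automatically. This assumes the error line keeps the wording shown above; `failing_device` is a hypothetical helper name.

```shell
#!/bin/sh
# failing_device: read a log on stdin and print the Pe, host and CUDA
# device index named in any "FATAL ERROR: CUDA error" line.
failing_device() {
  sed -n 's/.*FATAL ERROR: CUDA error .* on Pe \([0-9][0-9]*\) (\([^ ]*\) device \([0-9][0-9]*\)).*/pe=\1 host=\2 device=\3/p'
}
```

For the message above this prints `pe=2 host=gig64 device=0`. The number is the CUDA device index as seen by the program, which still has to be tied to a physical card.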

Third, the information I was able to gather about the hardware:

nvidia-smi -L
GPU0 UID 600f64d0-2996-8e71-dca8-8d66f139f772
GPU1 UID 704bb625-95a7-8779-cfdc-14a90e6581fc

nvidia-smi
driver v. 304.48
0 GTX 680 Bus-Id 0000:02:00.0 mem-usage 4% 89MB/2047MB
1 GTX 680 Bus-Id 0000:03:00.0 mem-usage 5% 93MB/2047MB
**********************
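
Editorial sketch: the Bus-Id column is the most useful handle here, because the same PCI address appears in `lspci` and under `/sys/bus/pci/devices`. Depending on the driver version, nvidia-smi may print the address with a longer domain field or in upper case, so a small normalising step helps. The helper names below are hypothetical.

```shell
#!/bin/sh
# normalize_bus_id: turn a bus ID as printed by nvidia-smi into the
# domain:bus:device.function form used by "lspci -D" and sysfs
# (4-digit domain, lower-case hex).
normalize_bus_id() {
  printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed -E 's/^[0-9a-f]*([0-9a-f]{4}:)/\1/'
}

# gpu_sysfs_path: print the sysfs directory that belongs to a bus ID.
gpu_sysfs_path() {
  printf '/sys/bus/pci/devices/%s\n' "$(normalize_bus_id "$1")"
}
```

With the values above, `lspci -D -s "$(normalize_bus_id 0000:02:00.0)"` should show the card that nvidia-smi lists as GPU 0. One caveat: the order in which a CUDA program numbers its devices is not guaranteed to match the nvidia-smi order. Later CUDA releases (7.0 and newer) let you force PCI bus order by setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` in the environment.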

As far as I can tell, nvidia-smi is cryptic about Kepler, and Linux itself does not tell me anything about the GPUs.

My specific question is how to link the launch log and the failure message to the information about the GPUs. I expect that there are other commands that reveal more. In short, I am trying to understand which physical GPU the Pe 2 error came from, in order to replace the card if there is any indication that it is damaged.

Finally, I must add that the system under investigation is particularly sensitive. Probably the sensitive step is the integration of Newton's equations of motion. In fact, if I limit the integration step to 0.01 femtoseconds, the hardware works better than with a step of 0.1 femtoseconds.

In any event, being able to physically identify the cards with respect to the software used is a must.

Thanks a lot for your patience in reading this long post. Perhaps a clarification will be useful to others for other purposes.

Incidentally, if the computation is launched from a GNOME terminal window, memory usage escalates during the computation, taking up nearly all 16GB. This has always occurred, and it does not occur when launching from the Linux prompt with no X server.

chiendarret

business_kid · 04-07-2013, 03:22 PM

OK. You haven't made life easy for yourself.

May I suggest
1. Extract one card, noting the slot
2. Get a readout of whether slot 0 or slot 2 has vanished.
3. Crunch some numbers and see whether errors come
4. Insert the second card (I would use a different slot, if possible).
5. Repeat 3.
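
Editorial sketch: a software-only approximation of steps 1 to 5 is to restrict the run to one CUDA device at a time with the `+devices` argument that the startup log mentions. The sketch below only prints the command lines. It assumes the code is NAMD (named later in the thread) with a `namd2` binary on the PATH, six worker threads as in the log, and a configuration file called `sim.conf`. All three are placeholders to adapt.

```shell
#!/bin/sh
# single_gpu_cmd DEVICE CONFIG: print a command line that runs the
# simulation on one CUDA device only and keeps a separate log per device.
# "namd2", "+p6" and the log file name are assumptions, not taken from
# the original poster's actual launch command.
single_gpu_cmd() {
  dev=$1
  conf=$2
  printf 'namd2 +p6 +devices %s %s > run_dev%s.log 2>&1\n' "$dev" "$conf" "$dev"
}

# Dry run: show the two commands, one per card.
for dev in 0 1; do
  single_gpu_cmd "$dev" sim.conf
done
```

If the failure shows up only in one of the two logs, that points at the corresponding device index without opening the case. It does not replace the physical swap, since it cannot tell a bad card from a bad slot.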

chiendarret · 04-09-2013, 02:38 AM

While I thank you for your kind answer, the standard way of checking cards, one at a time, would be very time consuming. The error may arise after some 10^6 steps. Also, it is the combination of cards that is of interest.
For my part, I was sure that the error message says that it comes from GPU0. I left that unsaid, in the hope of getting independent confirmation here from GPU hardware experts.
chiendarret

business_kid · 04-10-2013, 04:13 AM

So, you have the errors from card 0. That tells you which one the system is having trouble with. You need to validate that it works without error on the other card, imho - otherwise you haven't cleared everything else. You have memory, heat, access times, etc. to clear as suspects. Nvidia are inclined to write what they like for linux these days. If something threw an error after 10^6 calculations on maths processing, would nvidia notice or care?

You say the system is "sensitive." Sensitive to what and in what way?

You seem to think that an expert can pick out a flaw without breaking a sweat. Experience sometimes allows that, but you're off the beaten track here. I'm fairly hardware 'expert', but never infallible. There's a lot of background knowledge, exact details and familiarity with the hardware & software, inspiration & perspiration in the formula. What are your cpu & gpu temperatures like under load?
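
Editorial sketch for the temperature question: on drivers whose nvidia-smi supports `--query-gpu`, temperatures can be logged while the job runs and summarised afterwards. Older drivers may only offer `nvidia-smi -q -d TEMPERATURE`, and on some GeForce cards the value may be reported as N/A. The helper name and the log file name are hypothetical.

```shell
#!/bin/sh
# Logging (run in a second terminal while the computation is going):
#   nvidia-smi --query-gpu=index,temperature.gpu \
#              --format=csv,noheader,nounits -l 5 >> gpu_temp.log
#
# max_temp_per_gpu: read "index, temperature" lines on stdin and print
# the highest temperature seen for each GPU index.
max_temp_per_gpu() {
  awk -F', *' '
    # Skip anything that is not "<index>, <integer>", e.g. N/A readings.
    NF >= 2 && $2 ~ /^[0-9]+$/ {
      if (!($1 in max) || $2 + 0 > max[$1]) max[$1] = $2 + 0
    }
    END { for (g in max) printf "GPU %s peak %d C\n", g, max[g] }
  ' | sort
}
```

Running `max_temp_per_gpu < gpu_temp.log` after a failed run shows whether the card named in the error was also the hotter one.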

chiendarret · 05-05-2013, 11:34 AM

Just after your last mail, 14 April, the error disappeared, with both GPUs working perfectly. Now, on a different simulation of the same type, the error has suddenly appeared again, exactly as before. It also repeats with previously successful files, shortly after the launch (i.e., after about 30,000 steps of the calculation, which can be termed "near immediately" in computations of this type).

I have not yet fully carried out what you suggested. However, I have extracted the cards and noted what is written on them.

The one on PCIEX16-1 reads, on the left (on a superimposed label):
C416383447
N1996
and, on the right (on a superimposed label):
N680GTX-PM202GD6
S/N-602-V282-015B1204050 555
EAN 4 711072 257415
UPC-A 8 16909 09568 5
HDMI

The other one, on PCIEX16-2 reads, on the left (directly on the board, no superimposed label):
S/N 0421312026933
GTX 680
and, on the right (on a superimposed label):
N680GTX-PM2D2GD5
S/N 912-V801-1233B1204006883 (I confirm, S/N is given twice with different numbers)
EAN 4 719072 260156
UPC-A 8 16909 09645 3

Clearly, there is no relationship with the IDs obtained from "nvidia-smi -L", which I reported in the original post.

Then I swapped the cards ("nvidia-smi -L" reports the same numbers given before, in swapped order).

Launching the simulation, the same type of error (unspecified launch failure on one GTX) occurred again, "nearly immediately" as above. However, now the error is on Pe 4 (gig64 device 1, where gig64 is the computer name). [Here Pe 2, as before, and Pe 4, as now, are the blocks of the system distributed for computation according to the parallelization.]

It seems clear to me that the GTX that was previously at PCIEX16-1, and now at PCIEX16-2 (S/N-602-V282-015B1204050 555) is the one giving the error. And the mainboard is not responsible for malfunctions. Do you agree?

You may wonder why I have not carried out the computation on a single GTX. I'll do that, according to your suggestion, but I first wanted to see what occurs when the hardware is complete.

I thank you very much for your previous suggestions and for any further suggestion that you may want to offer. Should you believe that I must first check the simulation on single cards anyway, please simply say so. I will do it.

chiendarret

chiendarret · 05-05-2013, 01:15 PM

I add, if relevant, that both cards are MSI, bought at the same time from the same local dealer, who ordered them for me.

Is MSI a good brand? I previously had Zotac GPUs, with no problems in number crunching.

thanks

business_kid · 05-05-2013, 02:20 PM

To identify them, run

sudo lspci -vvn > some_file
less some_file

That will give you so much info you'll be sick of it. There is other stuff in /proc/dri and /sys/bus/pci_express, but watch yourself there because a lot of stuff is write only :-/. You can adjust the log level in syslog if necessary for more info. At boot-up, see if you can link either card to one slot, as it will hardly find the two of them together. In fact, if you put in _any_ other card, that will become easy for you:
pcie-0 radeon card
pcie-1 nvidia 680 - you suddenly know which is which.
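
Editorial sketch: on some systems the kernel also exposes physical slot numbers under `/sys/bus/pci/slots`, where each slot directory holds an `address` file with the PCI address of whatever sits in it. Whether those directories exist depends on the firmware and on which kernel modules are loaded, so treat this as something to try rather than a guarantee. `list_pci_slots` is a hypothetical helper name.

```shell
#!/bin/sh
# list_pci_slots [DIR]: print "slot NAME -> PCI address" for every slot
# directory that has a readable "address" file. DIR defaults to the
# real sysfs location and can be overridden for testing.
list_pci_slots() {
  dir=${1:-/sys/bus/pci/slots}
  for s in "$dir"/*; do
    # An empty or missing directory leaves the glob unexpanded; skip it.
    [ -r "$s/address" ] || continue
    printf 'slot %s -> %s\n' "$(basename "$s")" "$(cat "$s/address")"
  done
}
```

The addresses printed here can be compared with the Bus-Id column of nvidia-smi, and with `lspci -D -nn -d 10de:`, which lists only devices with the NVIDIA vendor ID.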

The biggest thing to check is the soldering on the boards.
The brand matters little, imho, because there's an Nvidia gpu slapped onto a board somewhere in China and given to you. Look closely for cracks in solder, grey patches, or obviously unwise board layout. Soldering is difficult in the vicinity of something that will absorb a lot of heat, a thermal 'black hole.'
Grey and dull (not shiny) solder speaks of overheating. Too little solder is a heat problem also. All joints should be convex, but have solder. Concave joints mean too little heat was around. Black is a sign of arcing. Overheated solder becomes resistive, dissipates power and eventually ruptures - the well known dry joint.

It is possible your reseating of the boards adjusted something physically. Try re-seating the faulty one. It could be a physical problem . . . which would probably be soldering.

You're over my head with the computational stuff there, because I haven't suffered that way yet. You would have saved time by buying another card and trying substitution blindly.

chiendarret · 05-06-2013, 11:50 AM

I re-checked by moving the GPUs between the slots. The error messages always point to the GPU identified before. I have sent it for replacement under guarantee.

I thank you for all the advice in an area which is blurred by NVIDIA secrets. The statistical dynamics code I currently use (NAMD) is not GNU, but it is free and the code is provided. However, it is bound to CUDA and, inter alia, we are still waiting for consumer mainboards that can exploit the PCIEX 3.0 NVIDIA GPUs.

The alternative European statistical dynamics code (GROMACS), which is GNU, in fact available directly with Debian, is making much progress with GPUs (not yet with Kepler), based on OpenCL. Other than allowing for less expensive hardware, it allows Linux insight into the hardware and offers much more diversified code (scientists using it normally provide the code they develop). It will be, I hope, the choice for my next CPU-GPU hardware.