Though this is a recurring question, I have a specific one. The error below occurs on a Debian 11 box (six cores, two GPUs) while attempting to run code that requires access to the GPUs (which were permanently activated):
free
total used free shared buff/cache available
Mem: 16309048 1015008 14203664 9632 1090376 14991136
Swap: 39059452 0 39059452
I could provide the output of lshw or other tools if useful.
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 53 times over 0.005572 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 53 times over 0.005572 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
SPECIFIC QUESTION: This is a legacy-BIOS Debian installation. As far as I understand, memtest no longer works when launched from disk; installed here, it seems to require a UEFI installation. If true, could you please suggest a workaround to check the integrity of the memory?
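One workaround I could try in the meantime, assuming the memtester package from the Debian repositories is adequate: it tests RAM from userspace, so it cannot reach memory already claimed by the kernel, but it needs neither UEFI nor a reboot.
Code:
apt-get install memtester        # as root; small userspace RAM tester from the Debian repos
memtester 12G 3                  # test ~12 GiB (leave headroom for the OS) over 3 passes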
PCI:0:1:0:0: is typically the address of a single NVidia graphics device.
PCI:0:0:2:0: is typically the address of the integrated Intel graphics.
To the best of my understanding, with newer hardware NVidia graphics devices are supplements to the integrated graphics. Everything that reaches the monitor comes through the Intel chip. CUDA is an NVidia thing.
Quote:
FATAL ERROR: CUDA
Pe 2 (gig64 device 0 pci 0:2:0)
Pe 4 (gig64 device 1 pci 0:3:0)
Looks to me like there are two NVidia devices trying to send data through two Intel graphics devices, because those addresses would be Intel addresses.
EDIT: Do they exist? The first one should.
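A quick way to check which addresses actually exist would be to compare the kernel's PCI view with what the driver reports; a sketch (the --query-gpu fields assume a reasonably recent nvidia-smi):
Code:
lspci | grep -iE 'vga|3d|nvidia'                              # what the PCI bus actually holds
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv    # what the driver reports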
The upgrade from Debian 10 to 11 was carried out a couple of months ago, which is probably when the software problems were introduced. They only surfaced today because I usually work on a remote cluster from my vintage laptop rather than on this box.
Could you please suggest what would best be done, or the information I should provide?
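For a start, here is roughly what I could collect; the commands are standard, though nvidia-bug-report.sh is only present if it was shipped with the driver packages:
Code:
dmesg | grep -iE 'nvrm|xid'     # Xid lines from the kernel module often accompany illegal-access errors
nvidia-bug-report.sh            # if present: bundles driver logs into nvidia-bug-report.log.gz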
which deleted the current 460.91.03 driver but reinstalled the same version, finishing with
DKMS install completed
To no avail; the box responds as slowly as before. I can only agree with post #6, but the problem for me is finding the time. It might be faster to reinstall the system on a non-EFI board.
I mean that for kernel 5.10.0-10-amd64 it could only install that nvidia driver. Before this issue, I had found that Debian takes care to avoid any such mismatch. For me it was easier in the early '90s with Debian.
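To confirm which driver DKMS actually built against which kernel, I can check the following (a sketch; on Debian the module may be registered as nvidia-current rather than nvidia):
Code:
dkms status                                  # e.g. "nvidia-current, 460.91.03, 5.10.0-10-amd64, x86_64: installed"
modinfo nvidia 2>/dev/null | grep ^version   # version of the module the kernel would actually load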
nvidia-detect and nvidia-smi both detect the two GTX 680s as permanently activated, report that they are supported by all current nvidia drivers, and that legacy drivers are only needed for cards from the 400 series downwards.
I have also run nvidia-xconfig and 'dpkg-reconfigure -p low xserver-xorg'
The box boots quickly and everything runs smoothly, but when trying a simulation with NAMD the same illegal memory access occurs.
I have now submitted the problem to both the Debian and NAMD forums, because I should not be alone if the problem arises merely from software. From the web I found old posts about problems with NAMD12 and nvidia, as well as with Debian 11 and nvidia, but without details and, more importantly, without solutions.
Should that not help, I'll probably decide to remove Debian 11 and install 10, if still available, although with this RAID1 box that requires attention, particularly for people who, like me, only do system maintenance occasionally.
On further thought, it is quite possible that the issues I described arise specifically with the molecular dynamics simulations in NAMD. Such simulations run correctly (albeit much too slowly) on my vintage Sony Vaio with the same Debian 11, and also (rapidly but at a cost) on the remote cluster.
On this basis, I am now testing the GPUs with glxgears, observing "301 frames in 5.0 seconds = 60.005 FPS".
Does that suggest anything?
Unfortunately glmark2 is not in the Debian repositories. Instead of building it from source, and to save time, could you suggest another tool for benchmarking the GPUs on Linux?
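One caveat I should perhaps rule out first: glxgears is normally capped by vertical sync, which would explain a reading pinned at ~60 FPS regardless of GPU speed. A quick check, assuming the proprietary NVIDIA driver (which honours the __GL_SYNC_TO_VBLANK variable):
Code:
__GL_SYNC_TO_VBLANK=0 glxgears   # proprietary NVIDIA driver: disable vsync
vblank_mode=0 glxgears           # Mesa drivers: the equivalent knob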
thanks
Today a new kernel, 5.10.0-11-amd64, arrived, and everything was rebuilt during the upgrade.
However, on trying the same simulation as before with NAMD, the same error message appears:
TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 10 times over 0.001123 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 10 times over 0.001123 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
Depending upon where you installed the nvidia driver from, its version is not likely to change unless you manually update it. In my experience, the driver downloaded from the nvidia site always has to be manually updated for each kernel change and each driver version change. Thus, if you downloaded and installed the 460.xx driver, then unless you download the later driver (470.xx) you will always have the 460.xx driver on your system. This is true even if you recompile/reinstall it with the updated kernel.
The CUDA driver also has to match the installed nvidia GPU driver, since they work together.
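A quick way to verify that the pieces agree (a sketch; note that the CUDA version printed by nvidia-smi is the maximum the driver supports, not necessarily what your application was built against):
Code:
nvidia-smi | head -n 4               # header shows driver version and highest supported CUDA version
cat /proc/driver/nvidia/version      # loaded kernel module; should agree with what nvidia-smi reports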
Thanks for your comments. Since this whole procedure has been automated by Debian, I stopped downloading the nvidia driver myself and rely on Debian to do it.
My interest in this two-680 box is limited because the suite (NAMD) that I use for computations has only partly been ported to GPUs; forces are still computed by the CPUs. This means that when quantum mechanics comes into play, the six cores of this box are insufficient (the minimum for the system I am investigating is 48 cores on a single node of the remote cluster).
Nonetheless, this box was useful for preparing the system for the remote cluster, since only classical mechanics comes into play there. Before trying a new driver/kernel, and to save time, I would like to carry out the most serious stress test of the GPUs that is currently possible on Linux. The test I posted a few days ago (glxgears, observing "301 frames in 5.0 seconds = 60.005 FPS") probably only tells us that the GPUs are present and doing their minimum service, but hardly anything about their performance relative to what the classical part of the NAMD simulations requires.
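For a stress test closer to what NAMD does (CUDA compute rather than OpenGL), two options I am considering are cuda_memtest and gpu-burn; neither is packaged in Debian, so a small build from source seems unavoidable. A sketch with gpu-burn, assuming the CUDA toolkit and git are installed:
Code:
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn && make                  # needs nvcc from the CUDA toolkit
./gpu_burn 120                       # hammer all visible GPUs for 120 s, reporting errors per GPU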