Illegal mem access

chiendarret · 01-16-2022, 05:27 AM

Though this is a recurring question, I have a specific one. The error below occurs on a Debian 11 box (six cores, two GPUs) while attempting to run code that requires access to the GPUs (which were permanently activated)

free
              total        used        free      shared  buff/cache   available
Mem:       16309048     1015008    14203664        9632     1090376    14991136
Swap:      39059452           0    39059452

I can provide the output of lshw or other tools if needed.

FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 53 times over 0.005572 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 53 times over 0.005572 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered

SPECIFIC QUESTION: This is a BIOS-based (non-UEFI) Debian installation. As far as I can understand, memtest no longer works from disk: installed here, it seems to require a UEFI installation. If true, could you please suggest a workaround to check the integrity of the memory?

thanks
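
Pending a proper bootable tester, a rough user-space check can at least exercise part of the RAM. The sketch below is Python and only illustrates the write-then-verify idea. `pattern_test` is a hypothetical helper, it touches only the memory the kernel hands to one process, and it is no substitute for memtest86+ or the `memtester` package.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill a buffer with each pattern byte, read it back, and return
    a list of (pattern, offset) pairs that did not match."""
    buf = bytearray(size_bytes)
    failures = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected              # write phase
        if buf != expected:            # fast whole-buffer comparison
            # Slow path: locate the individual bad bytes.
            failures.extend(
                (pattern, offset)
                for offset in range(size_bytes)
                if buf[offset] != pattern
            )
    return failures


if __name__ == "__main__":
    # 64 MiB is an arbitrary size chosen for illustration; raise it as
    # free RAM allows (the check needs roughly twice this amount).
    bad = pattern_test(64 * 1024 * 1024)
    print("mismatches:", len(bad))
```

An empty result says only that this one buffer behaved; it does not clear the rest of the RAM, and it says nothing about GPU memory.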

Brains · 01-16-2022, 05:53 AM

Quote:

(which were permanently activated)

PCI:0:1:0:0: is typically the address of a single NVidia graphics device.
PCI:0:0:2:0: is typically the address of the integrated Intel graphics.

To the best of my understanding, with newer hardware NVidia graphics devices are supplements to the integrated graphics. Everything that reaches the monitor comes through the Intel chip. CUDA is an NVidia thing.

Quote:

FATAL ERROR: CUDA
Pe 2 (gig64 device 0 pci 0:2:0)
Pe 4 (gig64 device 1 pci 0:3:0)

Looks to me like there are two NVidia devices trying to send data through two Intel graphics devices...
Because those addresses would be Intel addresses.
EDIT: Do they exist? The first one should.

pan64 · 01-16-2022, 07:43 AM

illegal memory access is a software error, not a hardware (memory) problem.

chiendarret · 01-16-2022, 10:28 AM

The upgrade from Debian 10 to 11 was carried out a couple of months ago, which is probably when the software problems were introduced. They surfaced only today because I usually work on a remote cluster from my vintage laptop rather than on this box.

Could you please suggest what should be best done? Or the info I should provide?

thanks a lot

pan64 · 01-16-2022, 10:36 AM

that was post #2, looks like video driver related issue.

rtmistler · 01-16-2022, 10:47 AM

CUDA is NVidia's compute platform, and it runs on top of their driver.

You could try an older version or newer version of their driver.

You could download the matching source, build it, and try to debug.

I'd see what version Debian 10 uses and try to downgrade to that one.

If it's the same version, then it would seem that there are interaction issues with a newer OS.

chiendarret · 01-16-2022, 01:11 PM

I tried with
dpkg-reconfigure nvidia-kernel-dkms

which removed the current 460.91.03 but reinstalled the same version, finishing with

DKMS install completed

to no avail: the box is still responding very slowly, as before. I can only agree with post #6, but the problem for me is finding the time. Perhaps faster to reinstall the system on non efi board.

thanks a lot

chiendarret · 01-16-2022, 01:30 PM

I mean that for kernel 5.10.0-10-amd64 it could only install that nvidia driver. Before this issue, I found that Debian takes care to avoid any such mismatch. For me it was easier in the early 90s with Debian.

pan64 · 01-17-2022, 01:17 AM

Quote:

Originally Posted by chiendarret

Perhaps faster to reinstall the system on non efi board

I don't think it is related. I would still try to install another kernel and/or video driver.

chiendarret · 01-17-2022, 04:11 AM

I have now carried out a number of checks

nvidia-detect and nvidia-smi both detect the two GTX 680 cards as permanently activated, and report that they are supported by all nvidia drivers and that legacy drivers are needed only for cards from 400 downwards.

I have also run nvidia-xconfig and 'dpkg-reconfigure -p low xserver-xorg'

The box now boots quickly and everything runs smoothly, but when trying a simulation with NAMD the same illegal memory access occurs.

I have now submitted the problem to both the Debian and NAMD forums, because I should not be alone if the problem arises merely from software. On the web I found only old posts about problems with NAMD12 and nvidia, as well as with Debian 11 and nvidia, without details and, more importantly, without solutions.

Should that not help, I'll probably decide to remove Debian 11 and install 10, if available, although with this raid1 box that requires attention, particularly from people who, like me, do system maintenance only occasionally.

chiendarret · 01-20-2022, 04:31 AM

On further thought, it is quite possible that the issues I described arise specifically with molecular dynamics simulations in NAMD. Such simulations run correctly (albeit much too slowly) on my vintage Sony Vaio with the same Debian 11, and also (rapidly but at a cost) on the remote cluster.

On this basis, I am now testing the GPUs with glxgears, observing "301 frames in 5.0 seconds = 60.005 FPS"

Does that suggest anything?
Unfortunately glmark2 is not in the Debian repositories. Instead of building it from source, to save time could you suggest another tool for benchmarking the GPUs on Linux?
thanks
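
On the glxgears figure: glxgears normally runs synchronized to the monitor's vertical refresh, so a reading of almost exactly 60 FPS most likely reflects the display rate rather than GPU throughput. A small Python sketch of that check follows; the list of refresh rates and the tolerance are assumptions chosen for illustration.

```python
import re

# Matches lines such as "301 frames in 5.0 seconds = 60.005 FPS".
GEARS_RE = re.compile(
    r"(?P<frames>\d+) frames in (?P<secs>[\d.]+) seconds\s*=\s*"
    r"(?P<fps>[\d.]+) FPS"
)


def parse_glxgears(line):
    """Return (frames, seconds, fps) from one glxgears report line."""
    m = GEARS_RE.search(line)
    if m is None:
        raise ValueError("not a glxgears report line: %r" % line)
    return int(m["frames"]), float(m["secs"]), float(m["fps"])


def looks_vsync_capped(fps, refresh_rates=(50, 60, 75, 120, 144),
                       tolerance=0.5):
    """True if fps sits within `tolerance` of a common refresh rate."""
    return any(abs(fps - rate) <= tolerance for rate in refresh_rates)


frames, secs, fps = parse_glxgears("301 frames in 5.0 seconds = 60.005 FPS")
print(fps, "vsync-capped?", looks_vsync_capped(fps))
```

With the NVIDIA driver, running `__GL_SYNC_TO_VBLANK=0 glxgears` lifts the cap, though even then glxgears says little about compute performance.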

chiendarret · 01-20-2022, 11:33 AM

Today new kernel 5.10.0-11-amd64, everything rebuilt while upgrading.
However, on trying the same simulation as before with NAMD, I get the same error message:

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 10 times over 0.001123 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 10 times over 0.001123 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
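
For anyone comparing runs, the NAMD output above can be reduced to one entry per failing device. This is a minimal Python sketch, assuming the log lines keep the shape quoted in this thread; the function name and regular expression are illustrative and not part of NAMD.

```python
import re

# Matches the "on Pe N (host device D pci X:Y:Z): message" fragment that
# appears in both kinds of FATAL ERROR output quoted in this thread.
PE_RE = re.compile(
    r"on Pe (?P<pe>\d+) \((?P<host>\S+) device (?P<device>\d+) "
    r"pci (?P<pci>[0-9a-fA-F:]+)\): (?P<message>.+)$"
)


def summarize_cuda_errors(log_text):
    """Return unique (pe, host, device, pci, message) tuples, in order."""
    seen = []
    for line in log_text.splitlines():
        match = PE_RE.search(line.strip())
        if match is None:
            continue
        entry = (int(match["pe"]), match["host"], int(match["device"]),
                 match["pci"], match["message"])
        if entry not in seen:      # NAMD prints each error twice
            seen.append(entry)
    return seen


SAMPLE_LOG = """\
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 10 times over 0.001123 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
"""

for entry in summarize_cuda_errors(SAMPLE_LOG):
    print(entry)
```

Both devices report the same failure, which is at least consistent with the software explanation given earlier in the thread.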

chiendarret · 01-20-2022, 11:48 AM

Meanwhile, the nvidia driver remained at 460.91.03 even after reconfiguration.

computersavvy · 01-20-2022, 07:57 PM

Depending upon where you installed the nvidia driver from, its version is not likely to change unless you manually update it. In my experience, the driver downloaded from the nvidia site always has to be manually updated for each changed kernel and each driver version change. Thus if you downloaded and installed the 460.xx driver then unless you download the later driver (470.xx) you will always have the 460.xx driver on your system. This is true even if you recompile/reinstall it with the updated kernel.

The cuda driver also has to match the installed nvidia GPU driver since they work together.
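
To see which driver is actually loaded after a kernel change, the version can be read from `nvidia-smi` and compared against whatever minimum the CUDA build in use requires. A Python sketch follows; the sample output line and the minimum versions passed in are hypothetical placeholders, not values taken from NVIDIA's or NAMD's documentation.

```python
import subprocess

# nvidia-smi query that prints one "name, driver_version" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"]


def parse_query_output(text):
    """Turn 'name, version' CSV lines into (name, version_tuple) pairs."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, version = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, tuple(int(part) for part in version.split("."))))
    return gpus


def driver_at_least(version_tuple, minimum):
    """Compare dotted versions as integer tuples, e.g. (460, 91, 3)."""
    return version_tuple >= minimum


def query_installed():
    """Run nvidia-smi on this machine (requires the driver to be loaded)."""
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    return parse_query_output(out)


# Hypothetical sample, shaped like the query output for a two-card box.
SAMPLE = "GeForce GTX 680, 460.91.03\nGeForce GTX 680, 460.91.03\n"

for name, version in parse_query_output(SAMPLE):
    # (470,) is a placeholder minimum used only to show the comparison.
    print(name, version, driver_at_least(version, (470,)))
```

On the real machine, `query_installed()` replaces the sample string; the minimum to compare against has to come from the release notes of the CUDA or NAMD build being run.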

chiendarret · 01-22-2022, 02:50 AM

Thanks for your comments. Since all this procedure has been automated by debian, I stopped downloading the nvidia driver and relied on debian to do that.

My interest in this two-680 box is limited because the suite (NAMD) that I use for computations has been only partly ported to GPUs. Forces are still computed by the CPUs. This means that when quantum mechanics comes into play, the mere six cores of this box are insufficient (the minimum for the system I am investigating is 48 cores of a single node of the remote cluster).

Nonetheless, this box was useful in preparing the system for the remote cluster, as only classical mechanics comes into play there. Before trying a new driver/kernel, to save time I would like to carry out the most serious stress test of the GPUs that is possible today on Linux. The test that I posted a few days ago (glxgears, observing "301 frames in 5.0 seconds = 60.005 FPS") probably tells only that the GPUs are there and do their minimum service, but hardly more about their performance in relation to what the classical part of the simulations with NAMD requires.

Could you please suggest one such test?
thanks