LinuxQuestions.org
Old 01-16-2022, 05:27 AM   #1
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Rep: Reputation: 16
Illegal mem access


Though this is a recurring topic, I have a specific question. The error below occurs on a Debian 11 box (six cores, two GPUs) while attempting to run code that requires access to the GPUs (which were permanently activated).

free
              total        used        free      shared  buff/cache   available
Mem:       16309048     1015008    14203664        9632     1090376    14991136
Swap:      39059452           0    39059452


I can provide the output of lshw or other tools if needed.


FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 53 times over 0.005572 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 53 times over 0.005572 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered

SPECIFIC QUESTION: This is a BIOS-dependent Debian installation. As far as I can understand, memtest no longer works from disk. As installed here, it seems to require a UEFI installation. If that is true, could you please suggest a workaround to check the integrity of the memory?

thanks
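
While no bootable memory tester is available, a userspace pattern test can serve as a rough workaround. The sketch below is only an illustration in Python: it exercises whatever pages the kernel hands to the process, not all of the RAM and not the GPU memory, and the function name `pattern_test` and the buffer size are made up for this example. A dedicated userspace tool such as memtester does the same kind of check far more thoroughly.

```python
def pattern_test(size_bytes, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each byte pattern into a buffer, read it back, report mismatches.

    Returns a list of (pattern, offset) pairs, one per pattern that failed,
    giving the first bad offset found (or None if it could not be located
    on the second pass). An empty list means no mismatch was seen.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * size_bytes
        buf[:] = expected                      # write phase
        if buf != expected:                    # read-back phase
            offset = next(
                (i for i in range(size_bytes) if buf[i] != pattern), None
            )
            errors.append((pattern, offset))
    return errors


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MiB, an arbitrary size for this sketch
    bad = pattern_test(size)
    print("no mismatch seen" if not bad else f"mismatches: {bad}")
```

A clean run proves little, since only a small part of the memory is touched; a reported mismatch, on the other hand, would be a strong hint of a hardware problem.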
 
Old 01-16-2022, 05:53 AM   #2
Brains
Senior Member
 
Registered: Apr 2009
Distribution: All OS except Apple
Posts: 1,591

Rep: Reputation: 389
Quote:
(which were permanently activated)
PCI:0:1:0:0: is typically the address of a single NVidia graphics device.
PCI:0:0:2:0: is typically the address of the integrated Intel graphics.

To the best of my understanding, with newer hardware NVidia graphics devices are supplements to the integrated graphics. Everything that reaches the monitor comes through the Intel chip. CUDA is an NVidia thing.
Quote:
FATAL ERROR: CUDA
Pe 2 (gig64 device 0 pci 0:2:0)
Pe 4 (gig64 device 1 pci 0:3:0)
Looks to me like there are two NVidia devices trying to send data through two Intel graphics devices...
Because those addresses would be Intel addresses.
EDIT: Do they exist? The first one should.
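
To check whether those devices exist, the device entries can be pulled out of the error text and compared by hand with what `lspci` lists. This is a minimal sketch, assuming the messages keep the "Pe N (host device N pci X:Y:Z)" shape seen in the quoted output. The helper name `devices_in_log` is made up for illustration, and how the three-field pci string maps onto the bus:device.function notation of lspci is not confirmed here.

```python
import re

# Matches the device part of the messages quoted in this thread,
# e.g. "on Pe 2 (gig64 device 0 pci 0:2:0)".
DEVICE_RE = re.compile(
    r"Pe (?P<pe>\d+) \((?P<host>\S+) device (?P<device>\d+) "
    r"pci (?P<pci>[0-9a-fA-F]+:[0-9a-fA-F]+:[0-9a-fA-F]+)\)"
)


def devices_in_log(text):
    """Return the sorted, de-duplicated (device_index, pci_string) pairs in text."""
    found = set()
    for match in DEVICE_RE.finditer(text):
        found.add((int(match.group("device")), match.group("pci")))
    return sorted(found)


if __name__ == "__main__":
    sample = (
        "on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered\n"
        "on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered\n"
        "FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling "
        "53 times over 0.005572 s on Pe 4 (gig64 device 1 pci 0:3:0): "
        "an illegal memory access was encountered\n"
    )
    for device, pci in devices_in_log(sample):
        print(f"CUDA device {device} reported at pci {pci}")
```

The repeated lines collapse into one entry per device, so the output is a short list to hold against the lspci listing.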

Last edited by Brains; 01-16-2022 at 05:55 AM.
 
1 members found this post helpful.
Old 01-16-2022, 07:43 AM   #3
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,850

Rep: Reputation: 7309
An illegal memory access is a software error, not a hardware (memory) problem.
 
1 members found this post helpful.
Old 01-16-2022, 10:28 AM   #4
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
The upgrade from Debian 10 to 11 was carried out a couple of months ago, which is probably when the software problems were introduced. They surfaced only today because I usually work on a remote cluster from my vintage laptop rather than on this box.

Could you please suggest what would be best to do, or what info I should provide?

thanks a lot
 
Old 01-16-2022, 10:36 AM   #5
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,850

Rep: Reputation: 7309
That was in post #2; it looks like a video driver related issue.
 
Old 01-16-2022, 10:47 AM   #6
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,882
Blog Entries: 13

Rep: Reputation: 4930
CUDA is NVidia's compute platform and runs on top of their driver.

You could try an older version or newer version of their driver.

You could download the matching source, build it, and try to debug.

I'd see what version Debian 10 uses and try to downgrade to that one.

If it's the same version, then it would seem that there are interaction issues with a newer OS.
 
1 members found this post helpful.
Old 01-16-2022, 01:11 PM   #7
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
I tried with
dpkg-reconfigure nvidia-kernel-dkms

which removed the current 460.91.03 but reinstalled the same version, finishing with

DKMS install completed

To no avail: the box still responds very slowly, as before. I can only agree with post #6, but the problem for me is finding the time. Perhaps faster to reinstall the system on non efi board

thanks a lot
 
Old 01-16-2022, 01:30 PM   #8
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
I mean that for kernel 5.10.0-10-amd64 it could only install that nvidia driver. Until this issue, I had found that Debian takes care to avoid any such mismatch. For me it was easier in the early 90s with Debian.
 
Old 01-17-2022, 01:17 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,850

Rep: Reputation: 7309
Quote:
Originally Posted by chiendarret
Perhaps faster to reinstall the system on non efi board
I don't think it is related. I would still try to install another kernel and/or video driver.
 
Old 01-17-2022, 04:11 AM   #10
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
I have now carried out a number of checks

nvidia-detect and nvidia-smi both detect the two GTX 680 cards as permanently activated, and report that they are supported by all nvidia drivers and that legacy drivers are only needed for cards from the 400 series downwards.

I have also run nvidia-xconfig and 'dpkg-reconfigure -p low xserver-xorg'

The box now boots quickly and everything runs smoothly, but when trying a simulation with NAMD the same illegal memory access occurs.

I have now submitted the problem to both the Debian and NAMD forums, because I should not be alone if the problem arises merely from software. On the web I found only old posts about problems with NAMD12 and nvidia, as well as with Debian 11 and nvidia, without details and, more importantly, without solutions.

Should that not help, I'll probably decide to remove Debian 11 and install 10, if available, although with this raid1 box that requires care, particularly from people who, like me, do system maintenance only occasionally.
 
Old 01-20-2022, 04:31 AM   #11
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
On further thought, it is quite possible that the issues I described arise specifically with the molecular dynamics simulations with NAMD. Such simulations run correctly (albeit much too slowly) on my vintage Sony Vaio with the same Debian 11, and also (rapidly but at a cost) on the remote cluster.

On this basis, I am now testing the GPUs with glxgears, observing "301 frames in 5.0 seconds = 60.005 FPS"

Does that suggest anything?
Unfortunately glmark2 is not in the Debian repositories. Instead of building it from source, to save time could you suggest another tool for benchmarking the GPUs on Linux?
thanks
 
Old 01-20-2022, 11:33 AM   #12
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
Today a new kernel arrived, 5.10.0-11-amd64, and everything was rebuilt during the upgrade.
However, on trying the same simulation as before with NAMD, I get the same error message

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 2 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 10 times over 0.001123 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 10 times over 0.001123 s on Pe 4 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
 
Old 01-20-2022, 11:48 AM   #13
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
Meanwhile, the nvidia driver remained at 460.91.03 even after reconfiguration.
 
Old 01-20-2022, 07:57 PM   #14
computersavvy
Senior Member
 
Registered: Aug 2016
Posts: 3,345

Rep: Reputation: 1484
Depending upon where you installed the nvidia driver from, its version is not likely to change unless you manually update it. In my experience, the driver downloaded from the nvidia site always has to be manually updated for each changed kernel and each driver version change. Thus if you downloaded and installed the 460.xx driver then unless you download the later driver (470.xx) you will always have the 460.xx driver on your system. This is true even if you recompile/reinstall it with the updated kernel.

The cuda driver also has to match the installed nvidia GPU driver since they work together.
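
The matching check described above can be sketched as a small version comparison. This is only an illustration: the helper names `parse_version` and `same_branch` are made up, the sample string is an assumed example of what the kernel module reports, and the `/proc/driver/nvidia/version` file is assumed to be present only when the proprietary NVIDIA kernel module is loaded. The version to compare against (for example the one printed by nvidia-smi or shown by the package manager) has to be supplied by hand.

```python
import re
from pathlib import Path


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"\b(\d+(?:\.\d+)+)\b", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def same_branch(a, b):
    """True if both versions were parsed and share the same major number."""
    return a is not None and b is not None and a[0] == b[0]


if __name__ == "__main__":
    # Assumed location of the kernel module's version report.
    proc = Path("/proc/driver/nvidia/version")
    if proc.exists():
        loaded = parse_version(proc.read_text())
        wanted = parse_version("460.91.03")  # version quoted in this thread
        print("kernel module:", loaded, "same branch:", same_branch(loaded, wanted))
    else:
        print("no NVIDIA kernel module version file found")
```

Comparing only the major number is a deliberate simplification; a stricter check would require the full tuples to be equal.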
 
1 members found this post helpful.
Old 01-22-2022, 02:50 AM   #15
chiendarret
Member
 
Registered: Mar 2007
Posts: 307

Original Poster
Rep: Reputation: 16
Thanks for your comments. Since all this procedure has been automated by debian, I stopped downloading the nvidia driver and relied on debian to do that.

My interest in this two-680 box is limited because the suite (NAMD) that I use for computations has been only partly ported to GPUs. Forces are still computed by the CPUs. This means that when quantum mechanics comes into play, the mere six cores of this box are insufficient (the minimum for the system I am investigating is 48 cores of a single node of the remote cluster).

Nonetheless, this box was useful in preparing the system for the remote cluster, as only classical mechanics comes into play there. Before trying a new driver/kernel, to save time I would like to carry out the most serious stress test of the GPUs that is possible today on Linux. The test that I posted a few days ago (glxgears, observing "301 frames in 5.0 seconds = 60.005 FPS") probably only tells that the GPUs are there and do their minimum service, but hardly more about their performance in relation to what the classical part of the simulations with NAMD requires.

Could you please suggest one such test?
thanks
 
  


All times are GMT -5. The time now is 09:20 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration