LinuxQuestions.org


chiendarret 11-18-2022 03:56 AM

CUDA error cudaStreamSynchronize(stream) and CUDA error in ComputeBondedCUDA
 
Hello,
I am back with a problem that I presented here months ago. I have now tried to isolate each GPU, as illustrated below. The problem remains, and I got no clues from the user portal of the software (NAMD) used for these molecular dynamics (MD) simulations.

My computer: GA-X79-UD3 main board with two 680 GPUs, running Debian 10 Linux:
$ uname -r
5.10.0-19-amd64


ADDENDUM
After posting, I cleaned the small amount of dust from the inside of the computer, removed both GPUs, discarded the one that had been in the upper slot, and put the one that had been in the lower slot in its place. With only this GPU installed, I get the same error. In any case, it was already clear that this is a software error.
Thanks for considering this issue.
fp
CUDA driver version: 470.141.03 CUDA Version: 11.4

Software for MD: NAMD_Git-2022-07-21_Linux-x86_64-multicore-CUDA

I can no longer run NAMD-CUDA with the same commands that worked one month ago. New Linux kernels and CUDA versions installed in the meantime did not solve the issue.

How the MD runs are launched:

Each run is preceded by:
nvidia-smi -pm 1 (to enable persistence mode on the GPUs)
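
The launch procedure used in this post (persistence mode first, then one NAMD run per device selection) can be sketched in Python as follows. This is only an illustrative sketch: the helper names `build_namd_command` and `launch_plan` are hypothetical, and the binary name, flags, thread count, and config file are copied from the commands quoted in this post. The script only prints the commands; it does not run them.

```python
# Values copied from the commands quoted in this post.
NAMD_BINARY = "namd2"
CONFIG_FILE = "min.conf"
CPU_THREADS = 12


def build_namd_command(devices):
    """Return the argument list for one NAMD run limited to the given CUDA devices.

    Hypothetical helper, written only to illustrate the three test runs.
    """
    device_arg = ",".join(str(d) for d in devices)
    return [NAMD_BINARY, "+idlepoll", f"+p{CPU_THREADS}",
            "+devices", device_arg, CONFIG_FILE]


def launch_plan():
    """Return the persistence-mode command followed by the three test runs."""
    plan = [["nvidia-smi", "-pm", "1"]]
    # Both GPUs together, then each GPU in isolation.
    for devices in ([0, 1], [0], [1]):
        plan.append(build_namd_command(devices))
    return plan


if __name__ == "__main__":
    for argv in launch_plan():
        print(" ".join(argv))
```

Keeping every flag except `+devices` fixed means the three runs differ only in which GPU is used.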

Error using both GPUs:

command to run MD: namd2 +idlepoll +p12 +devices 0,1 min.conf
reported error:

Quote:

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program

Error using GPU 0:

namd2 +idlepoll +p12 +devices 0 min.conf

Quote:

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered

Error using GPU 1:

namd2 +idlepoll +p12 +devices 1 min.conf

Quote:

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered

A GPU hardware (memory) failure seems unlikely to me, because both GPUs report the same error.
However, I have been unable to trace the origin of the error.
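
To make the comparison between the two GPUs concrete, here is a minimal Python sketch that tabulates the cudaStreamSynchronize errors per device. It is not part of NAMD: the function name `summarize_sync_errors` and the regular expression are assumptions, written only against the message format quoted above, and the sample text is copied from the single-GPU runs in this post. It does not diagnose the cause; it only checks whether each device reports the same function, line, and message.

```python
import re
from collections import defaultdict

# Matches the two-line cudaStreamSynchronize message as it appears in this post.
# The pattern is an assumption based only on the quoted output.
SYNC_ERROR = re.compile(
    r"FATAL ERROR: CUDA error (?P<call>\S+) in file (?P<file>\S+), "
    r"function (?P<function>\w+), line (?P<line>\d+)\s+"
    r"on Pe \d+ \(\S+ device (?P<device>\d+) pci (?P<pci>[\d:]+)\): "
    r"(?P<message>.+)"
)

# Lines copied from the GPU 0 and GPU 1 runs quoted above.
SAMPLE_LOG = """\
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
"""


def summarize_sync_errors(log_text):
    """Map each device index to the set of (function, line, message) it reported."""
    summary = defaultdict(set)
    for match in SYNC_ERROR.finditer(log_text):
        entry = (match["function"], int(match["line"]), match["message"].strip())
        summary[int(match["device"])].add(entry)
    return dict(summary)


if __name__ == "__main__":
    summary = summarize_sync_errors(SAMPLE_LOG)
    for device, errors in sorted(summary.items()):
        print(device, sorted(errors))
    print("same errors on every device:", len({frozenset(e) for e in summary.values()}) == 1)
```

The ComputeBondedCUDA::forceDoneCheck lines are deliberately skipped by this pattern, since they carry no file, function, or line number to compare.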
Thanks for any advice.
francesco pietra


All times are GMT -5.