Hello
I am back with a problem that I presented months ago. I have now tried to isolate each GPU, as illustrated below. The problem remains, and I got no clues from the user portal of the software (NAMD) used for these molecular dynamics (MD) simulations.
My computer has a GA-X79-UD3 main board with two 680 GPUs and runs
Debian 10 Linux,
$ uname -r
5.10.0-19-amd64
ADDENDUM
After that, I cleaned the little dust from the inside of the computer, removed both GPUs, discarded the one that was on top, and put the one that was initially below in its place. With only this GPU, I get the same error. On the other hand, it was already clear that this is a software error.
Thanks for considering this issue
fp
CUDA driver version: 470.141.03 CUDA Version: 11.4
Software for MD: NAMD_Git-2022-07-21_Linux-x86_64-multicore-CUDA
I can no longer run namd-CUDA with the same commands that worked one month ago. New Linux kernels and CUDA versions installed in the meantime did not solve the issue.
How the MD runs are launched:
Preceded by:
nvidia-smi -pm 1 (to enable persistence mode on the GPUs)
Error using both GPUs:
command to run MD: namd2 +idlepoll +p12 +devices 0,1 min.conf
reported error:
Quote:
TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
Error using GPU 0:
namd2 +idlepoll +p12 +devices 0 min.conf
Quote:
TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 673 times over 0.077770 s on Pe 8 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
Error using GPU 1:
namd2 +idlepoll +p12 +devices 1 min.conf
Quote:
TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function sortTileLists, line 1577
on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 671 times over 0.077836 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
A GPU hardware (memory) failure seems unlikely to me, because both GPUs report the same error.
However, I have been unable to trace the origin of the error.
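To make the three runs easier to compare, the distinct failures can be pulled out of saved logs. This is only a minimal sketch, assuming the output of each run was redirected to a file (for example by appending "> gpu0.log 2>&1" to the namd2 command). The log file names and the two helper functions are hypothetical and not part of NAMD.

```shell
#!/bin/sh
# Helpers for comparing the CUDA failures in saved NAMD logs.
# Assumption: each run was saved to its own file, e.g.
#   namd2 +idlepoll +p12 +devices 0 min.conf > gpu0.log 2>&1
# (gpu0.log and the function names below are hypothetical).

summarize_fatal() {
    # Print each distinct "FATAL ERROR" line once, prefixed by its count.
    grep 'FATAL ERROR' "$1" | sort | uniq -c
}

failing_functions() {
    # List the names that follow "function " in the error lines, one per line.
    # Lines without a "function NAME," part (such as the
    # ComputeBondedCUDA::forceDoneCheck messages) are skipped.
    sed -n 's/.*function \([A-Za-z_][A-Za-z0-9_]*\),.*/\1/p' "$1" | sort -u
}
```

With the logs above, failing_functions would list buildTileLists for the two-GPU run and sortTileLists for each single-GPU run, which makes it quick to see that both devices fail in the same tile-list code.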
Thanks for any advice
francesco pietra