I need to install the NVIDIA and CUDA drivers on an Oracle Cloud VM (RHEL 7.7) for use with TensorFlow 1.14.
The best install guide I could find was for RHEL 7.6 (see Bob Kozdemba's post):
https://access.redhat.com/discussions/3672301
Here are the steps from the post:
Code:
# yum -y update
# reboot
# yum -y install kernel-devel-$(uname -r) kernel-headers-$(uname -r) pciutils
# yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# yum -y install https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.0.130-1.x86_64.rpm
# yum clean all
# yum -y install cuda
I rebooted after following those steps, but running nvidia-smi gives the infamous error: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running".
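That error usually means the nvidia kernel module is not loaded. A minimal sketch of a first check follows. The helper name nvidia_module_loaded is hypothetical, and it only reads a listing in the /proc/modules format, where the module name is the first field.

```shell
# Hypothetical helper: succeeds if a module named exactly "nvidia" appears in
# the given listing (defaults to /proc/modules; first field is the module name).
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }' "${1:-/proc/modules}"
}

if nvidia_module_loaded 2>/dev/null; then
    echo "nvidia module is loaded"
else
    echo "nvidia module is not loaded"
    echo "next steps to try as root: modprobe nvidia; dmesg | tail; dkms status"
fi
```

If modprobe reports that the module cannot be found, the module was probably never built for the running kernel, which points at the kernel headers rather than at the GPU itself.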
Here are the full server details:
Code:
Oracle Linux Server release 7.7
NAME="Oracle Linux Server"
VERSION="7.7"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.7"
PRETTY_NAME="Oracle Linux Server 7.7"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:7:7:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://bugzilla.oracle.com/"
ORACLE_BUGZILLA_PRODUCT="Oracle Linux 7"
ORACLE_BUGZILLA_PRODUCT_VERSION=7.7
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=7.7
Red Hat Enterprise Linux Server release 7.7 (Maipo)
Oracle Linux Server release 7.7
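Since the release files identify the system as Oracle Linux, one thing worth ruling out is a kernel/headers mismatch. Oracle Linux images commonly boot the Unbreakable Enterprise Kernel (UEK), whose development package is named kernel-uek-devel rather than kernel-devel. The sketch below assumes that UEK release strings contain "uek"; the helper name devel_pkg_for and the version string in the comment are illustrative only, not taken from this VM.

```shell
# Hypothetical helper: map a kernel release string to the devel package name.
# Assumption: UEK releases contain "uek", e.g. 4.14.35-1902.3.2.el7uek.x86_64.
devel_pkg_for() {
    case "$1" in
        *uek*) echo "kernel-uek-devel-$1" ;;
        *)     echo "kernel-devel-$1" ;;
    esac
}

pkg="$(devel_pkg_for "$(uname -r)")"
echo "running kernel: $(uname -r)"
echo "devel package needed: $pkg"
if command -v rpm >/dev/null 2>&1; then
    # rpm -q exits non-zero when the package is not installed
    rpm -q "$pkg" || echo "try: yum -y install $pkg"
fi
```

If the package turns out to be missing, installing it and then rebuilding the driver module (or reinstalling the cuda package) would be the next thing to try.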
When I do:
Code:
lspci | grep -i nvidia
I get this:
Code:
00:04.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 SXM2 16GB] (rev a1)
When I do:
Code:
lshw -numeric -C display
I get this (note display:1 UNCLAIMED):
Code:
*-display:0
description: VGA compatible controller
product: [1234:1111]
vendor: [1234]
physical id: 2
bus info: pci@0000:00:02.0
version: 02
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master rom
configuration: driver=bochs-drm latency=0
resources: irq:0 memory:c0000000-c0ffffff memory:c2001000-c2001fff memory:c0000-dffff
*-display:1 UNCLAIMED
description: 3D controller
product: GP100GL [Tesla P100 SXM2 16GB] [10DE:15F9]
vendor: NVIDIA Corporation [10DE]
physical id: 4
bus info: pci@0000:00:04.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list
configuration: latency=0
resources: iomemory:200-1ff iomemory:240-23f memory:c1000000-c1ffffff memory:2000000000-23ffffffff memory:2400000000-2401ffffff
When I do:
I get this:
Code:
00:00.0 Host bridge [0600]: Intel Corporation 440FX - 82441FX PMC [Natoma] [8086:1237] (rev 02)
Subsystem: Red Hat, Inc. Qemu virtual machine [1af4:1100]
00:01.0 ISA bridge [0601]: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] [8086:7000]
Subsystem: Red Hat, Inc. Qemu virtual machine [1af4:1100]
00:01.1 IDE interface [0101]: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] [8086:7010]
Subsystem: Red Hat, Inc. Qemu virtual machine [1af4:1100]
Kernel driver in use: ata_piix
Kernel modules: ata_piix, pata_acpi, ata_generic
00:01.2 USB controller [0c03]: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] [8086:7020] (rev 01)
Subsystem: Red Hat, Inc. QEMU Virtual Machine [1af4:1100]
Kernel driver in use: uhci_hcd
00:01.3 Bridge [0680]: Intel Corporation 82371AB/EB/MB PIIX4 ACPI [8086:7113] (rev 03)
Subsystem: Red Hat, Inc. Qemu virtual machine [1af4:1100]
Kernel driver in use: piix4_smbus
Kernel modules: i2c_piix4
00:02.0 VGA compatible controller [0300]: Device [1234:1111] (rev 02)
Subsystem: Red Hat, Inc. Device [1af4:1100]
Kernel driver in use: bochs-drm
Kernel modules: bochs_drm
00:03.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme-E Ethernet Virtual Function [14e4:16dc]
Subsystem: Oracle/SUN Device [108e:16d7]
Kernel driver in use: bnxt_en
Kernel modules: bnxt_en
00:04.0 3D controller [0302]: NVIDIA Corporation GP100GL [Tesla P100 SXM2 16GB] [10de:15f9] (rev a1)
Subsystem: NVIDIA Corporation Device [10de:116b]
Kernel modules: nvidiafb, nouveau
00:05.0 SCSI storage controller [0100]: Red Hat, Inc. Virtio SCSI [1af4:1004]
Subsystem: Oracle/SUN Device [108e:0008]
Kernel driver in use: virtio-pci
Kernel modules: virtio_pci
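In the listing above, the NVIDIA device at 00:04.0 has no "Kernel driver in use" line, and its candidate modules are only nvidiafb and nouveau. The sketch below pulls that fact out of lspci -nnk style output. The function name nvidia_bound_driver is hypothetical, and it assumes NVIDIA devices carry the vendor ID 10de in square brackets, as they do in this listing.

```shell
# Hypothetical helper: read `lspci -nnk` output on stdin and print the driver
# bound to each NVIDIA device (vendor ID 10de), or "none" if nothing is bound.
nvidia_bound_driver() {
    awk '
        # A line starting with a PCI slot (optionally domain-prefixed) opens a device
        /^([0-9a-f]+:)?[0-9a-f]+:[0-9a-f]+\.[0-9a-f] / { in_nvidia = ($0 ~ /\[10de:/) }
        in_nvidia && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use:[ \t]*/, "")
            print
            found = 1
        }
        END { if (!found) print "none" }
    '
}

if command -v lspci >/dev/null 2>&1; then
    lspci -nnk | nvidia_bound_driver
fi
```

A result of "none" matches the UNCLAIMED status from lshw. Once the proprietary driver is working, the expected result would be "nvidia".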
I've seen some posts about the UNCLAIMED issue that suggest disabling Secure Boot, but I'm not sure whether that is possible (or even present) on this VM.
Any help would be much appreciated, as I've hit a dead end with this.
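Whether Secure Boot is even in play can be checked from inside the guest. Secure Boot requires UEFI firmware, so a system booted in legacy BIOS mode cannot have it enabled. This is a rough sketch: the helper name boot_mode is hypothetical, and mokutil may not be installed on this image.

```shell
# Hypothetical helper: report firmware mode from the presence of the EFI
# directory in sysfs (path can be overridden for testing).
boot_mode() {
    if [ -d "${1:-/sys/firmware/efi}" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

echo "boot mode: $(boot_mode)"
if [ "$(boot_mode)" = "UEFI" ] && command -v mokutil >/dev/null 2>&1; then
    # Prints whether Secure Boot is enabled or disabled
    mokutil --sb-state
fi
```

If this reports BIOS, Secure Boot can be ruled out as the cause of the UNCLAIMED device.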