This document is a running list of troubleshooting TODOs. Should you run into issues with GPU usage, this document should help.
The cu package ships with an application called cudatest, which is helpful in troubleshooting issues.
To install cudatest, run:
go install gorgonia.org/cu/cmd/cudatest
This assumes that you have already installed CUDA and cuDNN.
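A common stumbling block after go install is that the binary lands in $GOBIN (or $GOPATH/bin), which may not be on your PATH. The following is a minimal sketch of a check for that, using only the Go standard library. The helper name findTool is hypothetical and is not part of the cu package.

```go
package main

import (
	"fmt"
	"os/exec"
)

// findTool reports where an executable lives on PATH, if anywhere.
// (Hypothetical helper for illustration; not part of gorgonia.org/cu.)
func findTool(name string) (string, bool) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", false
	}
	return path, true
}

func main() {
	// The tools this document relies on.
	for _, tool := range []string{"cudatest", "nvidia-smi"} {
		if path, ok := findTool(tool); ok {
			fmt.Printf("%s found at %s\n", tool, path)
		} else {
			fmt.Printf("%s not found on PATH\n", tool)
		}
	}
}
```

If cudatest is reported as missing, adding the Go binary directory to PATH is usually enough.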
If you are running multiple GPUs, you might run into a message that looks as follows:
Error in initialization, please refer to "https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__INITIALIZE.html"
This usually means that one of your GPUs does not support CUDA. You can still use CUDA as long as at least one of your GPUs supports it.
First, use nvidia-smi to find the running GPUs. An example is provided below:
Thu Jul 16 17:41:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P8    16W / 235W |      0MiB /  5700MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 1030     On   | 00000000:07:00.0  On |                  N/A |
| 35%   33C    P0    N/A /  30W |    656MiB /  1994MiB |     51%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A      XXXX      G   /usr/lib/xorg/Xorg                270MiB |
|    1   N/A  N/A      XXXX      G   /usr/bin/PROGRAMNAME               77MiB |
|    1   N/A  N/A      XXXX      G   /usr/bin/PROGRAMNAME               68MiB |
|    1   N/A  N/A      XXXX      G   ...AAAAAAAAA= --shared-files      221MiB |
+-----------------------------------------------------------------------------+
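Here, we see that there are two GPUs: a Tesla K20Xm (GPU 0) and a GeForce GT 1030 (GPU 1).
The GeForce GT 1030 does not support CUDA, while the Tesla K20Xm does. To remedy this, set the CUDA_VISIBLE_DEVICES environment variable so that only the CUDA-capable device (GPU 0 here) is visible to the program: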
CUDA_VISIBLE_DEVICES=0 cudatest
Something like the following should be returned:
$ CUDA_VISIBLE_DEVICES=0 cudatest
CUDA version: 11000
CUDA devices: 1
Device 0
========
Name : "Tesla K20Xm"
Clock Rate: 732000 kHz
Memory : 5977800704 bytes
Compute : 3.5