Gorgonia comes with CUDA support out of the box. However, usage is specialized.
To use CUDA, you must build your application with the build tag
`cuda`, like so:
go build -tags='cuda' .
Furthermore, there are some additional requirements:
- The CUDA toolkit is required. Installing it provides the `nvcc` compiler, which is required to run your code with CUDA (be sure to follow the post-installation steps).
- `go install gorgonia.org/gorgonia/cmd/cudagen`. This installs the `cudagen` program. `cudagen` will generate the relevant CUDA-related code for Gorgonia. Note that you will need a folder at
- `runtime.LockOSThread()` must be called in the main function where the VM is running. CUDA requires thread affinity, and therefore the OS thread must be locked. Please see this wiki for general information on how to handle this properly within your Go program.
The main reason for such complicated requirements is quite simply performance. As Dave Cheney famously wrote, cgo is not Go. To use CUDA, cgo is unfortunately required, and to use cgo, plenty of tradeoffs need to be made.
Therefore the solution was to nestle the CUDA-related code behind a build tag,
`cuda`. That way, by default, no cgo is used (well, kind of - you could still use cgo).
The reason for requiring the CUDA toolkit and the `cudagen` tool is that there are many CUDA Compute Capabilities, and generating code for all of them would yield a huge binary for no good reason. Rather, users are encouraged to compile for their specific Compute Capabilities.
The reason for requiring an explicit specification of which ops use CUDA is the cost of cgo calls. Work is currently being done to implement batched cgo calls, but until that is done, the solution is a keyhole "upgrade" of certain ops.
Ops supported by CUDA
As of now, only the very basic simple ops support CUDA:
Elementwise unary operations:
inv (reciprocal of a number)
Elementwise binary operations - only arithmetic operations support CUDA:
From a lot of profiling of this author's personal projects, the ops that really matter are basically the activation functions, cube among them. The other operations work fine with MKL+AVX and aren't the major cause of slowness in a neural network.
In a trivial benchmark, careful use of CUDA (in this case, used to call
`sigmoid`) shows impressive improvements over non-CUDA code (bearing in mind the CUDA kernel is extremely naive and not optimized):

BenchmarkOneMilCUDA-8   300    3348711 ns/op
BenchmarkOneMil-8        50   33169036 ns/op
See this tutorial for a complete example.