time, a Go program can open the driver library at runtime and bind the symbols it needs.
Conceptually:
Go binary
-> open libcuda.so.1
-> find cuInit
-> find cuDriverGetVersion
-> call CUDA Driver API
Enter fullscreen mode Exit fullscreen mode
The important part is that the binary can still be built like this:
CGO_ENABLED=0 go build ./...
Enter fullscreen mode Exit fullscreen mode
CUDA only needs to exist on the machine where the program actually runs.
Minimal example: initialize CUDA from Go
A small example could look like this:
package main
import (
"fmt"
"github.com/eitamring/gocudrv/cuda"
)
func main() {
driver, err := cuda.Open()
if err != nil {
panic(err)
}
defer driver.Close()
if err := driver.Init(0); err != nil {
panic(err)
}
version, err := driver.DriverVersion()
if err != nil {
panic(err)
}
fmt.Println("CUDA driver version:", version)
}
Enter fullscreen mode Exit fullscreen mode
Example output:
loading CUDA driver: libcuda.so.1
cuInit(0)
DriverVersion() = 12040
CUDA driver version: 12040
Enter fullscreen mode Exit fullscreen mode
This is not ML.
This is the foundation: initialize CUDA, call Driver API functions, then build up to memory management, PTX loading, and kernel launches.
Loading PTX
CUDA kernels can be compiled into PTX.
A simplified PTX kernel might look like this:
.version 7.8
.target sm_75
.address_size 64
.visible .entry vecAdd(
.param .u64 a,
.param .u64 b,
.param .u64 c,
.param .u32 n
)
{
ret;
}
Enter fullscreen mode Exit fullscreen mode
Go can load that PTX module at runtime:
module, err := ctx.LoadPTX(ptxBytes)
if err != nil {
panic(err)
}
kernel, err := module.Function("vecAdd")
if err != nil {
panic(err)
}
fmt.Println("PTX loaded successfully")
Enter fullscreen mode Exit fullscreen mode
Example output:
PTX loaded successfully
Enter fullscreen mode Exit fullscreen mode
Launching a CUDA kernel from Go
Once the PTX is loaded, Go can launch the kernel directly:
err = kernel.Launch(
cuda.GridDim{X: 1024, Y: 1, Z: 1},
cuda.BlockDim{X: 256, Y: 1, Z: 1},
0,
stream,
args,
)
if err != nil {
panic(err)
}
Enter fullscreen mode Exit fullscreen mode
Runtime logs might look like this:
using device 0: NVIDIA GeForce RTX 4090
loading PTX module...
PTX loaded successfully
kernel launch configuration: grid=1024 block=256
launch successful
execution completed
elapsed: 0.186 ms
Enter fullscreen mode Exit fullscreen mode
That is the core idea:
Go -> CUDA Driver API -> GPU
Enter fullscreen mode Exit fullscreen mode
No Python sidecar.
No cgo.
No build-time CUDA toolkit.
Example: vector addition
A simple starter workload is vector addition.
Input:
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
Enter fullscreen mode Exit fullscreen mode
Expected output:
c = [11, 22, 33, 44]
Enter fullscreen mode Exit fullscreen mode
The Go-side flow looks like this:
aDev, err := ctx.MemAlloc(size)
if err != nil {
panic(err)
}
defer aDev.Free()
bDev, err := ctx.MemAlloc(size)
if err != nil {
panic(err)
}
defer bDev.Free()
cDev, err := ctx.MemAlloc(size)
if err != nil {
panic(err)
}
defer cDev.Free()
if err := ctx.MemcpyHtoD(aDev, aHost); err != nil {
panic(err)
}
if err := ctx.MemcpyHtoD(bDev, bHost); err != nil {
panic(err)
}
err = kernel.Launch(
cuda.GridDim{X: blocks},
cuda.BlockDim{X: threads},
0,
stream,
[]cuda.KernelArg{
cuda.DevicePtrArg(aDev),
cuda.DevicePtrArg(bDev),
cuda.DevicePtrArg(cDev),
cuda.Uint32Arg(uint32(n)),
},
)
if err != nil {
panic(err)
}
if err := ctx.MemcpyDtoH(cHost, cDev); err != nil {
panic(err)
}
Enter fullscreen mode Exit fullscreen mode
This is intentionally low-level.
It gives Go access to the same CUDA primitives used elsewhere:
- allocate device memory
- copy host memory to device memory
- load a module
- launch a kernel
- copy device memory back to host memory
- synchronize
What this is useful for
This is not trying to replace PyTorch.
PyTorch is still the right tool for training models and high-level ML research.
The better use cases for CUDA from Go are infrastructure workloads:
- custom inference kernels
- vector search acceleration
- embeddings pipelines
- columnar data processing
- compression/decompression experiments
- image/video processing
- batch scoring
- database or analytics engine experiments
For example:
HTTP request
-> parse vectors in Go
-> copy batch to GPU memory
-> launch similarity kernel
-> copy result back
-> return response
Enter fullscreen mode Exit fullscreen mode
No Python worker required.
The tradeoff
Runtime-loading CUDA is not magic.
You still need:
- an NVIDIA GPU
- a compatible NVIDIA driver
- compiled PTX
- careful memory management
- enough batch size to justify GPU overhead
Small workloads can be slower than CPU code because you pay for:
- memory transfers
- kernel launch overhead
- synchronization
- device setup
A useful rule of thumb:
Tiny batch:
CPU wins
Large batch:
GPU may win
Repeated large batch:
GPU likely wins
Enter fullscreen mode Exit fullscreen mode
The goal is not “GPU everything.”
The goal is to make GPU access available to Go when it actually makes sense.
Why CGO_ENABLED=0 is the interesting part
The technical detail I care about most is not just “Go can call CUDA.”
It is this:
CGO_ENABLED=0 go build ./...
Enter fullscreen mode Exit fullscreen mode
That means you can build without:
- CUDA headers
- the CUDA toolkit
- a C compiler
- cgo
- build-time GPU dependencies
Then, at runtime:
try loading libcuda.so.1
if available:
use GPU path
else:
fall back to CPU path
Enter fullscreen mode Exit fullscreen mode
That fits infrastructure software much better.
The binary does not need to be built on a GPU machine. It only needs the NVIDIA driver on the machine where GPU execution actually happens.
Final thought
AI made GPUs mainstream, but GPUs are not only for AI.
They are throughput machines.
They are useful when you have large batches of repetitive work: matrix math, vector math, scans, filters, hashing, encoding, simulation, image processing, or data conversion.
Go already owns a lot of backend infrastructure.
So it makes sense for Go to have better access to accelerated computing.
Not through a Python sidecar.
Not through a fragile cgo setup.
But through a simple Go API that can load the CUDA Driver API when available.
That is the experiment behind https://github.com/eitamring/gocudrv