GPU utilization refers to the fraction of compute, memory, and interconnect resources on a graphics processing unit that are actively used while serving or training AI workloads. Sustained high utilization lowers cost per token, whereas under-utilized GPUs waste expensive accelerator time.
Definition and Scope
Most vendors report three orthogonal utilization signals:
- SM (Streaming Multiprocessor) busy percentage
- Memory controller busy percentage
- Tensor core / FP unit occupancy
A balanced workload keeps all three signals ≥70 %.
Typical Utilization by Workload
Measurement Tools
- Quick spot checks: nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
- NVIDIA DCGM or AMD ROCm SMI for per-process metrics.
- Prometheus exporters for long-term retention.
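For programmatic sampling, the sketch below uses the NVML Python bindings (nvidia-ml-py, imported as pynvml), which read the same counters nvidia-smi reports; the device index and polling interval are illustrative, not prescriptive.

```python
# Minimal sketch: poll GPU utilization via NVML (the same source nvidia-smi reads).
# Assumes nvidia-ml-py is installed; device index and interval are illustrative.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0

try:
    for _ in range(10):                                # ten one-second samples
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"SM busy: {util.gpu}%  "
              f"memory busy: {util.memory}%  "
              f"HBM used: {mem.used / mem.total:.0%}")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```

NVML covers the first two signals listed above; tensor-core activity generally comes from DCGM profiling fields such as DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.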
Optimization Techniques
- Increase batch size until the latency SLA or memory capacity is reached.
- Fuse small kernels with TensorRT or TorchScript (see the TorchScript sketch after this list).
- Overlap data transfers with compute using CUDA streams (see the prefetch sketch after this list).
- Keep model weights resident in GPU memory to avoid repeated transfers over PCIe.
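The kernel-fusion bullet can be illustrated with TorchScript. The sketch below scripts a chain of small elementwise ops so the JIT fuser can combine them; the function and shapes are illustrative, and whether fusion actually occurs depends on the PyTorch version and fuser settings.

```python
# Minimal sketch: script a chain of small elementwise ops so the JIT fuser can
# merge them into fewer kernel launches. Function and shapes are illustrative only.
import torch

def bias_gelu_dropout(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # In eager mode these three elementwise ops launch as separate kernels.
    y = x + bias
    y = torch.nn.functional.gelu(y)
    return torch.nn.functional.dropout(y, p=0.1, training=True)

fused = torch.jit.script(bias_gelu_dropout)

x = torch.randn(64, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused(x, bias)    # subsequent calls reuse the scripted (and possibly fused) graph
```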
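For the copy/compute overlap bullet, a minimal PyTorch sketch that prefetches the next batch on a side CUDA stream while the default stream computes; the model, batch shapes, and host data are placeholders.

```python
# Minimal sketch: overlap host-to-device copies with compute via a side CUDA stream.
# Assumes a CUDA device; the linear model and batch shapes are placeholders.
import torch

device = torch.device("cuda")
model = torch.nn.Linear(4096, 4096).to(device)
copy_stream = torch.cuda.Stream()

# Pinned host memory is required for truly asynchronous copies.
cpu_batches = [torch.randn(64, 4096).pin_memory() for _ in range(8)]

current = None
for cpu_batch in cpu_batches:
    # Stage the next batch on the side stream while the default stream is busy.
    with torch.cuda.stream(copy_stream):
        staged = cpu_batch.to(device, non_blocking=True)

    if current is not None:
        out = model(current)                         # compute on the default stream

    # The default stream must wait for the copy before touching the staged batch.
    torch.cuda.current_stream().wait_stream(copy_stream)
    staged.record_stream(torch.cuda.current_stream())
    current = staged

out = model(current)                                 # last staged batch
torch.cuda.synchronize()
```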
Design Trade-offs
- Larger batches raise throughput but can increase p99 latency (see the sweep sketch after this list).
- Kernel fusion reduces flexibility for dynamic shapes.
- Multi-Instance GPU (MIG) partitions improve isolation but cap the number of SMs available to any single workload.
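The throughput-versus-p99 trade-off in the first bullet can be measured directly. The sketch below sweeps batch size on a toy model; the model, batch sizes, and iteration count are illustrative, and real numbers depend on the workload and GPU.

```python
# Minimal sketch: sweep batch size and report throughput vs. p99 latency on a toy model.
import time
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 2048)
).to(device).eval()

@torch.inference_mode()
def bench(batch_size: int, iters: int = 50):
    x = torch.randn(batch_size, 2048, device=device)
    model(x)                                   # warm-up to exclude one-time init costs
    latencies = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    throughput = batch_size / (sum(latencies) / len(latencies))
    return throughput, p99

for bs in (1, 8, 32, 128):
    tput, p99 = bench(bs)
    print(f"batch={bs:4d}  samples/s={tput:10.1f}  p99={p99 * 1e3:6.2f} ms")
```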
Current Trends (2025)
- Scheduler-aware libraries (e.g., TorchServe v3) auto-merge tenant requests for ≥85 % utilization.
- Fine-grained GPU time-slicing via NVIDIA MPS 3.0 reduces context-switch overhead to 5 µs.
- Vendor APIs expose HBM bandwidth counters, enabling mixed-precision planners that match memory pressure to FP8 tensor cores.
Implementation Tips
- Track utilization alongside power draw; efficiency (images/J or tokens/J) is a better KPI than raw utilization alone (see the sketch after this list).
- Alert when SM busy stays below 30 % for 10 minutes; this usually indicates a stalled job or an idle model server.
- Verify that multi-GPU links are saturated during distributed training, e.g. with DCGM's NVLink bandwidth counters or the bus bandwidth reported by nccl-tests.
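As a sketch of the tokens/J KPI from the first tip, the loop below divides tokens produced by energy drawn from NVML's total-energy counter. It assumes nvidia-ml-py and a Volta-or-newer GPU that exposes the counter; run_inference() and TOKENS_PER_BATCH are hypothetical placeholders for the real workload.

```python
# Minimal sketch: tokens per joule for an inference loop, via NVML's total-energy counter.
import pynvml

TOKENS_PER_BATCH = 2048          # hypothetical: tokens produced per run_inference() call

def run_inference():
    """Hypothetical placeholder for one batch of real model work on the GPU."""
    pass

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # millijoules since driver load
tokens = 0
for _ in range(100):
    run_inference()
    tokens += TOKENS_PER_BATCH
end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

joules = (end_mj - start_mj) / 1000.0
if joules > 0:                   # guard: a no-op workload may not register any energy
    print(f"tokens/J: {tokens / joules:.1f}")
pynvml.nvmlShutdown()
```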