GPU utilization refers to the fraction of compute, memory, and interconnect resources on a graphics processing unit that are actively used while serving or training AI workloads. Sustained high utilization lowers cost per token, whereas under-utilized GPUs waste expensive accelerator time.
Definition and Scope
Most vendors report three orthogonal utilization signals:
- SM (Streaming Multiprocessor) busy percentage
- Memory controller busy percentage
- Tensor core / FP unit occupancy
A balanced workload keeps all three signals ≥70 %.
Typical Utilization by Workload
Measurement Tools
- Quick spot checks: nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
- NVIDIA DCGM or AMD ROCm SMI for per-process metrics.
- Prometheus exporters for long-term retention.
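For programmatic sampling, the sketch below uses the NVML Python bindings (nvidia-ml-py, imported as pynvml), which read the same counters nvidia-smi reports; the device index and polling interval are illustrative, not prescriptive.

```python
# Minimal sketch: poll GPU utilization via NVML (the same source nvidia-smi reads).
# Assumes nvidia-ml-py is installed; device index and interval are illustrative.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0

try:
    for _ in range(10):                                # ten one-second samples
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"SM busy: {util.gpu}%  "
              f"memory busy: {util.memory}%  "
              f"HBM used: {mem.used / mem.total:.0%}")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```

NVML covers the first two signals listed above; tensor-core activity generally comes from DCGM profiling fields such as DCGM_FI_PROF_PIPE_TENSOR_ACTIVE.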
Optimization Techniques
- Increase batch size until the latency SLA or memory capacity is reached.
- Fuse small kernels with TensorRT or TorchScript (see the TorchScript sketch after this list).
- Overlap data transfers with compute using CUDA streams (see the prefetch sketch after this list).
- Keep model weights resident in GPU memory to avoid repeated transfers over PCIe.
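The kernel-fusion bullet can be illustrated with TorchScript. The sketch below scripts a chain of small elementwise ops so the JIT fuser can combine them; the function and shapes are illustrative, and whether fusion actually occurs depends on the PyTorch version and fuser settings.

```python
# Minimal sketch: script a chain of small elementwise ops so the JIT fuser can
# merge them into fewer kernel launches. Function and shapes are illustrative only.
import torch

def bias_gelu_dropout(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # In eager mode these three elementwise ops launch as separate kernels.
    y = x + bias
    y = torch.nn.functional.gelu(y)
    return torch.nn.functional.dropout(y, p=0.1, training=True)

fused = torch.jit.script(bias_gelu_dropout)

x = torch.randn(64, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
out = fused(x, bias)    # subsequent calls reuse the scripted (and possibly fused) graph
```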
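For the copy/compute overlap bullet, a minimal PyTorch sketch that prefetches the next batch on a side CUDA stream while the default stream computes; the model, batch shapes, and host data are placeholders.

```python
# Minimal sketch: overlap host-to-device copies with compute via a side CUDA stream.
# Assumes a CUDA device; the linear model and batch shapes are placeholders.
import torch

device = torch.device("cuda")
model = torch.nn.Linear(4096, 4096).to(device)
copy_stream = torch.cuda.Stream()

# Pinned host memory is required for truly asynchronous copies.
cpu_batches = [torch.randn(64, 4096).pin_memory() for _ in range(8)]

current = None
for cpu_batch in cpu_batches:
    # Stage the next batch on the side stream while the default stream is busy.
    with torch.cuda.stream(copy_stream):
        staged = cpu_batch.to(device, non_blocking=True)

    if current is not None:
        out = model(current)                         # compute on the default stream

    # The default stream must wait for the copy before touching the staged batch.
    torch.cuda.current_stream().wait_stream(copy_stream)
    staged.record_stream(torch.cuda.current_stream())
    current = staged

out = model(current)                                 # last staged batch
torch.cuda.synchronize()
```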
Design Trade-offs
- Larger batches raise throughput but can increase p99 latency (see the sweep sketch after this list).
- Kernel fusion reduces flexibility for dynamic shapes.
- Multi-Instance GPU (MIG) partitions improve isolation but cap the number of SMs available to any single workload.
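The throughput-versus-p99 trade-off in the first bullet can be measured directly. The sketch below sweeps batch size on a toy model; the model, batch sizes, and iteration count are illustrative, and real numbers depend on the workload and GPU.

```python
# Minimal sketch: sweep batch size and report throughput vs. p99 latency on a toy model.
import time
import torch

device = torch.device("cuda")
model = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 2048)
).to(device).eval()

@torch.inference_mode()
def bench(batch_size: int, iters: int = 50):
    x = torch.randn(batch_size, 2048, device=device)
    model(x)                                   # warm-up to exclude one-time init costs
    latencies = []
    for _ in range(iters):
        torch.cuda.synchronize()
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p99 = latencies[int(0.99 * (len(latencies) - 1))]
    throughput = batch_size / (sum(latencies) / len(latencies))
    return throughput, p99

for bs in (1, 8, 32, 128):
    tput, p99 = bench(bs)
    print(f"batch={bs:4d}  samples/s={tput:10.1f}  p99={p99 * 1e3:6.2f} ms")
```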
Current Trends (2025)
- Scheduler-aware libraries (e.g., TorchServe v3) auto-merge tenant requests for ≥85 % utilization.
- Fine-grained GPU time-slicing via NVIDIA MPS 3.0 reduces context-switch overhead to 5 µs.
- Vendor APIs expose HBM bandwidth counters, enabling mixed-precision planners that match memory pressure to FP8 tensor cores.
Implementation Tips
- Track utilization alongside power draw; efficiency (images/J or tokens/J) is a better KPI than raw utilization alone (see the sketch after this list).
- Alert when SM busy stays below 30 % for 10 minutes; this usually indicates a stalled job or an idle model server.
- Verify that multi-GPU links are saturated during distributed training, e.g. with DCGM's NVLink bandwidth counters or the bus bandwidth reported by nccl-tests.
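As a sketch of the tokens/J KPI from the first tip, the loop below divides tokens produced by energy drawn from NVML's total-energy counter. It assumes nvidia-ml-py and a Volta-or-newer GPU that exposes the counter; run_inference() and TOKENS_PER_BATCH are hypothetical placeholders for the real workload.

```python
# Minimal sketch: tokens per joule for an inference loop, via NVML's total-energy counter.
import pynvml

TOKENS_PER_BATCH = 2048          # hypothetical: tokens produced per run_inference() call

def run_inference():
    """Hypothetical placeholder for one batch of real model work on the GPU."""
    pass

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)  # millijoules since driver load
tokens = 0
for _ in range(100):
    run_inference()
    tokens += TOKENS_PER_BATCH
end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

joules = (end_mj - start_mj) / 1000.0
if joules > 0:                   # guard: a no-op workload may not register any energy
    print(f"tokens/J: {tokens / joules:.1f}")
pynvml.nvmlShutdown()
```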