GPU Utilization

Benched.ai Editorial Team

GPU utilization refers to the fraction of compute, memory, and interconnect resources on a graphics processing unit that are actively used while serving or training AI workloads. Sustained high utilization lowers cost per token, whereas under-utilized GPUs waste expensive accelerator time.

  Definition and Scope

Most vendors report three orthogonal utilization signals:

  1. SM (Streaming Multiprocessor) busy percentage
  2. Memory controller busy percentage
  3. Tensor core / FP unit occupancy

A balanced workload keeps all three ≥70 %; the sketch below shows how the first two signals can be sampled programmatically.
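
A minimal sampling sketch, assuming the pynvml bindings are installed and at least one NVIDIA GPU is visible; NVML reports the first two signals directly, while tensor core occupancy is not exposed by NVML and requires DCGM's profiling metrics instead.

```python
# Minimal sketch: sample SM and memory-controller utilization via NVML.
# Assumes the `pynvml` package is installed and an NVIDIA GPU is present;
# tensor-core occupancy is not available through NVML.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(5):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu    ~ percent of time at least one kernel was executing (SM busy)
    # util.memory ~ percent of time the memory controller was busy
    print(f"SM busy: {util.gpu}%  memory busy: {util.memory}%")
    time.sleep(1.0)

pynvml.nvmlShutdown()
```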

  Typical Utilization by Workload

| Workload | SM Util | Memory Util | Notes |
|---|---|---|---|
| LLM Inference (batch=1) | 25–40 % | 15–30 % | Latency-optimized, kernel launch gaps |
| LLM Inference (batch=8) | 60–75 % | 45–60 % | Good balance on A100 |
| LLM Training | 90–95 % | 60–85 % | Gradient accumulation hides latency |
| Computer Vision CNN | 70–85 % | 40–60 % | Smaller kernels, cache-friendly |

  Measurement Tools

  • nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
  • NVIDIA DCGM or AMD ROCm SMI for per-process metrics.
  • Prometheus exporters for long-term retention (a minimal polling sketch follows this list).
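
A minimal polling sketch, assuming nvidia-smi is on the PATH; it reuses the query fields from the first bullet and prints per-GPU samples that a custom Prometheus exporter or log pipeline could pick up.

```python
# Minimal sketch: poll nvidia-smi for SM and memory-controller utilization.
# Assumes nvidia-smi is on PATH; the printed samples could feed an exporter.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,utilization.memory",
    "--format=csv,noheader,nounits",
]

def sample():
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    for line in out.stdout.strip().splitlines():
        idx, sm_util, mem_util = (field.strip() for field in line.split(","))
        yield int(idx), int(sm_util), int(mem_util)

if __name__ == "__main__":
    while True:
        for idx, sm_util, mem_util in sample():
            print(f"gpu{idx} sm_busy={sm_util}% mem_busy={mem_util}%")
        time.sleep(15)
```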

  Optimization Techniques

  1. Increase the batch size until the latency SLA or the memory cap is reached.
  2. Fuse small kernels with TensorRT or TorchScript.
  3. Overlap data transfers with compute using CUDA streams (sketched below).
  4. Keep model weights resident in GPU memory to avoid PCIe thrashing.
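
A minimal sketch of technique 3, assuming PyTorch with a CUDA device; `model` and `batches` are hypothetical placeholders for an on-GPU model and an iterable of CPU tensors. The copy of batch N+1 is issued on a side stream while batch N computes on the default stream.

```python
# Minimal sketch: overlap host-to-device copies with compute via CUDA streams.
# Assumes PyTorch with a CUDA device; `model` and `batches` are placeholders.
import torch

def run_overlapped(model, batches):
    copy_stream = torch.cuda.Stream()
    pending = None  # GPU tensor whose copy has been issued but not yet consumed
    outputs = []
    for cpu_batch in batches:
        # Issue the next batch's async H2D copy on a side stream.
        with torch.cuda.stream(copy_stream):
            staged = cpu_batch.pin_memory().to("cuda", non_blocking=True)
        # While that copy is in flight, compute on the previous batch
        # on the default stream.
        if pending is not None:
            outputs.append(model(pending))
        # Default stream must wait for the copy before touching `staged`.
        torch.cuda.current_stream().wait_stream(copy_stream)
        staged.record_stream(torch.cuda.current_stream())
        pending = staged
    if pending is not None:
        outputs.append(model(pending))
    return outputs
```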

  Design Trade-offs

  • Larger batches raise throughput but can increase p99 latency.
  • Kernel fusion reduces flexibility for dynamic shapes.
  • Multi-instance GPU (MIG) partitions improve isolation yet cap peak SM counts.

  Current Trends (2025)

  • Scheduler-aware libraries (e.g., TorchServe v3) auto-merge tenant requests for ≥85 % utilization.
  • Fine-grained GPU time-slicing via NVIDIA MPS 3.0 reduces context-switch overhead to 5 µs.
  • Vendor APIs expose HBM bandwidth counters, enabling mixed-precision planners that match memory pressure to FP8 tensor cores.

  Implementation Tips

  1. Track utilization alongside power draw; efficiency (images/J or tokens/J) is a better KPI than raw utilization alone (see the sketch after this list).
  2. Alert when SM busy stays below 30 % for 10 minutes; this usually indicates a stalled job or a dead model.
  3. Use the NCCL p2pLinkBandwidth metric to verify that multi-GPU links are saturated during distributed training.
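
A minimal sketch of the tokens-per-joule measurement from tip 1, assuming the pynvml bindings; `generate_tokens` is a hypothetical callable that runs the workload and returns the number of tokens produced.

```python
# Minimal sketch: estimate tokens per joule by sampling NVML power draw
# while a generation workload runs. `generate_tokens` is a hypothetical
# placeholder that runs the workload and returns the token count.
import threading
import time
import pynvml

def tokens_per_joule(generate_tokens, device_index=0, sample_s=0.2):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples_mw = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            samples_mw.append(pynvml.nvmlDeviceGetPowerUsage(handle))  # milliwatts
            time.sleep(sample_s)

    thread = threading.Thread(target=sampler)
    start = time.time()
    thread.start()
    try:
        tokens = generate_tokens()
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()
    elapsed = time.time() - start
    avg_watts = sum(samples_mw) / max(len(samples_mw), 1) / 1000.0
    joules = avg_watts * elapsed
    return tokens / joules if joules > 0 else 0.0
```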