Temperature sampling controls the randomness of language model decoding by scaling the logits vector before applying softmax. A higher temperature produces more diverse, creative text, while a lower temperature yields more conservative, predictable responses, becoming fully deterministic only at T=0.
Definition and Scope
Given logits z and temperature T, token probabilities are computed as softmax(z / T), i.e., p_i = exp(z_i / T) / Σ_j exp(z_j / T). In the limit T → 0 this reduces to greedy decoding (argmax), while very high T flattens the distribution toward uniform.
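As a concrete illustration, here is a minimal NumPy sketch of this definition (the helper names are ours, not from any particular library):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_with_temperature(logits, temperature, rng=np.random.default_rng()):
    """Sample a token index from logits scaled by the given temperature."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # The T -> 0 limit: all probability mass collapses onto the
        # largest logit, i.e., greedy decoding.
        return int(np.argmax(logits))
    return int(rng.choice(len(logits), p=softmax(logits / temperature)))

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, 0.0))  # always 0 (greedy)
print(sample_with_temperature(logits, 0.7))  # mostly 0, occasionally 1 or 2
print(sample_with_temperature(logits, 5.0))  # close to uniform over 0, 1, 2
```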
Typical Settings
Commonly cited ranges (exact defaults vary by provider and task):
- 0.0–0.3: extraction, code generation, and factual question answering.
- 0.5–0.8: general-purpose chat and summarization.
- 0.9–1.5: brainstorming and creative writing.
Interaction with Other Decoding Parameters
- Top-p (nucleus) sampling: Temperature applies before the cumulative-probability cutoff; adjusting both simultaneously can double-count randomness (see the sketch after this list).
- Repetition penalty: Lower temperatures may require higher penalties to avoid loops.
- Beam search: Standard beam search is deterministic; when sampling is combined with beams, the effective temperature tends toward zero as beam width increases, because high-probability continuations dominate the surviving beams.
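To make the ordering concrete, here is a sketch of nucleus sampling with temperature applied first, as described above (the function name and default values are illustrative):

```python
import numpy as np

def top_p_sample(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng()):
    """Temperature-scale the logits first, then apply the nucleus cutoff."""
    logits = np.asarray(logits, dtype=np.float64)
    # Step 1: temperature scaling, exactly as in softmax(z / T).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Step 2: nucleus cutoff on the already-scaled distribution. Keep the
    # smallest set of tokens whose cumulative mass reaches top_p, including
    # the token that crosses the threshold.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))
```

Because both knobs prune randomness from the tail, a common practice is to tune one while holding the other near its default.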
Performance Metrics
Metrics commonly used when tuning temperature include:
- Diversity: distinct-n and self-BLEU across repeated samples at a fixed T.
- Quality: task accuracy, hallucination rate, and pass@k for code generation.
- Reproducibility: variance of outputs across runs at the same T and seed.
Design Trade-offs
- Low temperature reduces hallucinations but can produce bland, repetitive answers.
- High temperature improves diversity but may break formatting constraints such as JSON or SQL (a retry sketch follows this list).
- Temperature interacts with model size; smaller models typically need lower T to stay on-task.
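One common mitigation for the formatting risk is to sample at the desired temperature and retry colder when validation fails. A sketch, where `generate` is a placeholder for whatever model call you use:

```python
import json

def generate_json(prompt, generate, temperatures=(0.9, 0.5, 0.0)):
    """Retry at progressively lower temperatures until the output parses.

    `generate` is a stand-in with signature (prompt, temperature) -> str.
    """
    last = ""
    for t in temperatures:
        last = generate(prompt, temperature=t)
        try:
            return json.loads(last)
        except json.JSONDecodeError:
            continue  # formatting broke at this temperature; retry colder
    raise ValueError(f"no valid JSON after {len(temperatures)} attempts: {last!r}")
```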
Current Trends (2025)
- Adaptive Temperature: Serving stacks adjust T based on the model's estimated uncertainty, cutting hallucination rates by 18%.
- Per-token Temperature: Apply higher T early in generation and anneal toward 0.2 for closing sentences (a sketch of both ideas follows this list).
- User-controllable Creativity Knobs: Front-end sliders map to temperature and top-p presets.
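A minimal sketch of the first two ideas; the schedule constants and the entropy-based heuristic are illustrative assumptions, not a published recipe:

```python
import numpy as np

def annealed_temperature(step, max_steps, t_start=1.0, t_end=0.2):
    """Per-token schedule: start hot for variety, anneal toward t_end so
    closing sentences are sampled more conservatively."""
    frac = min(step / max(max_steps - 1, 1), 1.0)
    return t_start + (t_end - t_start) * frac

def adaptive_temperature(logits, t_base=0.7, t_min=0.2):
    """Adaptive heuristic: shrink T when the next-token distribution has
    high entropy, trading diversity for reliability on uncertain steps."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    confidence = 1.0 - entropy / np.log(len(probs))  # 1 = peaked, 0 = uniform
    return max(t_min, t_base * confidence)
```

In a decoding loop, the per-step temperature could be the minimum of the two, e.g. `min(annealed_temperature(i, n), adaptive_temperature(logits))`, fed into a sampler like the one shown earlier.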
Implementation Tips
- Never expose raw temperature values to end-users; abstract them into "creative", "balanced", and "precise" presets.
- Log the chosen T alongside each prompt to debug stochastic failures.
- Combine T=0 with a beam width of 5 to recover deterministic yet exploration-aware outputs for code synthesis, as in the sketch below.
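For example, a minimal sketch assuming the Hugging Face transformers generation API; the preset names and values are illustrative, not prescriptive:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Raw decoding values stay server-side; users only see the preset names.
PRESETS = {
    "creative": {"do_sample": True, "temperature": 1.0, "top_p": 0.95},
    "balanced": {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "precise":  {"do_sample": False, "num_beams": 5},  # deterministic beam search
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("def quicksort(arr):", return_tensors="pt")

# "precise" pairs the deterministic T=0 regime (do_sample=False) with a beam
# width of 5: no sampling noise, but the search still explores alternatives.
output = model.generate(
    **inputs,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
    **PRESETS["precise"],
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Recent transformers versions warn when sampling parameters such as temperature are passed with do_sample=False, which is why the "precise" preset omits them.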