Temperature sampling controls the randomness of language model decoding by scaling the logits vector before applying softmax. A higher temperature produces more diverse, creative text, while a lower temperature yields more conservative, predictable responses, becoming fully deterministic only at T=0.
Definition and Scope
Given logits z and temperature T, token probabilities are computed as softmax(z / T), i.e., p_i = exp(z_i / T) / Σ_j exp(z_j / T). In the limit T → 0 this reduces to greedy decoding (argmax), while very high T flattens the distribution toward uniform.
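As a concrete illustration, here is a minimal NumPy sketch of this definition (the helper names are ours, not from any particular library):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_with_temperature(logits, temperature, rng=np.random.default_rng()):
    """Sample a token index from logits scaled by the given temperature."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # The T -> 0 limit: all probability mass collapses onto the
        # largest logit, i.e., greedy decoding.
        return int(np.argmax(logits))
    return int(rng.choice(len(logits), p=softmax(logits / temperature)))

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, 0.0))  # always 0 (greedy)
print(sample_with_temperature(logits, 0.7))  # mostly 0, occasionally 1 or 2
print(sample_with_temperature(logits, 5.0))  # close to uniform over 0, 1, 2
```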
Typical Settings
Commonly cited ranges (exact defaults vary by provider and task):
- 0.0–0.3: extraction, code generation, and factual question answering.
- 0.5–0.8: general-purpose chat and summarization.
- 0.9–1.5: brainstorming and creative writing.
Interaction with Other Decoding Parameters
- Top-p (nucleus) sampling: Temperature applies before the cumulative-probability cutoff; adjusting both simultaneously can double-count randomness (see the sketch after this list).
- Repetition penalty: Lower temperatures may require higher penalties to avoid loops.
- Beam search: Standard beam search is deterministic; when sampling is combined with beams, the effective temperature tends toward zero as beam width increases, because high-probability continuations dominate the surviving beams.
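To make the ordering concrete, here is a sketch of nucleus sampling with temperature applied first, as described above (the function name and default values are illustrative):

```python
import numpy as np

def top_p_sample(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng()):
    """Temperature-scale the logits first, then apply the nucleus cutoff."""
    logits = np.asarray(logits, dtype=np.float64)
    # Step 1: temperature scaling, exactly as in softmax(z / T).
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Step 2: nucleus cutoff on the already-scaled distribution. Keep the
    # smallest set of tokens whose cumulative mass reaches top_p, including
    # the token that crosses the threshold.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))
```

Because both knobs prune randomness from the tail, a common practice is to tune one while holding the other near its default.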
Performance Metrics
Metrics commonly used when tuning temperature include:
- Diversity: distinct-n and self-BLEU across repeated samples at a fixed T.
- Quality: task accuracy, hallucination rate, and pass@k for code generation.
- Reproducibility: variance of outputs across runs at the same T and seed.
Design Trade-offs
- Low temperature reduces hallucinations but can produce bland, repetitive answers.
- High temperature improves diversity but may break formatting constraints such as JSON or SQL (a retry sketch follows this list).
- Temperature interacts with model size; smaller models typically need lower T to stay on-task.
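One common mitigation for the formatting risk is to sample at the desired temperature and retry colder when validation fails. A sketch, where `generate` is a placeholder for whatever model call you use:

```python
import json

def generate_json(prompt, generate, temperatures=(0.9, 0.5, 0.0)):
    """Retry at progressively lower temperatures until the output parses.

    `generate` is a stand-in with signature (prompt, temperature) -> str.
    """
    last = ""
    for t in temperatures:
        last = generate(prompt, temperature=t)
        try:
            return json.loads(last)
        except json.JSONDecodeError:
            continue  # formatting broke at this temperature; retry colder
    raise ValueError(f"no valid JSON after {len(temperatures)} attempts: {last!r}")
```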
Current Trends (2025)
- Adaptive Temperature: Serving stacks adjust T based on the model's estimated uncertainty, cutting hallucination rates by 18%.
- Per-token Temperature: Apply higher T early in generation and anneal toward 0.2 for closing sentences (a sketch of both ideas follows this list).
- User-controllable Creativity Knobs: Front-end sliders map to temperature and top-p presets.
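A minimal sketch of the first two ideas; the schedule constants and the entropy-based heuristic are illustrative assumptions, not a published recipe:

```python
import numpy as np

def annealed_temperature(step, max_steps, t_start=1.0, t_end=0.2):
    """Per-token schedule: start hot for variety, anneal toward t_end so
    closing sentences are sampled more conservatively."""
    frac = min(step / max(max_steps - 1, 1), 1.0)
    return t_start + (t_end - t_start) * frac

def adaptive_temperature(logits, t_base=0.7, t_min=0.2):
    """Adaptive heuristic: shrink T when the next-token distribution has
    high entropy, trading diversity for reliability on uncertain steps."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    confidence = 1.0 - entropy / np.log(len(probs))  # 1 = peaked, 0 = uniform
    return max(t_min, t_base * confidence)
```

In a decoding loop, the per-step temperature could be the minimum of the two, e.g. `min(annealed_temperature(i, n), adaptive_temperature(logits))`, fed into a sampler like the one shown earlier.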
Implementation Tips
- Never expose raw temperature values to end-users; abstract them into "creative", "balanced", and "precise" presets.
- Log the chosen T alongside each prompt to debug stochastic failures.
- Combine T=0 with a beam width of 5 to recover deterministic yet exploration-aware outputs for code synthesis, as in the sketch below.
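For example, a minimal sketch assuming the Hugging Face transformers generation API; the preset names and values are illustrative, not prescriptive:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Raw decoding values stay server-side; users only see the preset names.
PRESETS = {
    "creative": {"do_sample": True, "temperature": 1.0, "top_p": 0.95},
    "balanced": {"do_sample": True, "temperature": 0.7, "top_p": 0.9},
    "precise":  {"do_sample": False, "num_beams": 5},  # deterministic beam search
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("def quicksort(arr):", return_tensors="pt")

# "precise" pairs the deterministic T=0 regime (do_sample=False) with a beam
# width of 5: no sampling noise, but the search still explores alternatives.
output = model.generate(
    **inputs,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,
    **PRESETS["precise"],
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Recent transformers versions warn when sampling parameters such as temperature are passed with do_sample=False, which is why the "precise" preset omits them.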