Temperature Sampling

Benched.ai Editorial Team

Temperature sampling controls the randomness of language-model decoding by dividing the logits by a scalar temperature before applying softmax. Higher temperatures produce more diverse, creative text, while lower temperatures yield more focused, conservative responses.

  Definition and Scope

Given logits z and temperature T, token probabilities are computed as p_i = exp(z_i / T) / Σ_j exp(z_j / T), i.e., softmax(z / T). As T → 0 the distribution collapses onto the highest-logit token, which is why T=0 is conventionally implemented as greedy decoding; very high T flattens the distribution toward uniform.
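
A minimal sketch of this computation, assuming NumPy (the function name and greedy fallback are illustrative):

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, T: float,
                            rng: np.random.Generator = np.random.default_rng()) -> int:
    """Sample a token id from softmax(logits / T)."""
    if T == 0.0:
        # Limit case: greedy decoding (argmax).
        return int(np.argmax(logits))
    scaled = logits / T
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```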

  Typical Settings

Use Case                           Temperature   Notes
Fact extraction, code generation   0.0–0.2       Maximizes reproducibility
Conversational assistant           0.5–0.7       Balances creativity and accuracy
Brainstorming, fiction             0.8–1.2       Encourages novelty; watch for incoherence
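
For example, a reproducibility-sensitive extraction task would sit at the low end of the table. A minimal sketch assuming the OpenAI Python SDK (the model name and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Low temperature for a reproducibility-sensitive extraction task.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Extract all dates from: ..."}],
    temperature=0.1,
)
print(response.choices[0].message.content)
```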

  Interaction with Other Decoding Parameters

  • Top-p (nucleus) sampling: Temperature applies before the cumulative-probability cutoff; adjusting both simultaneously can double-count randomness (see the sketch after this list).
  • Repetition penalty: Lower temperatures may require higher penalties to avoid loops.
  • Beam search: Effective temperature tends toward zero as beam width increases.
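
A minimal sketch of that ordering, assuming NumPy (names are illustrative): temperature rescales the distribution first, and the nucleus cutoff is then applied to the already-tempered probabilities.

```python
import numpy as np

def sample_temperature_top_p(logits: np.ndarray, T: float = 0.7, top_p: float = 0.9,
                             rng: np.random.Generator = np.random.default_rng()) -> int:
    # Step 1: temperature scaling, then softmax.
    scaled = logits / T
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Step 2: nucleus cutoff on the already-tempered distribution.
    order = np.argsort(probs)[::-1]                  # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # smallest nucleus covering top_p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```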

  Performance Metrics

Metric                     Observation
Entropy per token          Increases roughly linearly with T up to 1.0 [1]
Pass-key accuracy          Drops sharply above T=0.7 on logic tasks [2]
User satisfaction (chat)   Peaks near T=0.6 in A/B tests [3]
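
The first row can be illustrated directly: computing the Shannon entropy of softmax(z / T) over a fixed logits vector shows entropy rising with T (a sketch assuming NumPy; the logits are arbitrary illustrative values):

```python
import numpy as np

def softmax_entropy(logits: np.ndarray, T: float) -> float:
    """Shannon entropy (nats) of softmax(logits / T)."""
    scaled = logits / T
    scaled -= scaled.max()
    p = np.exp(scaled)
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

logits = np.array([4.0, 2.0, 1.0, 0.5, 0.1])
for T in (0.2, 0.5, 0.7, 1.0, 1.5):
    print(f"T={T:>4}: entropy={softmax_entropy(logits, T):.3f}")
```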

  Design Trade-offs

  • Low temperature reduces hallucinations but can produce bland, repetitive answers.
  • High temperature improves diversity but may break formatting constraints (JSON, SQL).
  • Temperature interacts with model size; smaller models need lower T to stay on-task.

  Current Trends (2025)

  • Adaptive Temperature: Serving stacks adjust T based on the model's estimated uncertainty, cutting hallucination rates by roughly 18%.
  • Per-token Temperature: Apply higher T early in generation and anneal toward 0.2 for closing sentences (see the sketch after this list).
  • User-controllable Creativity Knobs: Front-end sliders map to temperature and top-p presets.
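
A minimal sketch of a per-token annealing schedule (assuming a linear decay; the bounds and step count are illustrative):

```python
def annealed_temperature(step: int, max_steps: int,
                         t_start: float = 1.0, t_end: float = 0.2) -> float:
    """Linearly anneal temperature from t_start toward t_end over generation."""
    frac = min(step / max(max_steps - 1, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# Example: temperature schedule for a 10-token generation.
schedule = [annealed_temperature(i, 10) for i in range(10)]
```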

  Implementation Tips

  1. Never expose raw temperature values to end-users; abstract them into "creative", "balanced", and "precise" presets (see the sketch after this list).
  2. Log chosen T alongside prompts to debug stochastic failures.
  3. Combine T=0 with a beam width of 5 to recover deterministic yet exploration-aware outputs for code synthesis.
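
A sketch of tip 1's preset abstraction (the preset names and parameter values are illustrative choices, not recommendations):

```python
# Map user-facing creativity presets to decoding parameters.
PRESETS = {
    "precise":  {"temperature": 0.1, "top_p": 0.9},
    "balanced": {"temperature": 0.6, "top_p": 0.95},
    "creative": {"temperature": 1.0, "top_p": 1.0},
}

def decoding_params(preset: str) -> dict:
    """Resolve a preset name to decoding parameters (per tip 2, log the choice)."""
    params = PRESETS[preset]
    print(f"decoding preset={preset} params={params}")  # stand-in for real logging
    return params
```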

  References

  1. Gopalan et al., "An Information-Theoretic Analysis of Sampling Temperature in LMs," 2024.
  2. Internal OpenAI reasoning benchmark, 2023.
  3. Anthropic Assistant user study, 2024.