A benchmark scoring methodology specifies how raw model outputs are converted into quantitative scores that drive leaderboards and purchasing decisions. Without a transparent method, headline numbers are meaningless.
Scoring Pipeline Components
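A minimal sketch of the stages implied above, normalization, per-example scoring, and aggregation into a headline number; the function and field names here are illustrative, not taken from any particular harness:

```python
from statistics import mean

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting quirks don't change the score."""
    return " ".join(text.lower().split())

def exact_match(prediction: str, reference: str) -> float:
    """Per-example score: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def score_run(examples: list[dict]) -> float:
    """Aggregate per-example scores into the single number a leaderboard reports."""
    return mean(exact_match(ex["prediction"], ex["reference"]) for ex in examples)

run = [
    {"prediction": "Paris", "reference": "paris"},
    {"prediction": "The answer is Paris.", "reference": "Paris"},
]
print(score_run(run))  # 0.5: the second output fails strict exact match
```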
Common Metrics by Task Type
Design Trade-offs
- Strict exact-match scoring penalizes semantically equivalent paraphrases but yields unambiguous rankings (see the sketch after this list).
- Learned metrics (e.g., BERTScore) capture semantics but can be gamed by adversarial synonym substitution.
- Weighting tasks or domains improves representativeness yet introduces subjective choices.
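To make the first two trade-offs concrete, the sketch below contrasts strict exact match with a simple token-overlap F1, used here as a stand-in for learned similarity metrics such as BERTScore (which would require downloading a model):

```python
def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_match(pred: str, ref: str) -> float:
    return float(normalize(pred) == normalize(ref))

def token_f1(pred: str, ref: str) -> float:
    """Bag-of-words F1: a crude proxy for semantic-similarity metrics."""
    p, r = normalize(pred).split(), normalize(ref).split()
    overlap = sum(min(p.count(t), r.count(t)) for t in set(p))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

pred = "It was signed in the year 1787."
ref = "The constitution was signed in 1787."
print(exact_match(pred, ref))  # 0.0: the paraphrase gets no credit
print(token_f1(pred, ref))     # ~0.77: partial credit for shared content
```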
Current Trends (2025)
- Multi-reference datasets reduce false negatives in exact-match (EM) evaluation (see the sketch after this list).
- Leaderboards publish 95% confidence intervals to discourage overfitting to noise.
- Open-source eval harnesses (HELM, lm-eval-v2) standardize preprocessing to curb "benchmark leakage."
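A sketch of multi-reference exact match, where a prediction is credited if it matches any accepted reference; the normalization rules shown are assumptions, not a standard:

```python
def normalize(text: str) -> str:
    return " ".join(text.lower().split()).rstrip(".")

def multi_ref_exact_match(prediction: str, references: list[str]) -> float:
    """Score 1.0 if the prediction matches any reference after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))

# A single gold answer ("New York City") would mark "NYC" wrong;
# extra references remove that false negative.
refs = ["New York City", "NYC", "New York, NY"]
print(multi_ref_exact_match("nyc", refs))          # 1.0
print(multi_ref_exact_match("Los Angeles", refs))  # 0.0
```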
Implementation Tips
- Freeze the evaluation container (a hash of scripts + dataset) so scores are reproducible (see the first sketch below).
- Report both mean and median; a large gap between them signals outlier-driven skew.
- Use a stratified bootstrap to compute per-domain confidence intervals (see the second sketch below).
- Document any manual overrides (e.g., regex cleaning) in the benchmark card.
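A minimal sketch of the freezing step, fingerprinting the evaluation scripts and dataset with a single hash that can be recorded in the benchmark card; the paths are hypothetical:

```python
import hashlib
from pathlib import Path

def fingerprint(paths: list[str]) -> str:
    """SHA-256 over the names and contents of all files, visited in a deterministic order."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        files = sorted(f for f in path.rglob("*") if f.is_file()) if path.is_dir() else [path]
        for file in files:
            digest.update(str(file).encode())
            digest.update(file.read_bytes())
    return digest.hexdigest()

# A changed hash means later scores are no longer directly comparable.
print(fingerprint(["eval_scripts/", "data/benchmark_v1.jsonl"]))
```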
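And a sketch of the stratified bootstrap, resampling within each domain separately so small domains are not drowned out by large ones; the per-domain score layout is an assumption:

```python
import random
from statistics import mean

def stratified_bootstrap_ci(scores_by_domain: dict[str, list[float]],
                            n_resamples: int = 2000,
                            alpha: float = 0.05,
                            seed: int = 0) -> dict[str, tuple[float, float]]:
    """Return a (low, high) percentile confidence interval for each domain's mean score."""
    rng = random.Random(seed)
    cis = {}
    for domain, scores in scores_by_domain.items():
        means = sorted(mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples))
        lo = means[int(n_resamples * alpha / 2)]
        hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
        cis[domain] = (lo, hi)
    return cis

# Per-example correctness (1.0 / 0.0) grouped by domain.
scores = {
    "math": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "coding": [0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
}
print(stratified_bootstrap_ci(scores))  # per-domain (low, high) 95% CIs
```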