A model intelligence score is a composite metric designed to summarize overall model capability across multiple benchmarks (MMLU, GSM8K, MBPP, etc.).
Example Weighting Scheme
Overall score = Σᵢ wᵢ × sᵢ, where sᵢ is benchmark i's score normalized to a common scale and the weights wᵢ sum to 1.
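The weighted sum above can be sketched in a few lines. The benchmark names come from the intro; the weights and scores are illustrative assumptions, not any published configuration.

```python
# Minimal sketch of a composite score. Weights and normalized scores
# are made-up illustrative values, not a real leaderboard's config.
weights = {"MMLU": 0.4, "GSM8K": 0.3, "MBPP": 0.3}        # must sum to 1
normalized = {"MMLU": 78.0, "GSM8K": 85.0, "MBPP": 62.0}  # 0-100 scale

# Overall score = sum of weight * normalized score per benchmark.
overall = sum(weights[b] * normalized[b] for b in weights)
print(round(overall, 1))  # 75.3
```

Because the weights sum to 1 and each input is on a 0–100 scale, the composite also lands on a 0–100 scale, which keeps it easy to communicate.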
Design Trade-offs
- A simple weighted average is easy to communicate but hides domain-specific weaknesses.
- Too many benchmarks dilute the signal.
- Proprietary weights reduce transparency.
Current Trends (2025)
- Adaptive weights that adjust to the user's domain (e.g. code vs. chat).
- Leaderboards publish raw per-benchmark scores alongside the composite.
- Bootstrapped confidence intervals report run-to-run noise, typically on the order of ±2%.
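The bootstrapped confidence intervals mentioned above can be sketched with a percentile bootstrap over per-question results. The data here is synthetic (a hypothetical benchmark run with ~70% accuracy); a real pipeline would resample actual evaluation outputs.

```python
import random

# Hypothetical per-question correctness (1 = right, 0 = wrong) for one
# benchmark run of 1000 questions; synthetic stand-in data.
random.seed(0)
results = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

def bootstrap_ci(data, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for mean accuracy."""
    means = []
    for _ in range(n_boot):
        # Resample with replacement and record the resampled mean.
        sample = [random.choice(data) for _ in range(len(data))]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(results)
```

With 1000 questions the 95% interval here spans a few percentage points, which is the scale of noise a leaderboard would publish next to the composite score.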
Implementation Tips
- Normalize each benchmark to a 0–100 scale before weighting.
- Publish the weight configuration as YAML for reproducibility.
- Update weights annually as the relative importance of tasks shifts.
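The first two tips can be sketched together: min-max normalization to 0–100, then emitting the weight config as YAML text. The score ranges and weights are illustrative assumptions, and plain string formatting is used so the sketch needs no YAML dependency.

```python
# Sketch of the tips above. Raw scores, ranges, and weights are
# illustrative assumptions, not published values.
raw = {"MMLU": 0.78, "GSM8K": 0.85, "MBPP": 0.62}        # accuracies in [0, 1]
ranges = {"MMLU": (0.25, 1.0), "GSM8K": (0.0, 1.0), "MBPP": (0.0, 1.0)}
weights = {"MMLU": 0.4, "GSM8K": 0.3, "MBPP": 0.3}

def normalize(score, lo, hi):
    """Min-max rescale a raw score to 0-100, clipped to the declared range."""
    return max(0.0, min(100.0, 100.0 * (score - lo) / (hi - lo)))

norm = {b: normalize(raw[b], *ranges[b]) for b in raw}

# Emit the weight config as YAML text for reproducibility.
yaml_lines = ["weights:"] + [f"  {b}: {w}" for b, w in weights.items()]
print("\n".join(yaml_lines))
```

Declaring an explicit range per benchmark (e.g. starting MMLU at its 25% random-guess floor) keeps a chance-level model from earning composite points, which a naive divide-by-max normalization would miss.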