A model intelligence score is a composite metric designed to summarize overall model capability across multiple benchmarks (MMLU, GSM8K, MBPP, etc.).
Example Weighting Scheme
Overall score = Σᵢ wᵢ × sᵢ, where sᵢ is benchmark i's score normalized to a common scale and the weights wᵢ sum to 1.
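The weighted sum above can be sketched in a few lines. The benchmark names come from the intro; the weights and scores are illustrative assumptions, not any published configuration.

```python
# Minimal sketch of a composite score. Weights and normalized scores
# are made-up illustrative values, not a real leaderboard's config.
weights = {"MMLU": 0.4, "GSM8K": 0.3, "MBPP": 0.3}        # must sum to 1
normalized = {"MMLU": 78.0, "GSM8K": 85.0, "MBPP": 62.0}  # 0-100 scale

# Overall score = sum of weight * normalized score per benchmark.
overall = sum(weights[b] * normalized[b] for b in weights)
print(round(overall, 1))  # 75.3
```

Because the weights sum to 1 and each input is on a 0–100 scale, the composite also lands on a 0–100 scale, which keeps it easy to communicate.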
Design Trade-offs
- A simple weighted average is easy to communicate but hides domain-specific weaknesses.
- Too many benchmarks dilute the signal.
- Proprietary weights reduce transparency.
Current Trends (2025)
- Adaptive weights that adjust to the user's domain (e.g. code vs. chat).
- Leaderboards publish raw per-benchmark scores alongside the composite.
- Bootstrapped confidence intervals report run-to-run noise, typically on the order of ±2%.
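The bootstrapped confidence intervals mentioned above can be sketched with a percentile bootstrap over per-question results. The data here is synthetic (a hypothetical benchmark run with ~70% accuracy); a real pipeline would resample actual evaluation outputs.

```python
import random

# Hypothetical per-question correctness (1 = right, 0 = wrong) for one
# benchmark run of 1000 questions; synthetic stand-in data.
random.seed(0)
results = [1 if random.random() < 0.7 else 0 for _ in range(1000)]

def bootstrap_ci(data, n_boot=2000, alpha=0.05):
    """Percentile-bootstrap confidence interval for mean accuracy."""
    means = []
    for _ in range(n_boot):
        # Resample with replacement and record the resampled mean.
        sample = [random.choice(data) for _ in range(len(data))]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(results)
```

With 1000 questions the 95% interval here spans a few percentage points, which is the scale of noise a leaderboard would publish next to the composite score.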
Implementation Tips
- Normalize each benchmark to a 0–100 scale before weighting.
- Publish the weight configuration as YAML for reproducibility.
- Update weights annually as the relative importance of tasks shifts.
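The first two tips can be sketched together: min-max normalization to 0–100, then emitting the weight config as YAML text. The score ranges and weights are illustrative assumptions, and plain string formatting is used so the sketch needs no YAML dependency.

```python
# Sketch of the tips above. Raw scores, ranges, and weights are
# illustrative assumptions, not published values.
raw = {"MMLU": 0.78, "GSM8K": 0.85, "MBPP": 0.62}        # accuracies in [0, 1]
ranges = {"MMLU": (0.25, 1.0), "GSM8K": (0.0, 1.0), "MBPP": (0.0, 1.0)}
weights = {"MMLU": 0.4, "GSM8K": 0.3, "MBPP": 0.3}

def normalize(score, lo, hi):
    """Min-max rescale a raw score to 0-100, clipped to the declared range."""
    return max(0.0, min(100.0, 100.0 * (score - lo) / (hi - lo)))

norm = {b: normalize(raw[b], *ranges[b]) for b in raw}

# Emit the weight config as YAML text for reproducibility.
yaml_lines = ["weights:"] + [f"  {b}: {w}" for b, w in weights.items()]
print("\n".join(yaml_lines))
```

Declaring an explicit range per benchmark (e.g. starting MMLU at its 25% random-guess floor) keeps a chance-level model from earning composite points, which a naive divide-by-max normalization would miss.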