A model's context window is the maximum number of tokens the transformer can attend to in a single forward pass.
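As a quick illustration of working against that limit, the sketch below counts tokens with the tiktoken library before sending a prompt; the 128,000-token window and the output reservation are assumed values, not a particular model's specification.

```python
# Minimal sketch: check whether a prompt fits an assumed context window.
# Assumes the tiktoken tokenizer; the limits below are illustrative.
import tiktoken

CONTEXT_WINDOW = 128_000       # assumed window, not a specific model's spec
RESERVED_FOR_OUTPUT = 4_000    # headroom kept for the generated completion

def fits_in_window(prompt: str, encoding_name: str = "cl100k_base") -> bool:
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(prompt)) + RESERVED_FOR_OUTPUT <= CONTEXT_WINDOW

print(fits_in_window("Summarize the attached report."))  # True for short prompts
```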
Typical Sizes (2025)
Memory Scaling
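A rough way to see the scaling problem: naive dense attention materializes an n × n score matrix per head. The back-of-the-envelope estimate below assumes fp16 scores and a hypothetical 32-head layer; it ignores weights, activations, and the KV cache.

```python
# Back-of-the-envelope memory for the dense attention score matrix.
# Assumptions: fp16 scores (2 bytes) and 32 heads in a single layer;
# weights, activations, and the KV cache are ignored here.
BYTES_PER_SCORE = 2
HEADS = 32

def score_matrix_gib(seq_len: int) -> float:
    return seq_len ** 2 * HEADS * BYTES_PER_SCORE / 1024 ** 3

for n in (8_192, 32_768, 131_072):
    print(f"{n:>7} tokens -> {score_matrix_gib(n):8.1f} GiB per layer")
# 4 GiB at 8 k grows to 1024 GiB at 128 k: 16x the tokens costs 256x the memory.
```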
Design Trade-offs
- Larger windows accommodate long documents but incur attention cost that grows quadratically with sequence length.
- Sparse attention and RoPE scaling techniques mitigate this cost (one is sketched after this list) but may degrade locality, i.e. attention to nearby tokens.
- Very long windows are still vulnerable to the lost-in-the-middle effect, where content placed mid-context is recalled less reliably.
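To make the RoPE scaling point concrete, here is a NumPy sketch of linear position interpolation, one common RoPE scaling variant: position indices are compressed by a scale factor so a model trained on a shorter window can address a longer one. The shapes, base, and the 4x factor are illustrative assumptions; real implementations fuse this into the attention kernel.

```python
# Sketch of linear position interpolation ("RoPE scaling") in NumPy.
# A scale > 1 compresses position indices so a model trained on a short
# window can address a longer one. All constants here are illustrative.
import numpy as np

def rope_angles(positions, dim, base=10_000.0, scale=1.0):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,)
    return np.outer(positions / scale, inv_freq)           # (seq, dim/2)

def apply_rope(x, angles):
    # x: (seq, dim); rotate each (even, odd) feature pair by its angle.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

seq, dim = 8, 16
x = np.random.randn(seq, dim)
within_trained_range = apply_rope(x, rope_angles(np.arange(seq), dim))
interpolated_4x = apply_rope(x, rope_angles(np.arange(seq), dim, scale=4.0))
```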
Current Trends (2025)
- FlashAttention-style kernels (now at version 3) keep attention memory roughly linear in sequence length by never materializing the full score matrix, which makes 256 k contexts practical; the idea is sketched after this list.
- Hierarchical position encodings enable 1 M-token contexts with under 2 % quality loss.
- Vendors bill for the tokens actually used, not for the maximum window size.
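The linear-memory behavior comes from processing keys and values block by block with an online softmax. The NumPy sketch below shows the idea only; the real kernels are fused GPU code, and the block size here is an arbitrary choice.

```python
# Blockwise attention with an online softmax, the core trick behind
# FlashAttention-style kernels: only one (n x block) score tile is live at a
# time, so memory grows linearly with sequence length. NumPy illustration only.
import numpy as np

def blockwise_attention(q, k, v, block=128):
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max for a numerically stable softmax
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                     # only this tile in memory
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)          # rescale old accumulators
        weights = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + weights @ vb
        row_sum = row_sum * correction + weights.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Matches full (quadratic-memory) softmax attention on a small example.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), reference)
```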
Implementation Tips
- Split documents into sections and retrieve only the relevant chunks instead of always filling the maximum window (see the sketch after this list).
- Monitor GPU memory headroom; overflowing the context can trigger out-of-memory errors.
- Tune the sliding-window size for streaming generation.
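As one way to follow the first tip, the sketch below splits a document into fixed-size chunks and ranks them by naive keyword overlap with the query. The chunk size, overlap, scoring function, and the report.txt filename are all illustrative assumptions; a production system would typically use embedding-based retrieval.

```python
# Minimal sketch: chunk a document and keep only the most relevant chunks
# rather than filling the whole window. Chunk size, overlap, and the naive
# word-overlap scoring are illustrative; real systems usually use embeddings.
from collections import Counter

def chunk_text(text, chunk_words=300, overlap=50):
    words = text.split()
    step = chunk_words - overlap
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), step)]

def top_chunks(query, chunks, k=3):
    q_terms = Counter(query.lower().split())
    def score(chunk):
        c_terms = Counter(chunk.lower().split())
        return sum(min(count, c_terms[term]) for term, count in q_terms.items())
    return sorted(chunks, key=score, reverse=True)[:k]

document = open("report.txt").read()   # hypothetical input document
context = "\n\n".join(top_chunks("quarterly revenue growth", chunk_text(document)))
```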