A model's context window is the maximum number of tokens the transformer can attend to in a single forward pass.
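As a quick illustration of working against that limit, the sketch below counts tokens with the tiktoken library before sending a prompt; the 128,000-token window and the output reservation are assumed values, not a particular model's specification.

```python
# Minimal sketch: check whether a prompt fits an assumed context window.
# Assumes the tiktoken tokenizer; the limits below are illustrative.
import tiktoken

CONTEXT_WINDOW = 128_000       # assumed window, not a specific model's spec
RESERVED_FOR_OUTPUT = 4_000    # headroom kept for the generated completion

def fits_in_window(prompt: str, encoding_name: str = "cl100k_base") -> bool:
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(prompt)) + RESERVED_FOR_OUTPUT <= CONTEXT_WINDOW

print(fits_in_window("Summarize the attached report."))  # True for short prompts
```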
Typical Sizes (2025)
Memory Scaling
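A rough way to see the scaling problem: naive dense attention materializes an n × n score matrix per head. The back-of-the-envelope estimate below assumes fp16 scores and a hypothetical 32-head layer; it ignores weights, activations, and the KV cache.

```python
# Back-of-the-envelope memory for the dense attention score matrix.
# Assumptions: fp16 scores (2 bytes) and 32 heads in a single layer;
# weights, activations, and the KV cache are ignored here.
BYTES_PER_SCORE = 2
HEADS = 32

def score_matrix_gib(seq_len: int) -> float:
    return seq_len ** 2 * HEADS * BYTES_PER_SCORE / 1024 ** 3

for n in (8_192, 32_768, 131_072):
    print(f"{n:>7} tokens -> {score_matrix_gib(n):8.1f} GiB per layer")
# 4 GiB at 8 k grows to 1024 GiB at 128 k: 16x the tokens costs 256x the memory.
```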
Design Trade-offs
- Larger windows accommodate long documents but incur attention cost that grows quadratically with sequence length.
- Sparse attention and RoPE scaling techniques mitigate this cost (one is sketched after this list) but may degrade locality, i.e. attention to nearby tokens.
- Very long windows are still vulnerable to the lost-in-the-middle effect, where content placed mid-context is recalled less reliably.
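To make the RoPE scaling point concrete, here is a NumPy sketch of linear position interpolation, one common RoPE scaling variant: position indices are compressed by a scale factor so a model trained on a shorter window can address a longer one. The shapes, base, and the 4x factor are illustrative assumptions; real implementations fuse this into the attention kernel.

```python
# Sketch of linear position interpolation ("RoPE scaling") in NumPy.
# A scale > 1 compresses position indices so a model trained on a short
# window can address a longer one. All constants here are illustrative.
import numpy as np

def rope_angles(positions, dim, base=10_000.0, scale=1.0):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,)
    return np.outer(positions / scale, inv_freq)           # (seq, dim/2)

def apply_rope(x, angles):
    # x: (seq, dim); rotate each (even, odd) feature pair by its angle.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

seq, dim = 8, 16
x = np.random.randn(seq, dim)
within_trained_range = apply_rope(x, rope_angles(np.arange(seq), dim))
interpolated_4x = apply_rope(x, rope_angles(np.arange(seq), dim, scale=4.0))
```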
Current Trends (2025)
- FlashAttention-style kernels (now at version 3) keep attention memory roughly linear in sequence length by never materializing the full score matrix, which makes 256 k contexts practical; the idea is sketched after this list.
- Hierarchical position encodings enable 1 M-token contexts with under 2 % quality loss.
- Vendors bill for the tokens actually used, not for the maximum window size.
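The linear-memory behavior comes from processing keys and values block by block with an online softmax. The NumPy sketch below shows the idea only; the real kernels are fused GPU code, and the block size here is an arbitrary choice.

```python
# Blockwise attention with an online softmax, the core trick behind
# FlashAttention-style kernels: only one (n x block) score tile is live at a
# time, so memory grows linearly with sequence length. NumPy illustration only.
import numpy as np

def blockwise_attention(q, k, v, block=128):
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max for a numerically stable softmax
    row_sum = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = (q @ kb.T) * scale                     # only this tile in memory
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)          # rescale old accumulators
        weights = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + weights @ vb
        row_sum = row_sum * correction + weights.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Matches full (quadratic-memory) softmax attention on a small example.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_attention(q, k, v), reference)
```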
Implementation Tips
- Split documents into sections and retrieve only the relevant chunks instead of always filling the maximum window (see the sketch after this list).
- Monitor GPU memory headroom; overflowing the context can trigger out-of-memory errors.
- Tune the sliding-window size for streaming generation.
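As one way to follow the first tip, the sketch below splits a document into fixed-size chunks and ranks them by naive keyword overlap with the query. The chunk size, overlap, scoring function, and the report.txt filename are all illustrative assumptions; a production system would typically use embedding-based retrieval.

```python
# Minimal sketch: chunk a document and keep only the most relevant chunks
# rather than filling the whole window. Chunk size, overlap, and the naive
# word-overlap scoring are illustrative; real systems usually use embeddings.
from collections import Counter

def chunk_text(text, chunk_words=300, overlap=50):
    words = text.split()
    step = chunk_words - overlap
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), step)]

def top_chunks(query, chunks, k=3):
    q_terms = Counter(query.lower().split())
    def score(chunk):
        c_terms = Counter(chunk.lower().split())
        return sum(min(count, c_terms[term]) for term, count in q_terms.items())
    return sorted(chunks, key=score, reverse=True)[:k]

document = open("report.txt").read()   # hypothetical input document
context = "\n\n".join(top_chunks("quarterly revenue growth", chunk_text(document)))
```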