Context Compression

Benched.ai Editorial Team

Context compression reduces the token length of retrieved or user-supplied documents so they fit into a model's context window without sacrificing answer quality.

  Compression Techniques

| Technique | Lossiness | Typical Ratio | Tools |
| --- | --- | --- | --- |
| Extractive summarization | Lossy | 2–5× | Llama-Index, T5-based |
| Abstractive summarization | Lossy | 5–15× | GPT-4o, Mixtral-summarize |
| Sentence scoring (TextRank) | Lossy | 1.5–3× | spaCy / NetworkX |
| Knowledge distillation | Lossless for key facts | Variable | QLoRA-compress |
| Token pruning (stopwords) | Near-lossless | 1.2–1.5× | Custom regex |
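As a concrete illustration of the simplest technique in the table, the following is a minimal stopword-pruning sketch using only the standard library. The stopword set and function names are illustrative, not a real tool's API; production systems would use a full stopword list (e.g. from spaCy or NLTK).

```python
import re

# Tiny illustrative stopword set; real pipelines use much larger lists.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "that", "is"}

def prune_tokens(text: str) -> str:
    """Drop stopwords while keeping word order: a near-lossless compressor."""
    tokens = re.findall(r"\S+", text)
    kept = [t for t in tokens if t.lower().strip(".,;:") not in STOPWORDS]
    return " ".join(kept)

original = "The retriever returns the top chunks of the corpus to the model."
pruned = prune_tokens(original)
print(pruned)  # retriever returns top chunks corpus model.
```

On real prose with a full stopword list, this yields roughly the 1.2–1.5× ratio the table cites; the toy set above is more aggressive because the example sentence is stopword-heavy.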

  Quality vs Ratio Curve (sample)

| Ratio | Answer F1 (QA) |
| --- | --- |
| 1× (no compression) | 87 % |
| | 85 % |
| 10× | 78 % |
| 20× | 60 % |
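To reproduce a curve like the one above, each compression ratio is scored with a QA metric such as token-overlap F1 (SQuAD-style). A minimal, self-contained implementation:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, the standard SQuAD-style QA answer metric."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("paris is the capital", "the capital is paris"))  # 1.0
print(token_f1("paris", "the capital is paris"))                 # 0.4
```

Averaging this score over a QA set at each ratio produces one row of the table.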

  Design Trade-offs

  • Higher ratios save cost but may omit reasoning chains required for citations.
  • Abstractive approaches risk hallucinating details not present in the source.
  • Extractive summaries preserve grounding but may miss implicit context.

  Current Trends (2025)

  • Hierarchical compressors first prune at chunk level, then sentence level.
  • Semantic hashing enables constant-time deduplication before compression.
  • Benchmarks like HARDC reduce evaluation to a single "preserve answerability" metric.
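The second trend, hash-based deduplication before compression, can be sketched with a toy SimHash: near-duplicate chunks get signatures that differ in few bits, and exact-signature lookup in a dict is constant time. All names below are illustrative, not a specific library's API.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash: hash each token, accumulate signed bit votes, then
    threshold to a signature. Similar texts get nearby signatures."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def is_duplicate(a: str, b: str, max_hamming: int = 3) -> bool:
    """Near-duplicate if the signatures differ in at most max_hamming bits."""
    return bin(simhash(a) ^ simhash(b)).count("1") <= max_hamming
```

A dedup pass keeps a `dict` from signature to chunk; exact matches are dropped in O(1), and near matches are found by comparing Hamming distances within candidate buckets.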

  Implementation Tips

  1. Measure downstream task accuracy at each compression ratio to set safe limits.
  2. Cache compressed chunks alongside embeddings to avoid recomputation.
  3. Keep citation spans unchanged when regulatory compliance demands verbatim text.
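Tip 2 can be implemented as a cache keyed by a hash of the raw chunk, storing the compressed text and its embedding together so neither is recomputed on repeat lookups. The `compress` and `embed` callables below are stand-ins for whatever summarizer and embedding model the pipeline uses.

```python
import hashlib

# Illustrative in-memory cache; a production system would persist this
# (e.g. alongside the vector store) rather than keep it in a module dict.
_cache: dict[str, tuple[str, list[float]]] = {}

def get_or_compress(chunk: str, compress, embed):
    """Return (compressed_text, embedding), computing each at most once."""
    key = hashlib.sha256(chunk.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = (compress(chunk), embed(chunk))
    return _cache[key]

# Usage with placeholder callables:
summary, vec = get_or_compress(
    "A long retrieved chunk of source text ...",
    compress=lambda t: t[:20],        # placeholder compressor
    embed=lambda t: [float(len(t))],  # placeholder embedding
)
```

Keying on the raw chunk (not the compressed text) is what lets the cache also skip recompression, per tip 2.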