
Few-Shot Learning

Benched.ai Editorial Team

Few-shot learning (FSL) refers to a model's ability to adapt to new tasks or classes given only a handful (usually 1-32) of annotated examples. In large language models this commonly takes the form of providing several input-output demonstrations directly inside the prompt so the model can infer the task pattern without parameter updates.

  Definition and Scope

The "shots" in few-shot denote the count of task examples available at inference time. Few-shot prompt engineering differs from zero-shot prompting (no examples) and fine-tuning (weight updates). In computer vision FSL may involve meta-learning algorithms such as MAML that update a small classifier head; in LLMs, context-based in-prompt learning dominates.

  Mechanisms Enabling FSL in LLMs

  1. In-context learning: the transformer conditions on the demonstrations through attention and infers the input-output mapping on the fly.
  2. Attention reuse: keys and values from the demonstration pairs bias decoder states toward analogous outputs.
  3. Gradient-free adaptation: no weights change; capacity is bounded by the context window (see the sketch after this list).
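
A minimal sketch of point 3: demonstrations are packed into the prompt until a token budget is exhausted, so effective capacity is bounded by the context window. The build_prompt helper, the 4-characters-per-token estimate, and the 64-token safety margin are illustrative assumptions; a real system would use the model's tokenizer.

```python
from typing import List, Tuple

def build_prompt(instruction: str,
                 demos: List[Tuple[str, str]],
                 query: str,
                 max_tokens: int = 4096) -> str:
    """Pack as many demonstrations as the context budget allows."""
    def estimate_tokens(text: str) -> int:
        # Rough heuristic: ~4 characters per token for English text.
        return len(text) // 4

    parts = [instruction]
    # Reserve room for the instruction, the query, and a small margin.
    budget = max_tokens - estimate_tokens(instruction) - estimate_tokens(query) - 64
    for x, y in demos:
        block = f"Input: {x}\nOutput: {y}"
        cost = estimate_tokens(block)
        if cost > budget:
            break  # context window exhausted; remaining demos are dropped
        parts.append(block)
        budget -= cost
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```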

  Effect of Shot Count on Accuracy

Task                 0-shot   4-shot   16-shot   32-shot
GSM-8K (math)        22 %     42 %     55 %      61 %
MMLU (knowledge)     45 %     63 %     70 %      72 %
Big-Bench Hard       17 %     34 %     46 %      50 %

  Design Trade-offs

  • Prompt Length: More shots boost accuracy but consume context tokens and increase latency.
  • Example Selection: Diverse, prototypical examples generalize better than random ones (see the selection sketch after this list).
  • Order Effects: LLMs show recency bias, weighting the examples nearest the query most heavily; placing the harder or most representative examples last lets that bias work in your favor.
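
One common way to realize the diversity heuristic above is greedy max-min (farthest-point) selection over embeddings: repeatedly pick the candidate farthest from everything already chosen. A sketch, assuming example embeddings are already computed as a numpy array; select_diverse is a hypothetical helper, and how the embeddings are produced is out of scope here.

```python
import numpy as np

def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedy max-min selection of k demonstration indices.

    embeddings: (n, d) array of example embeddings.
    Returns indices of examples spread out over the embedding space.
    """
    n = embeddings.shape[0]
    chosen = [0]  # seed with an arbitrary first example
    # Track each candidate's distance to its nearest chosen example.
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < min(k, n):
        idx = int(np.argmax(dists))        # farthest from everything chosen
        chosen.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)   # refresh nearest-chosen distances
    return chosen
```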

  Current Trends (2025)

  • Automatic Example Retrieval: Embedding search selects the top-k demonstrations most similar to each user query (see the retrieval sketch after this list).
  • Synthetic Few-Shot Data: Pipelines such as Self-Instruct have the model generate its own demonstrations offline, cutting human-labeling costs.
  • Contrastive Decoding: Few-shot prompting is combined with a smaller "amateur" model whose output distribution is contrasted against the main model's, penalizing generic or hallucinated continuations.
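
A minimal sketch of retrieval-based demo selection; top_k_demos is a hypothetical helper, and it assumes the query and demonstration embeddings are precomputed and unit-normalized, so the dot product equals cosine similarity.

```python
import numpy as np

def top_k_demos(query_vec: np.ndarray,
                demo_vecs: np.ndarray,
                demos: list[tuple[str, str]],
                k: int = 4) -> list[tuple[str, str]]:
    """Return the k demonstrations most similar to the query.

    query_vec: (d,) unit-normalized embedding of the user query.
    demo_vecs: (n, d) unit-normalized embeddings of candidate demos.
    """
    sims = demo_vecs @ query_vec          # cosine similarities, shape (n,)
    best = np.argsort(-sims)[:k]          # indices of the k highest scores
    return [demos[i] for i in best]
```

The selected pairs would then feed directly into a prompt builder such as the one sketched earlier.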

  Implementation Tips

  1. Cache tokenized demonstrations, or use the provider's prompt caching where available, so the shared demonstration prefix is not re-encoded on every request.
  2. Remove personally identifiable information from demonstrations to prevent leakage.
  3. Use a system message to describe the task and output schema, then supply the demonstrations; models follow consistent formatting cues. A sketch follows below.
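
A sketch of tip 3 using the widely adopted chat-message convention, with demonstrations supplied as prior user/assistant turns. The extraction schema is invented for illustration, and the assumption is an OpenAI-compatible chat endpoint that accepts a message list of this shape.

```python
# The system message fixes the schema; each demonstration pair then
# reinforces the same output format before the real query arrives.
messages = [
    {"role": "system",
     "content": "Extract {\"name\": str, \"city\": str} as JSON from the sentence."},
    # Demonstration 1
    {"role": "user", "content": "Ada Lovelace moved to London in 1835."},
    {"role": "assistant", "content": "{\"name\": \"Ada Lovelace\", \"city\": \"London\"}"},
    # Demonstration 2
    {"role": "user", "content": "Grace Hopper taught in New Haven."},
    {"role": "assistant", "content": "{\"name\": \"Grace Hopper\", \"city\": \"New Haven\"}"},
    # Actual query
    {"role": "user", "content": "Alan Turing worked in Manchester."},
]
# `messages` would then be passed to any OpenAI-compatible chat endpoint.
```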