Zero-shot learning (ZSL) refers to a model's ability to generalize to tasks, classes, or domains that were never explicitly seen during training, relying only on natural language descriptions or other metadata. In large language models (LLMs), the term usually describes prompting the model with an instruction but without any in-context examples.
Definition and Scope
Traditional ZSL emerged in computer vision, where attributes such as "striped" or "four-legged" let a classifier recognize species absent from its training set. In LLMs, zero-shot generally means answering with no task-specific fine-tuning and no few-shot demonstrations, although system messages and high-level instructions are permitted.
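The classic attribute-based recipe fits in a few lines. In the sketch below, each class (including one never seen in training) is described by a binary attribute vector, and predicted attribute scores for an input are matched to the nearest class; the attribute set, class table, and scores are all illustrative, not drawn from any real dataset.

```python
import numpy as np

# Illustrative attribute signatures: (striped, four-legged, flies, aquatic, hooved).
# "zebra" is treated as unseen: no training images exist, only this attribute row.
CLASS_ATTRIBUTES = {
    "tiger":   np.array([1, 1, 0, 0, 0], dtype=float),
    "eagle":   np.array([0, 0, 1, 0, 0], dtype=float),
    "dolphin": np.array([0, 0, 0, 1, 0], dtype=float),
    "zebra":   np.array([1, 1, 0, 0, 1], dtype=float),  # never seen in training
}

def classify(attr_scores: np.ndarray) -> str:
    """Return the class whose attribute vector is nearest (L2) to the scores."""
    return min(CLASS_ATTRIBUTES,
               key=lambda c: np.linalg.norm(CLASS_ATTRIBUTES[c] - attr_scores))

# Stand-in for a trained attribute predictor's output on a zebra photo.
scores = np.array([0.9, 0.8, 0.0, 0.1, 0.9])
print(classify(scores))  # -> zebra, although no zebra image was ever trained on
```

Because "zebra" shares most attributes with "tiger" but adds "hooved", the textual description alone is enough to separate the unseen class from the seen one.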
Mechanisms Enabling ZSL
- Massive pre-training corpora provide latent coverage of many tasks.
- Textual descriptions act as weak labels that map unseen classes into a shared semantic space (see the sketch after this list).
- Attention over the instruction supports compositional generalization, letting familiar sub-skills combine into unseen tasks (e.g., "translate Klingon to French").
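The same shared-space idea drives zero-shot text classification: label descriptions and inputs are projected into one space and matched by similarity. The sketch below uses a deliberately crude bag-of-words vectorizer as a stand-in for a learned encoder, so it is self-contained but not representative of real embedding quality; the labels and texts are invented.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'encoder'; a placeholder for a learned text embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Label descriptions act as weak labels: no training examples, only text.
label_descriptions = {
    "sports":  "news about a team match score and athletes",
    "finance": "news about a market stock earnings and banks",
}

doc = "the team won the match with a late score"
doc_vec = embed(doc)
best = max(label_descriptions,
           key=lambda lbl: cosine(doc_vec, embed(label_descriptions[lbl])))
print(best)  # -> sports
```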
Quality Benchmarks
- MMLU (Hendrycks et al., 2020) covers 57 subjects and is reported in both zero-shot and few-shot settings, making it a standard yardstick for instruction-following without demonstrations; a minimal scoring harness is sketched below.
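To make such an evaluation concrete, here is a minimal harness for scoring zero-shot accuracy on MMLU-style multiple-choice items. `ask_model` is a stub and the two items are invented placeholders, not benchmark data; swap in a real model call and dataset loader.

```python
# Minimal zero-shot multiple-choice scorer. Items below are invented examples.
ITEMS = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"question": "At sea level, water boils at ?",
     "choices": ["50C", "75C", "100C", "150C"], "answer": "C"},
]

def ask_model(prompt: str) -> str:
    """Stub: always answers 'B'. Replace with a real chat-completion call."""
    return "B"

def score(items) -> float:
    correct = 0
    for item in items:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", item["choices"]))
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
        if ask_model(prompt).strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(items)

print(f"zero-shot accuracy: {score(ITEMS):.2f}")  # 0.50 with the stub
```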
Design Trade-offs
- Prompt Specificity: Elaborate instructions improve accuracy but add token overhead (quantified in the snippet after this list).
- Temperature: Higher temperature explores more of the solution space but risks off-task answers.
- Latent Bias: Without examples, demographic or domain bias may surface more strongly.
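The token-overhead half of the specificity trade-off is easy to measure. The snippet below compares a terse and an elaborate version of the same zero-shot prompt using tiktoken's cl100k_base encoding (the library is assumed installed; both prompts are illustrative).

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

terse = "Classify sentiment: 'The battery died after an hour.'"
elaborate = (
    "You are a careful annotator. Classify the sentiment of the review below "
    "as positive, negative, or neutral. Consider sarcasm and mixed opinions, "
    "and answer with exactly one word.\n\nReview: 'The battery died after an hour.'"
)

for name, prompt in [("terse", terse), ("elaborate", elaborate)]:
    print(f"{name}: {len(enc.encode(prompt))} tokens")
```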
Current Trends (2025)
- Tool-Augmented ZSL: ReAct-style agents combine retrieval and code execution on zero-shot tasks (a minimal control loop is sketched after this list).
- Multi-modal ZSL: Multi-modal LLMs classify images against textual attribute descriptions such as "urban night scene" without image-specific training.
- Synthetic Pre-Fine-Tuning: Models tuned on automatically generated instruction data (e.g., Orca 2) narrow the zero-shot gap relative to models tuned on human-annotated data.
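The control flow of a tool-augmented agent fits in a short loop. The sketch below is a heavily simplified ReAct-style harness: `fake_llm` replays canned outputs in place of a real model, and the single `lookup` tool is hypothetical; only the Thought/Action/Observation loop structure is the point.

```python
import re

def lookup(query: str) -> str:
    """Hypothetical retrieval tool; returns a canned snippet."""
    return "Paris is the capital of France."

TOOLS = {"lookup": lookup}

# Canned model outputs standing in for real LLM calls, so the loop runs.
SCRIPT = iter([
    "Thought: I should check a reference.\nAction: lookup[capital of France]",
    "Thought: The observation answers it.\nFinal Answer: Paris",
])

def fake_llm(transcript: str) -> str:
    return next(SCRIPT)

def react(question: str, max_steps: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", step)
        if match:  # execute the tool and feed the observation back
            tool, arg = match.groups()
            transcript += f"\nObservation: {TOOLS[tool](arg)}"
    return "no answer within budget"

print(react("What is the capital of France?"))  # -> Paris
```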
Implementation Tips
- Add a high-level role or context in a system message (e.g., "You are an expert medical assistant") to steer domain tone.
- Evaluate zero-shot before few-shot; a large model may already clear the target quality bar, making demonstrations unnecessary.
- Monitor confidence or log-probabilities to detect low-certainty answers (see the sketch below).
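One way to act on the confidence tip, sketched with the OpenAI Python SDK's logprobs option; the model name, 0.8 threshold, and escalation policy are illustrative assumptions, not recommendations.

```python
import math
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": "You are an expert medical assistant."},
        {"role": "user", "content": "Is ibuprofen an NSAID? Answer yes or no."},
    ],
    temperature=0.0,
    logprobs=True,
)

tokens = resp.choices[0].logprobs.content
# Average per-token probability as a crude confidence signal.
avg_prob = sum(math.exp(t.logprob) for t in tokens) / len(tokens)
print(resp.choices[0].message.content, f"(confidence ~ {avg_prob:.2f})")
if avg_prob < 0.8:  # illustrative threshold
    print("Low certainty: route to few-shot prompting or human review.")
```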
References
- Hendrycks et al., "Measuring Massive Multitask Language Understanding", 2020.