OpenAI has not released an official data-card for GPT-5 or o4. Nevertheless, a mosaic of technical reports, system cards, partner documentation, academic papers and investigative journalism lets us sketch a reasonably precise picture of the corpora that feed these models at each stage of training. Below is a research brief that integrates what is known, what is strongly inferred and where genuine uncertainty remains.
Both GPT-5 and o4 draw on a multi-trillion-token pre-training mixture dominated by filtered Common Crawl, large licensed book scans, code from public software repos and high-quality reference corpora such as Wikipedia and arXiv. They then diverge in post-training. GPT-5 continues the GPT-4 lineage of supervised fine-tuning and RLHF, augmented by massive synthetic-data self-play and safety-oriented reward models. o4, by contrast, layers an additional Reinforcement Fine-Tuning (RFT) pass that relies on formal "grader" functions rather than human comparison labels, letting practitioners optimise bespoke reasoning skills with far less data.
Indicative composition of the shared pre-training corpus and the model-specific post-training additions
Pre-training corpus in detail
OpenAI's data pipeline still begins with large-scale Common Crawl snapshots, filtered with quality, deduplication, adult-content and personally-identifiable-information screens—a methodology unchanged since GPT-3 but applied to far newer crawls1. Internal procurement documents note that Common Crawl has contributed "> 80 % of raw tokens used to train GPT-class models2".
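OpenAI's actual filtering code is not public. Purely to make the pipeline stages concrete, here is a minimal sketch of what quality, deduplication and PII screens of that kind can look like; the thresholds, regexes and function names are illustrative assumptions, not OpenAI's implementation.

```python
import hashlib
import re

# Hypothetical thresholds and patterns; the real filters are far more elaborate.
MIN_WORDS = 50
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

seen_hashes = set()  # exact-duplicate tracking across the crawl


def passes_quality(doc: str) -> bool:
    """Crude proxy for a 'quality' screen: minimum length and alphabetic ratio."""
    words = doc.split()
    if len(words) < MIN_WORDS:
        return False
    alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha > 0.6


def is_duplicate(doc: str) -> bool:
    """Exact-hash dedup; production pipelines also use fuzzy methods such as MinHash."""
    h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False


def scrub_pii(doc: str) -> str:
    """Mask obvious emails and phone numbers; real PII screens cover far more."""
    doc = EMAIL_RE.sub("[EMAIL]", doc)
    return PHONE_RE.sub("[PHONE]", doc)


def filter_crawl(docs):
    """Apply quality, dedup and PII screens to an iterable of raw crawl documents."""
    for doc in docs:
        if passes_quality(doc) and not is_duplicate(doc):
            yield scrub_pii(doc)
```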
To stave off crawl-quality saturation, OpenAI struck digitisation partnerships with Harvard, the Boston Public Library and Oxford's Bodleian, yielding 394 million scanned pages (≈ 242 B tokens) of mostly public-domain long-form text3. Analysts estimate GPT-5 ingests at least 70 T cleaned tokens across dozens of data silos, totalling about 281 TB12.
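Those headline figures are mutually consistent; the back-of-the-envelope check below derives the implied per-page and per-byte ratios (the ratios themselves are computed here, not sourced).

```python
# Back-of-the-envelope consistency check on the cited figures.
pages = 394e6            # scanned pages from the library digitisation partnerships
book_tokens = 242e9      # ≈ 242 B tokens of long-form text
total_tokens = 70e12     # ≥ 70 T cleaned tokens (lower-bound analyst estimate)
total_bytes = 281e12     # ≈ 281 TB of cleaned data

print(f"tokens per scanned page: {book_tokens / pages:.0f}")         # ≈ 614
print(f"bytes per token:         {total_bytes / total_tokens:.1f}")   # ≈ 4.0
print(f"book share of corpus:    {book_tokens / total_tokens:.2%}")   # ≈ 0.35 %
```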
Code remains indispensable. Investigations into GitHub Copilot confirmed wholesale ingestion of public repositories5. OpenAI's own Codex literature describes "billions of lines of code" spanning dozens of languages, and insiders expect a further 10× expansion in reasoning-oriented code and tool-use samples for GPT-513.
Multimodal text–vision–audio alignment stems from the GPT-4o pipeline, whose system card confirms end-to-end training over joint modalities; GPT-5 therefore inherits image+audio pairs, while o4 (text-vision only) samples a reduced slice7.
Finally, OpenAI supplements natural data with synthetic self-play. A Wall Street Journal report on project "Orion" (an internal codename for GPT-5) details large-scale synthetic augmentation to overcome high-quality data limits8. CriticGPT, a GPT-4-based AI reviewer, was built to reduce the cost and inconsistency of human preference labels9.
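Neither the Orion synthetic pipeline nor CriticGPT's exact role has been published. As a minimal sketch of the general pattern reported (sample candidate answers, score them with an AI critic, keep only high-scoring pairs for further training), with invented function names and thresholds:

```python
def synthetic_self_play(prompts, generator, critic, keep_threshold=0.8, n_samples=4):
    """Generic self-play augmentation loop: sample several candidates per prompt,
    score them with an AI critic model, and retain only high-scoring pairs as
    synthetic fine-tuning data. All names and values here are hypothetical."""
    dataset = []
    for prompt in prompts:
        candidates = [generator(prompt) for _ in range(n_samples)]
        scored = [(critic(prompt, c), c) for c in candidates]
        score, best = max(scored)
        if score >= keep_threshold:
            dataset.append({"prompt": prompt, "completion": best})
    return dataset
```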
Post-training regimes common to both models
After raw next-token training, both models pass through Supervised Fine-Tuning on curated instruction-following examples and then Reinforcement Learning from Human or AI Feedback (RLHF / RLAIF). The GPT-4 research blog describes injecting an additional safety reward during RLHF to reduce disallowed content, a strategy echoed in the GPT-4 system card and inherited by subsequent models6. OpenAI researchers later extended alignment with the instruction-hierarchy method, which uses auto-generated contrasting examples to teach a model to privilege system instructions over user instructions; because GPT-5 is trained after these advances, it almost certainly embeds the same alignment corpus10.
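How the safety reward is weighted against the ordinary preference reward is not disclosed. Purely as a sketch of the reward-shaping idea, with an invented weight and toy scores:

```python
def blended_reward(r_helpfulness: float, r_safety: float, safety_weight: float = 0.5) -> float:
    """Combine the usual preference-model reward with a safety reward before the
    policy-gradient update. The 0.5 weight is a placeholder; the actual mixing
    scheme has not been published."""
    return r_helpfulness + safety_weight * r_safety


# Toy scores for three sampled completions: the second triggers the safety model.
scores = [(1.2, 0.0), (0.9, -2.5), (-0.1, 0.0)]
rewards = [blended_reward(h, s) for h, s in scores]
print(rewards)   # ≈ [1.2, -0.35, -0.1]
```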
The o4-specific post-training twist: Reinforcement Fine-Tuning
Azure's preview documentation lays out Reinforcement Fine-Tuning (RFT), available only for o4-mini today11. Instead of human comparisons, RFT relies on explicit grader functions—string checks, text-similarity metrics or small LLM critic models—to score each sampled answer. The policy is then updated with policy-gradient RL. A schematic grader sketch follows the list below; notable traits include:
- No system messages in training JSONL; the final message must be user-authored (ensures pure reward attribution).
- A reasoning-effort flag that lets trainers trade compute for deeper chain-of-thought.
- Grader composability, enabling compound skills such as mathematical accuracy + style.
- Cost guardrails that auto-pause jobs at roughly $5 k to avoid runaway spending.
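In the Azure service the graders are declared in the fine-tuning job configuration; the sketch below is a plain-Python analogue rather than the actual grader schema, with illustrative weights, showing how a string-check grader and a text-similarity grader compose into a compound reward such as accuracy + style.

```python
from difflib import SequenceMatcher

# Schematic stand-ins for RFT graders; not the Azure grader schema.


def string_check_grader(answer: str, expected: str) -> float:
    """Binary grader: exact match after normalisation."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def similarity_grader(answer: str, reference: str) -> float:
    """Continuous grader: crude text similarity in [0, 1]."""
    return SequenceMatcher(None, answer, reference).ratio()


def compose_graders(*weighted_graders):
    """Combine several (weight, grader) pairs into one compound reward."""
    def grade(answer: str, reference: str) -> float:
        return sum(w * g(answer, reference) for w, g in weighted_graders)
    return grade


# Compound skill: 70 % exact-answer accuracy, 30 % closeness to a reference answer.
grader = compose_graders((0.7, string_check_grader), (0.3, similarity_grader))
print(grader("forty-two", "forty-two"))   # ≈ 1.0: both graders agree
print(grader("forty two", "forty-two"))   # ≈ 0.27: partial credit from the similarity term
```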
This regime is absent from publicly documented GPT-4/4.5 pipelines, implying it is an o-series experiment that may or may not migrate into GPT-5.
Outstanding unknowns and research gaps
- Exact token counts per domain. Public estimates range from 40 T to 100 T total tokens14.
- Synthetic vs natural mix ratio. WSJ sources mention "multiple costly training runs" to tune the synthetic fraction8, but numbers are unreleased.
- Audio and video transcription volume. The quantity of Whisper-transcribed media used in GPT-5 is unknown, though the GPT-4o pipeline suggests substantial inclusion7.
- Future RFT adoption. If field trials demonstrate superior downstream safety, RFT could replace vanilla RLHF across the GPT line; monitoring upcoming system cards is advisable.
Conclusion
While the GPT-5 and o4 training stacks remain proprietary, the convergence of technical papers, system cards and partner documentation reveals a consistent architecture: vast but aggressively filtered web-scale pre-training followed by multilayer alignment. GPT-5 scales every stage—data volume, synthetic augmentation and reward-model complexity—whereas o4 pioneers a leaner, grader-driven RFT loop that may become the next alignment standard. Keeping track of future transparency reports and system cards will be essential for updating these inferences.