OpenAI has not released an official data-card for GPT-5 or o4. Nevertheless, a mosaic of technical reports, system cards, partner documentation, academic papers and investigative journalism lets us sketch a reasonably precise picture of the corpora that feed these models at each stage of training. Below is a research brief that integrates what is known, what is strongly inferred and where genuine uncertainty remains.
Both GPT-5 and o4 draw on a multi-trillion-token pre-training mixture dominated by filtered Common Crawl, large licensed book scans, code from public software repos and high-quality reference corpora such as Wikipedia and arXiv. They then diverge in post-training. GPT-5 continues the GPT-4 lineage of supervised fine-tuning and RLHF, augmented by massive synthetic-data self-play and safety-oriented reward models. o4, by contrast, layers an additional Reinforcement Fine-Tuning (RFT) pass that relies on formal "grader" functions rather than human comparison labels, letting practitioners optimise bespoke reasoning skills with far less data.
Indicative composition of the shared pre-training corpus and the model-specific post-training additions
Pre-training corpus in detail
OpenAI's data pipeline still begins with large-scale Common Crawl snapshots, filtered with quality, deduplication, adult-content and personally-identifiable-information screens—a methodology unchanged since GPT-3 but applied to far newer crawls1. Internal procurement documents note that Common Crawl has contributed "> 80 % of raw tokens used to train GPT-class models2".
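OpenAI's actual filtering code is not public. Purely to make the pipeline stages concrete, here is a minimal sketch of what quality, deduplication and PII screens of that kind can look like; the thresholds, regexes and function names are illustrative assumptions, not OpenAI's implementation.

```python
import hashlib
import re

# Hypothetical thresholds and patterns; the real filters are far more elaborate.
MIN_WORDS = 50
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

seen_hashes = set()  # exact-duplicate tracking across the crawl


def passes_quality(doc: str) -> bool:
    """Crude proxy for a 'quality' screen: minimum length and alphabetic ratio."""
    words = doc.split()
    if len(words) < MIN_WORDS:
        return False
    alpha = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha > 0.6


def is_duplicate(doc: str) -> bool:
    """Exact-hash dedup; production pipelines also use fuzzy methods such as MinHash."""
    h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False


def scrub_pii(doc: str) -> str:
    """Mask obvious emails and phone numbers; real PII screens cover far more."""
    doc = EMAIL_RE.sub("[EMAIL]", doc)
    return PHONE_RE.sub("[PHONE]", doc)


def filter_crawl(docs):
    """Apply quality, dedup and PII screens to an iterable of raw crawl documents."""
    for doc in docs:
        if passes_quality(doc) and not is_duplicate(doc):
            yield scrub_pii(doc)
```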
To stave off crawl-quality saturation, OpenAI struck digitisation partnerships with Harvard, the Boston Public Library and Oxford's Bodleian, yielding 394 million scanned pages (≈ 242 B tokens) of mostly public-domain long-form text3. Analysts estimate GPT-5 ingests at least 70 T cleaned tokens across dozens of data silos, totalling about 281 TB12.
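Those headline figures are mutually consistent; the back-of-the-envelope check below derives the implied per-page and per-byte ratios (the ratios themselves are computed here, not sourced).

```python
# Back-of-the-envelope consistency check on the cited figures.
pages = 394e6            # scanned pages from the library digitisation partnerships
book_tokens = 242e9      # ≈ 242 B tokens of long-form text
total_tokens = 70e12     # ≥ 70 T cleaned tokens (lower-bound analyst estimate)
total_bytes = 281e12     # ≈ 281 TB of cleaned data

print(f"tokens per scanned page: {book_tokens / pages:.0f}")         # ≈ 614
print(f"bytes per token:         {total_bytes / total_tokens:.1f}")   # ≈ 4.0
print(f"book share of corpus:    {book_tokens / total_tokens:.2%}")   # ≈ 0.35 %
```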
Code remains indispensable. Investigations into GitHub Copilot confirmed wholesale ingestion of public repositories5. OpenAI's own Codex literature describes "billions of lines of code" spanning dozens of languages, and insiders expect a further 10× expansion in reasoning-oriented code and tool-use samples for GPT-513.
Multimodal text–vision–audio alignment stems from the GPT-4o pipeline, whose system card confirms end-to-end training over joint modalities; GPT-5 therefore inherits image+audio pairs, while o4 (text-vision only) samples a reduced slice7.
Finally, OpenAI supplements natural data with synthetic self-play. A Wall Street Journal report on project "Orion" (an internal codename for GPT-5) details large-scale synthetic augmentation to overcome high-quality data limits8. CriticGPT, a GPT-4-based AI reviewer, was built to reduce the cost and inconsistency of human preference labels9.
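Neither the Orion synthetic pipeline nor CriticGPT's exact role has been published. As a minimal sketch of the general pattern reported (sample candidate answers, score them with an AI critic, keep only high-scoring pairs for further training), with invented function names and thresholds:

```python
def synthetic_self_play(prompts, generator, critic, keep_threshold=0.8, n_samples=4):
    """Generic self-play augmentation loop: sample several candidates per prompt,
    score them with an AI critic model, and retain only high-scoring pairs as
    synthetic fine-tuning data. All names and values here are hypothetical."""
    dataset = []
    for prompt in prompts:
        candidates = [generator(prompt) for _ in range(n_samples)]
        scored = [(critic(prompt, c), c) for c in candidates]
        score, best = max(scored)
        if score >= keep_threshold:
            dataset.append({"prompt": prompt, "completion": best})
    return dataset
```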
Post-training regimes common to both models
After raw next-token training, both models pass through Supervised Fine-Tuning on curated instruction-following examples and then Reinforcement Learning from Human or AI Feedback (RLHF / RLAIF). The GPT-4 research blog describes injecting an additional safety reward during RLHF to reduce disallowed content, a strategy echoed in the GPT-4 system card and inherited by subsequent models6. OpenAI researchers later extended alignment with the instruction-hierarchy method, which uses auto-generated contrasting examples to teach a model to privilege system instructions over user instructions; because GPT-5 is trained after these advances, it almost certainly embeds the same alignment corpus10.
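How the safety reward is weighted against the ordinary preference reward is not disclosed. Purely as a sketch of the reward-shaping idea, with an invented weight and toy scores:

```python
def blended_reward(r_helpfulness: float, r_safety: float, safety_weight: float = 0.5) -> float:
    """Combine the usual preference-model reward with a safety reward before the
    policy-gradient update. The 0.5 weight is a placeholder; the actual mixing
    scheme has not been published."""
    return r_helpfulness + safety_weight * r_safety


# Toy scores for three sampled completions: the second triggers the safety model.
scores = [(1.2, 0.0), (0.9, -2.5), (-0.1, 0.0)]
rewards = [blended_reward(h, s) for h, s in scores]
print(rewards)   # ≈ [1.2, -0.35, -0.1]
```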
The o4-specific post-training twist: Reinforcement Fine-Tuning
Azure's preview documentation lays out Reinforcement Fine-Tuning (RFT), available only for o4-mini today11. Instead of human comparisons, RFT relies on explicit grader functions—string checks, text-similarity metrics or small LLM critic models—to score each sampled answer. The policy is then updated with policy-gradient RL. A schematic grader sketch follows the list below; notable traits include:
- No system messages in training JSONL; the final message must be user-authored (ensures pure reward attribution).
- A reasoning-effort flag that lets trainers trade compute for deeper chain-of-thought.
- Grader composability, enabling compound skills such as mathematical accuracy + style.
- Cost guardrails that auto-pause jobs at roughly $5 k to avoid runaway spending.
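In the Azure service the graders are declared in the fine-tuning job configuration; the sketch below is a plain-Python analogue rather than the actual grader schema, with illustrative weights, showing how a string-check grader and a text-similarity grader compose into a compound reward such as accuracy + style.

```python
from difflib import SequenceMatcher

# Schematic stand-ins for RFT graders; not the Azure grader schema.


def string_check_grader(answer: str, expected: str) -> float:
    """Binary grader: exact match after normalisation."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def similarity_grader(answer: str, reference: str) -> float:
    """Continuous grader: crude text similarity in [0, 1]."""
    return SequenceMatcher(None, answer, reference).ratio()


def compose_graders(*weighted_graders):
    """Combine several (weight, grader) pairs into one compound reward."""
    def grade(answer: str, reference: str) -> float:
        return sum(w * g(answer, reference) for w, g in weighted_graders)
    return grade


# Compound skill: 70 % exact-answer accuracy, 30 % closeness to a reference answer.
grader = compose_graders((0.7, string_check_grader), (0.3, similarity_grader))
print(grader("forty-two", "forty-two"))   # ≈ 1.0: both graders agree
print(grader("forty two", "forty-two"))   # ≈ 0.27: partial credit from the similarity term
```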
This regime is absent from publicly documented GPT-4/4.5 pipelines, implying it is an o-series experiment that may or may not migrate into GPT-5.
Outstanding unknowns and research gaps
- Exact token counts per domain. Public estimates range from 40 T to 100 T total tokens14.
- Synthetic vs natural mix ratio. WSJ sources mention "multiple costly training runs" to tune the synthetic fraction8, but numbers are unreleased.
- Audio and video transcription volume. The quantity of Whisper-transcribed media used in GPT-5 is unknown, though the GPT-4o pipeline suggests substantial inclusion7.
- Future RFT adoption. If field trials demonstrate superior downstream safety, RFT could replace vanilla RLHF across the GPT line; monitoring upcoming system cards is advisable.
Conclusion
While the GPT-5 and o4 training stacks remain proprietary, the convergence of technical papers, system cards and partner documentation reveals a consistent architecture: vast but aggressively filtered web-scale pre-training followed by multilayer alignment. GPT-5 scales every stage—data volume, synthetic augmentation and reward-model complexity—whereas o4 pioneers a leaner, grader-driven RFT loop that may become the next alignment standard. Keeping track of future transparency reports and system cards will be essential for updating these inferences.