What's in GPT-5?
A comprehensive overview of the diverse datasets likely to be used to train OpenAI's forthcoming flagship model.
TL;DR
• ~479 TB of raw text, code & multimedia metadata prior to filtering (~0.5 quadrillion tokens).
• ~70 T filtered tokens (≈281 TB; see the quick size check below) after deduplication & quality gating.
• Strong shift toward synthetic data and high-quality licensed corpora (News Corp, Reddit, Stack Exchange, etc.).
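A quick size check on the filtered figure, assuming the common rule of thumb of ~4 bytes per token for English text (the ratio itself is an assumption and varies by tokenizer and language):

```python
# Back-of-envelope size check: ~70 T filtered tokens at an assumed ~4 bytes per
# token should land near the quoted ~281 TB.
BYTES_PER_TOKEN = 4            # rough average; varies by tokenizer and language
filtered_tokens = 70e12        # ~70 T tokens after dedup & quality gating

approx_bytes = filtered_tokens * BYTES_PER_TOKEN
print(f"~{approx_bytes / 1e12:.0f} TB")   # -> ~280 TB, consistent with ~281 TB
```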
The behaviour, knowledge, and limitations of every large language model are a direct function of the corpus it ingests during pre-training. Understanding the provenance, scale, and composition of that corpus is therefore essential for:
- Gauging factual coverage and potential blind spots.
- Anticipating biases introduced by over- or under-represented domains.
- Estimating compute requirements and likely model capacity.
- Designing robust evaluation strategies (e.g. held-out benchmarks; a contamination-check sketch follows this list).
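On that last point, here is a minimal sketch of the kind of benchmark-contamination check that knowing the corpus makes possible; the 8-gram window and helper names are illustrative, not OpenAI's procedure:

```python
# Minimal sketch of a held-out-benchmark hygiene check (illustrative only):
# flag evaluation items that share any 8-gram with the training corpus so
# they can be excluded or down-weighted.
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_corpus_index(training_docs: list[str], n: int = 8) -> set[tuple[str, ...]]:
    index: set[tuple[str, ...]] = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(eval_item: str, corpus_index: set[tuple[str, ...]], n: int = 8) -> bool:
    return not ngrams(eval_item, n).isdisjoint(corpus_index)
```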
High-level breakdown
Numbers come from Alan D. Thompson's July 2024 report What's in GPT-5? and public partnership disclosures by OpenAI. See Appendix C for the full CSV.
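For readers who want to slice the breakdown themselves, a hypothetical loader for the Appendix C table; the file name and column names ("dataset", "category", "tokens_b") are placeholders rather than the report's actual schema:

```python
# Hypothetical helper for exploring the Appendix C table; column names are
# illustrative, not the report's actual schema.
import pandas as pd

df = pd.read_csv("whats_in_gpt5_appendix_c.csv")

# Share of filtered tokens contributed by each category (web, code, synthetic, ...).
by_category = df.groupby("category")["tokens_b"].sum().sort_values(ascending=False)
print((100 * by_category / by_category.sum()).round(1))
```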
Synthetic data
OpenAI appears to have embraced self-generated content at scale, leveraging earlier GPT-4-class models plus human curation to amplify domain-specific coverage. The approach resembles the "textbook-quality" corpora behind Microsoft's phi models and Hugging Face's Cosmopedia:
20+ M prompts → 25 B+ synthetic tokens → filtered for quality & diversity → blended with real-world corpora
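That flow maps onto a generate → filter → blend loop. The sketch below is a simplification under stated assumptions: generate_fn stands in for a GPT-4-class API call, and the word-count filter is a stand-in for real quality and diversity gating; nothing here is OpenAI's actual pipeline.

```python
import random
from typing import Callable, Iterable

def quality_ok(sample: str, min_words: int = 200) -> bool:
    # Placeholder filter; production pipelines use classifiers, dedup, and perplexity cuts.
    return len(sample.split()) >= min_words

def build_blend(prompts: Iterable[str],
                generate_fn: Callable[[str], str],   # e.g. a GPT-4-class completion call
                real_docs: list[str],
                synthetic_share: float = 0.3,
                seed: int = 0) -> list[str]:
    """Generate synthetic docs, keep only those passing the filter, and blend them
    so they make up at most `synthetic_share` of the final corpus."""
    synthetic = [d for d in (generate_fn(p) for p in prompts) if quality_ok(d)]
    # Cap synthetic docs so real data stays the majority of the mix.
    k = int(len(real_docs) * synthetic_share / (1 - synthetic_share))
    rng = random.Random(seed)
    blended = real_docs + rng.sample(synthetic, min(k, len(synthetic)))
    rng.shuffle(blended)
    return blended
```

Capping synthetic_share directly addresses the distributional-shift risk listed below.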
Advantages:
• Effectively unlimited supply of "textbook-quality" explanations
• Easy control over style, difficulty, and domain
• Reduced licensing constraints compared with copyrighted web text
Risks:
• Distributional shift if synthetic samples dominate (a simple drift check is sketched after this list)
• Error amplification / model drift
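The first risk can at least be monitored. One illustrative (not documented) approach is to track the unigram Jensen-Shannon divergence between the synthetic and real slices of the mix; a rising score is a cheap early-warning signal:

```python
# Illustrative drift monitor: compare unigram distributions of the synthetic
# and real slices with Jensen-Shannon divergence.
from collections import Counter
import math

def unigram_dist(docs: list[str]) -> dict[str, float]:
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    def kl(a: dict[str, float]) -> float:
        return sum(a[t] * math.log(a[t] / m[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Usage: drift = js_divergence(unigram_dist(synthetic_docs), unigram_dist(real_docs))
```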
Licensed publisher content
Since mid-2023, OpenAI has signed content partnerships with Reddit and major media groups (News Corp, the Financial Times, Dotdash Meredith, Time, Vox Media, the Associated Press). These deals grant access to paywalled archives that:
• Expand temporal coverage back to the 1940s (Factiva, AP)
• Provide authoritative writing styles useful for RLHF preference models
• Mitigate legal exposure around copyright-infringing scraping
Parameter count estimate
Assuming a Chinchilla-style compute-optimal regime ( #tokens ≈ 20 × #params ), ~70 T final training tokens would support up to ~3.5 T parameters; GPT-5 could nonetheless weigh in at roughly 1.3 T parameters, i.e. trained well past the Chinchilla point at about 54 tokens per parameter, a common trade-off for cheaper inference. Hardware advances (H100/B100-class GPUs, in-house ASICs) make such scale plausible by 2025.
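For reference, the arithmetic behind both figures above:

```python
# Worked numbers for the estimate above, using the Chinchilla heuristic of
# roughly 20 training tokens per parameter.
tokens = 70e12                 # ~70 T filtered training tokens
tokens_per_param = 20          # Chinchilla compute-optimal ratio

chinchilla_params = tokens / tokens_per_param
print(f"{chinchilla_params / 1e12:.1f} T params at the Chinchilla optimum")   # ~3.5 T

params_estimate = 1.3e12       # the ~1.3 T estimate quoted above
print(f"{tokens / params_estimate:.0f} tokens per parameter")                 # ~54
```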
Further reading
• Thompson, A.D. (2024). What's in GPT-5? LifeArchitect.ai.
• Hugging Face (2024). Cosmopedia: Synthetic data at scale.
• OpenAI (2024). Data & Partnerships blog posts.
Appendix C – Selected dataset table
(CSV abridged for brevity; see linked sheet for full numbers.)