Benched.ai Editorial Team

What's in GPT-5?

A comprehensive overview of the diverse datasets that are likely used to train OpenAI's forthcoming flagship model.

TL;DR
• ~479 TB of raw text, code & multimedia metadata prior to filtering (~0.5 quadrillion tokens).
• ~70 T filtered tokens (≈281 TB) after deduplication & quality gating.
• Strong shift toward synthetic data and high-quality licensed corpora (News Corp, Reddit, Stack Exchange, etc.).

The behaviour, knowledge, and limitations of every large language model are a direct function of the corpus it ingests during pre-training. Understanding the provenance, scale, and composition of that corpus is therefore essential for:

  1. Gauging factual coverage and potential blind-spots.
  2. Anticipating biases introduced by over- or under-represented domains.
  3. Estimating compute requirements and likely model capacity.
  4. Designing robust evaluation strategies (e.g. held-out benchmarks).

  High-level breakdown

| Category | Representative sources | Unfiltered tokens | Share |
|---|---|---|---|
| Synthetic | Cosmopedia-style LLM-generated textbooks & dialogues | ≈50 T | 10.4 % |
| Web (general) | Common Crawl (filtered) | ≈4 T | 0.8 % |
| Discussion & Q&A | Reddit posts & comments, Stack Exchange | ≈5.4 T | 1.1 % |
| Academic | Consensus NLP, arXiv, papers | ≈3.4 T | 0.7 % |
| News / Publishers | News Corp, Time Inc, FT, AP, Vox, etc. | ≈7 T | 1.4 % |
| Code | GitHub | ≈0.4 T | 0.1 % |
| Legal | FreeLaw PACER & opinions | ≈0.44 T | 0.1 % |
| Books | Books 1-3, Wikipedia | ≈54 B | <0.1 % |
| Other | Shutterstock metadata, Icelandic language corpora, Khan Academy, etc. | ≈1 T | 0.2 % |
| Total (pre-filter) | | ≈479 T | 100 % |

Numbers come from Alan D. Thompson's July 2024 report What's in GPT-5? and public partnership disclosures by OpenAI. See Appendix C for the full CSV.
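The Share column is simply each category's unfiltered token count divided by the ≈479 T pre-filter total. A minimal Python sketch (token counts copied from the table above, so small rounding differences are expected) reproduces it:

```python
# Recompute the "Share" column: category tokens / ~479 T pre-filter total.
PRE_FILTER_TOTAL_T = 479        # trillions of tokens, from the Total row above

categories_t = {                # unfiltered tokens in trillions, from the table
    "Synthetic": 50,
    "Web (general)": 4,
    "Discussion & Q&A": 5.4,
    "Academic": 3.4,
    "News / Publishers": 7,
    "Code": 0.4,
    "Legal": 0.44,
    "Books": 0.054,             # ~54 B tokens
    "Other": 1,
}

for name, tokens in categories_t.items():
    share = 100 * tokens / PRE_FILTER_TOTAL_T
    print(f"{name:<18} {tokens:>6.2f} T  {share:5.2f} %")

# Note: the listed rows sum to only ~72 T; the Total row reports the full
# ~479 T pre-filter corpus, so shares are computed against that larger figure.
```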

  Synthetic data

OpenAI appears to have embraced self-generated content at scale, leveraging earlier models (GPT-4-class) plus human curation to amplify domain-specific coverage. The approach resembles Microsoft's phi textbooks and Hugging Face's Cosmopedia:

20+ M prompts → 25 B+ synthetic tokens → filtered for quality & diversity → blended with real-world corpora
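A minimal sketch of what such a pipeline could look like is below. Everything in it is an illustrative stand-in: the teacher-model call, the quality classifier, and the SHA-1 exact-dedup step are hypothetical choices, not a description of OpenAI's actual tooling.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Sample:
    prompt: str
    text: str
    quality: float

def teacher_complete(prompt: str) -> str:
    """Stand-in for a GPT-4-class teacher model call (hypothetical)."""
    return f"Textbook-style explanation for: {prompt}"

def quality_score(text: str) -> float:
    """Stand-in for a learned quality classifier (hypothetical)."""
    return random.random()

def synthetic_pipeline(prompts: Iterable[str], min_quality: float = 0.8) -> Iterator[Sample]:
    """Prompts -> synthetic text -> quality gate -> exact dedup -> blend-ready samples."""
    seen: set[str] = set()
    for prompt in prompts:                             # 20+ M prompts at production scale
        text = teacher_complete(prompt)                # generate the synthetic sample
        score = quality_score(text)
        if score < min_quality:                        # quality filtering
            continue
        digest = hashlib.sha1(text.lower().encode()).hexdigest()
        if digest in seen:                             # crude diversity / dedup filter
            continue
        seen.add(digest)
        yield Sample(prompt, text, score)              # blended with real-world corpora downstream

# Example usage:
kept = list(synthetic_pipeline(["Explain photosynthesis", "Derive the quadratic formula"]))
```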

Advantages:

• Effectively unlimited supply of "textbook-quality" explanations
• Easy control over style, difficulty, and domain
• Reduced licensing constraints compared with copyrighted web text

Risks:

• Distributional shift if synthetic samples dominate
• Error amplification / model drift

  Licensed publisher content

Since mid-2023, OpenAI has signed content partnerships with major platforms and publishers (Reddit, News Corp, the Financial Times, Dotdash Meredith, Time, Vox Media, AP). These deals grant access to paywalled archives that:

• Expand temporal coverage back to the 1940s (Factiva, AP)
• Provide authoritative writing styles useful for RLHF preference models
• Mitigate the legal exposure that comes with scraping copyrighted content

  Parameter count estimate

Assuming a Chinchilla-style compute-optimal regime ( #tokens ≈ 20 × #params ) and ~70 T final training tokens, GPT-5 could weigh in at roughly 3.5 T parameters; a leaner ~1.3 T-parameter model would imply training well past the compute-optimal point, at roughly 54 tokens per parameter. Hardware advances (H100-class GPUs and their successors, in-house ASICs) make such scale plausible by 2025.
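The arithmetic behind that range, as a quick sketch (the 20-tokens-per-parameter ratio is the Chinchilla heuristic, and C ≈ 6·N·D is the standard training-FLOPs approximation):

```python
# Back-of-the-envelope sizing from the ~70 T filtered token budget.
TOKENS = 70e12                      # ~70 T filtered training tokens (see TL;DR)
TOKENS_PER_PARAM = 20               # Chinchilla compute-optimal heuristic

compute_optimal_params = TOKENS / TOKENS_PER_PARAM
print(f"Compute-optimal size: {compute_optimal_params / 1e12:.1f} T params")   # ~3.5 T

# A ~1.3 T-parameter model on the same data would be trained past that point:
params = 1.3e12
print(f"Tokens per parameter at 1.3 T params: {TOKENS / params:.0f}")          # ~54

# Rough training compute via the C ~ 6 * N * D approximation (FLOPs):
flops = 6 * compute_optimal_params * TOKENS
print(f"Training compute at 3.5 T params: {flops:.1e} FLOPs")                  # ~1.5e27
```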

  Further reading

• Thompson, A.D. (2024). What's in GPT-5? LifeArchitect.ai.
• Hugging Face (2024). Cosmopedia: Synthetic data at scale.
• OpenAI (2024). Data & Partnerships blog posts.

  Appendix C – Selected dataset table

(CSV abridged for brevity; see linked sheet for full numbers.)

| Dataset | Type | Modality | Tokens (T, filtered) | Notes |
|---|---|---|---|---|
| Synthetic | Synthetic (LLM-generated) | Text | 50.0 | Synthetic LLM textbooks and dialogues |
| Reddit Posts (outbound links) | Web | Text | 4.8 | Top-half quality outbound Reddit posts |
| Common Crawl – General | Web | Text | 4.0 | FineWeb filtered Common Crawl subset |
| Consensus NLP | Academic papers | Text | 3.4 | 200 M research papers via Consensus |
| YouTube Subtitles | Dialogue | Text | 3.2 | 2 % sample of video subtitles |
| News Corp / WSJ | News articles | Text | 1.9 | Licensed Dow Jones Factiva archive |
| Common Crawl – Edu | Web (edu) | Text | 1.3 | FineWeb-Edu 1.3 T educational crawl |
| Reddit Comments | Dialogue | Text | 0.64 | 2 % high-quality Reddit comments |
| FreeLaw – PACER | Legal | Text | 0.40 | US federal court filings corpus |
| GitHub | Code | Code | 0.39 | 163 M public repos snapshot |
| Books2 (GPT-3) | Journals | Text | 0.055 | Books2 corpus from GPT-3 training |
| FreeLaw – Opinions | Legal | Text | 0.038 | US court opinions FreeLaw corpus |
| Stack Exchange | Q&A | Text | 0.038 | 50 M questions across Exchange sites |
| Books3 (The Pile) | Journals | Text | 0.025 | Books3 subset from The Pile |
| Shutterstock Metadata | Metadata | Text | 0.020 | Image and video metadata descriptions |
| Wikipedia – Multilingual | Wiki | Text | 0.015 | All multilingual Wikipedia article dumps |
| Books1 (GPT-3) | Books | Text | 0.012 | Project Gutenberg style public books |
| Time Inc | News articles | Text | 0.008 | Time magazine eight-million-article archive |
| Le Monde & Prisa Media | News articles | Text | 0.003 | French and Spanish newspaper archives |
| Associated Press (AP) | News articles | Text | 0.002 | AP news stories over 100 years |
| Wikipedia – English | Wiki | Text | 0.002 | Full English Wikipedia article dump |
| Dotdash Meredith | News articles | Text | 0.001 | About.com and Meredith publishers archive |
| Financial Times | News articles | Text | 0.001 | FT.com business news article archive |
| Axel Springer / Business Insider | Magazines | Text | 0.001 | Business Insider licensed articles corpus |
| Icelandic Government Corpus | Multilingual | Text | 0.001 | Icelandic language parliamentary records corpus |
| Khan Academy | Q&A | Text | 0.001 | Videos, articles, exercises educational content |
| ExamSolutions | Q&A | Text | 0.001 | UK math exam question dataset |
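As a quick consistency check, the filtered token counts in the abridged table sum to roughly 70 T, matching the TL;DR figure; converting to bytes at an assumed ~4 bytes of UTF-8 text per token (an illustrative assumption, not a number from the report) lands close to the quoted ≈281 TB:

```python
# Sum the filtered token counts from the abridged Appendix C table.
tokens_t = [                    # trillions of tokens, one entry per table row
    50.0, 4.8, 4.0, 3.4, 3.2, 1.9, 1.3, 0.64, 0.40, 0.39,
    0.055, 0.038, 0.038, 0.025, 0.020, 0.015, 0.012, 0.008,
    0.003, 0.002, 0.002, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001,
]

total_tokens = sum(tokens_t) * 1e12
print(f"Filtered tokens: {total_tokens / 1e12:.1f} T")                   # ~70.3 T

# Assumed ~4 bytes of UTF-8 text per BPE token (assumption for illustration).
BYTES_PER_TOKEN = 4
print(f"Approx. size: {total_tokens * BYTES_PER_TOKEN / 1e12:.0f} TB")   # ~281 TB
```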