Benched.ai Editorial Team

What's in GPT-5?

A comprehensive overview of the diverse datasets that are likely used to train OpenAI's forthcoming flagship model.

TL;DR
• ~479 TB of raw text, code & multimedia metadata prior to filtering (~0.5 quadrillion tokens).
• ~70 T filtered tokens (≈281 TB) after deduplication & quality gating.
• Strong shift toward synthetic data and high-quality licensed corpora (News Corp, Reddit, Stack Exchange, etc.).

The behaviour, knowledge, and limitations of every large language model are a direct function of the corpus it ingests during pre-training. Understanding the provenance, scale, and composition of that corpus is therefore essential for:

  1. Gauging factual coverage and potential blind-spots.
  2. Anticipating biases introduced by over- or under-represented domains.
  3. Estimating compute requirements and likely model capacity.
  4. Designing robust evaluation strategies (e.g. held-out benchmarks).

  High-level breakdown

| Category | Representative sources | Unfiltered tokens | Share |
|---|---|---|---|
| Synthetic | Cosmopedia-style LLM-generated textbooks & dialogues | ≈50 T | 10.4 % |
| Web (general) | Common Crawl (filtered) | ≈4 T | 0.8 % |
| Discussion & Q&A | Reddit posts & comments, Stack Exchange | ≈5.4 T | 1.1 % |
| Academic | Consensus NLP, arXiv, papers | ≈3.4 T | 0.7 % |
| News / Publishers | News Corp, Time Inc, FT, AP, Vox, etc. | ≈7 T | 1.4 % |
| Code | GitHub | ≈0.4 T | 0.1 % |
| Legal | FreeLaw PACER & opinions | ≈0.44 T | 0.1 % |
| Books | Books 1-3, Wikipedia | ≈54 B | <0.1 % |
| Other | Shutterstock metadata, Icelandic language corpora, Khan Academy, etc. | ≈1 T | 0.2 % |
| Total (pre-filter) | | ≈479 T | 100 % |

Numbers come from Alan D. Thompson's July 2024 report What's in GPT-5? and public partnership disclosures by OpenAI. See Appendix C for the full CSV.
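The Share column is simply each category's unfiltered token count divided by the ≈479 T pre-filter total. A minimal Python sketch (token counts copied from the table above, so small rounding differences are expected) reproduces it:

```python
# Recompute the "Share" column: category tokens / ~479 T pre-filter total.
PRE_FILTER_TOTAL_T = 479        # trillions of tokens, from the Total row above

categories_t = {                # unfiltered tokens in trillions, from the table
    "Synthetic": 50,
    "Web (general)": 4,
    "Discussion & Q&A": 5.4,
    "Academic": 3.4,
    "News / Publishers": 7,
    "Code": 0.4,
    "Legal": 0.44,
    "Books": 0.054,             # ~54 B tokens
    "Other": 1,
}

for name, tokens in categories_t.items():
    share = 100 * tokens / PRE_FILTER_TOTAL_T
    print(f"{name:<18} {tokens:>6.2f} T  {share:5.2f} %")

# Note: the listed rows sum to only ~72 T; the Total row reports the full
# ~479 T pre-filter corpus, so shares are computed against that larger figure.
```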

  Synthetic data

OpenAI appears to have embraced self-generated content at scale, leveraging earlier models (GPT-4-class) plus human curation to amplify domain-specific coverage. The approach resembles Microsoft's phi textbooks and Hugging Face's Cosmopedia:

20+ M prompts → 25 B+ synthetic tokens → filtered for quality & diversity → blended with real-world corpora
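A minimal sketch of what such a pipeline could look like is below. Everything in it is an illustrative stand-in: the teacher-model call, the quality classifier, and the SHA-1 exact-dedup step are hypothetical choices, not a description of OpenAI's actual tooling.

```python
import hashlib
import random
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class Sample:
    prompt: str
    text: str
    quality: float

def teacher_complete(prompt: str) -> str:
    """Stand-in for a GPT-4-class teacher model call (hypothetical)."""
    return f"Textbook-style explanation for: {prompt}"

def quality_score(text: str) -> float:
    """Stand-in for a learned quality classifier (hypothetical)."""
    return random.random()

def synthetic_pipeline(prompts: Iterable[str], min_quality: float = 0.8) -> Iterator[Sample]:
    """Prompts -> synthetic text -> quality gate -> exact dedup -> blend-ready samples."""
    seen: set[str] = set()
    for prompt in prompts:                             # 20+ M prompts at production scale
        text = teacher_complete(prompt)                # generate the synthetic sample
        score = quality_score(text)
        if score < min_quality:                        # quality filtering
            continue
        digest = hashlib.sha1(text.lower().encode()).hexdigest()
        if digest in seen:                             # crude diversity / dedup filter
            continue
        seen.add(digest)
        yield Sample(prompt, text, score)              # blended with real-world corpora downstream

# Example usage:
kept = list(synthetic_pipeline(["Explain photosynthesis", "Derive the quadratic formula"]))
```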

Advantages:

• Effectively unlimited supply of "textbook-quality" explanations
• Easy control over style, difficulty, and domain
• Reduced licensing constraints compared with copyrighted web text

Risks:

• Distributional shift if synthetic samples dominate
• Error amplification / model drift

  Licensed publisher content

Since mid-2023, OpenAI has signed content partnerships with major platforms and publishers (Reddit, News Corp, the Financial Times, Dotdash Meredith, Time, Vox Media, AP). These deals grant access to paywalled archives that:

• Expand temporal coverage back to the 1940s (Factiva, AP)
• Provide authoritative writing styles useful for RLHF preference models
• Mitigate the legal exposure that comes with scraping copyrighted content

  Parameter count estimate

Assuming a Chinchilla-style compute-optimal regime ( #tokens ≈ 20 × #params ) and ~70 T final training tokens, GPT-5 could weigh in at roughly 3.5 T parameters; a leaner ~1.3 T-parameter model would imply training well past the compute-optimal point, at roughly 54 tokens per parameter. Hardware advances (H100-class GPUs and their successors, in-house ASICs) make such scale plausible by 2025.
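The arithmetic behind that range, as a quick sketch (the 20-tokens-per-parameter ratio is the Chinchilla heuristic, and C ≈ 6·N·D is the standard training-FLOPs approximation):

```python
# Back-of-the-envelope sizing from the ~70 T filtered token budget.
TOKENS = 70e12                      # ~70 T filtered training tokens (see TL;DR)
TOKENS_PER_PARAM = 20               # Chinchilla compute-optimal heuristic

compute_optimal_params = TOKENS / TOKENS_PER_PARAM
print(f"Compute-optimal size: {compute_optimal_params / 1e12:.1f} T params")   # ~3.5 T

# A ~1.3 T-parameter model on the same data would be trained past that point:
params = 1.3e12
print(f"Tokens per parameter at 1.3 T params: {TOKENS / params:.0f}")          # ~54

# Rough training compute via the C ~ 6 * N * D approximation (FLOPs):
flops = 6 * compute_optimal_params * TOKENS
print(f"Training compute at 3.5 T params: {flops:.1e} FLOPs")                  # ~1.5e27
```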

  Further reading

• Thompson, A.D. (2024). What's in GPT-5? LifeArchitect.ai.
• Hugging Face (2024). Cosmopedia: Synthetic data at scale.
• OpenAI (2024). Data & Partnerships blog posts.

  Appendix C – Selected dataset table

(CSV abridged for brevity; see linked sheet for full numbers.)

| Dataset | Type | Modality | Tokens (T, filtered) | Notes |
|---|---|---|---|---|
| Synthetic | Synthetic (LLM-generated) | Text | 50.0 | Synthetic LLM textbooks and dialogues |
| Reddit Posts (outbound links) | Web | Text | 4.8 | Top-half quality outbound Reddit posts |
| Common Crawl – General | Web | Text | 4.0 | FineWeb filtered Common Crawl subset |
| Consensus NLP | Academic papers | Text | 3.4 | 200 M research papers via Consensus |
| YouTube Subtitles | Dialogue | Text | 3.2 | 2 % sample of video subtitles |
| News Corp / WSJ | News articles | Text | 1.9 | Licensed Dow Jones Factiva archive |
| Common Crawl – Edu | Web (edu) | Text | 1.3 | FineWeb-Edu 1.3 T educational crawl |
| Reddit Comments | Dialogue | Text | 0.64 | 2 % high-quality Reddit comments |
| FreeLaw – PACER | Legal | Text | 0.40 | US federal court filings corpus |
| GitHub | Code | Code | 0.39 | 163 M public repos snapshot |
| Books2 (GPT-3) | Journals | Text | 0.055 | Books2 corpus from GPT-3 training |
| FreeLaw – Opinions | Legal | Text | 0.038 | US court opinions FreeLaw corpus |
| Stack Exchange | Q&A | Text | 0.038 | 50 M questions across Exchange sites |
| Books3 (The Pile) | Journals | Text | 0.025 | Books3 subset from The Pile |
| Shutterstock Metadata | Metadata | Text | 0.020 | Image and video metadata descriptions |
| Wikipedia – Multilingual | Wiki | Text | 0.015 | All multilingual Wikipedia article dumps |
| Books1 (GPT-3) | Books | Text | 0.012 | Project Gutenberg style public books |
| Time Inc | News articles | Text | 0.008 | Time magazine eight-million-article archive |
| Le Monde & Prisa Media | News articles | Text | 0.003 | French and Spanish newspaper archives |
| Associated Press (AP) | News articles | Text | 0.002 | AP news stories over 100 years |
| Wikipedia – English | Wiki | Text | 0.002 | Full English Wikipedia article dump |
| Dotdash Meredith | News articles | Text | 0.001 | About.com and Meredith publishers archive |
| Financial Times | News articles | Text | 0.001 | FT.com business news article archive |
| Axel Springer / Business Insider | Magazines | Text | 0.001 | Business Insider licensed articles corpus |
| Icelandic Government Corpus | Multilingual | Text | 0.001 | Icelandic language parliamentary records corpus |
| Khan Academy | Q&A | Text | 0.001 | Videos, articles, exercises educational content |
| ExamSolutions | Q&A | Text | 0.001 | UK math exam question dataset |
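As a quick consistency check, the filtered token counts in the abridged table sum to roughly 70 T, matching the TL;DR figure; converting to bytes at an assumed ~4 bytes of UTF-8 text per token (an illustrative assumption, not a number from the report) lands close to the quoted ≈281 TB:

```python
# Sum the filtered token counts from the abridged Appendix C table.
tokens_t = [                    # trillions of tokens, one entry per table row
    50.0, 4.8, 4.0, 3.4, 3.2, 1.9, 1.3, 0.64, 0.40, 0.39,
    0.055, 0.038, 0.038, 0.025, 0.020, 0.015, 0.012, 0.008,
    0.003, 0.002, 0.002, 0.001, 0.001, 0.001, 0.001, 0.001, 0.001,
]

total_tokens = sum(tokens_t) * 1e12
print(f"Filtered tokens: {total_tokens / 1e12:.1f} T")                   # ~70.3 T

# Assumed ~4 bytes of UTF-8 text per BPE token (assumption for illustration).
BYTES_PER_TOKEN = 4
print(f"Approx. size: {total_tokens * BYTES_PER_TOKEN / 1e12:.0f} TB")   # ~281 TB
```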