Large-language-model newcomers often bounce between scattered tutorials; Andrej Karpathy's video catalogue offers a single, logically layered on-ramp that starts with one-line-of-code neurons and ends with training a GPT-2-class model. The guide below turns that catalogue into an ordered study plan, matching each video to code, required maths, and suggested checkpoints so you can progress from "what's a token?" to "I fine-tuned my own mini-GPT" without corporate marketing or filler.
Getting oriented
Karpathy hosts three distinct learning tracks:
- Neural Networks: Zero to Hero — a seven-part playlist that builds up from scalar back-propagation to a vanilla transformer, culminating in a full GPT implemented live on screen.
- Stand-alone deep-dive talks such as Intro to Large Language Models and the Stanford CS-25 transformer lecture; these supply the conceptual scaffolding (attention, scaling laws, context windows) you'll need before writing much code.
- Project livestreams—Let's build GPT, Let's build the GPT Tokenizer, and the three-hour Deep Dive into LLMs like ChatGPT—where he assembles and interrogates working code bases such as nanoGPT.
Bookmark the course home-page for updated links, slides, and a busy Discord study group.
What you should already know
- Python & basic NumPy — all examples use idiomatic vectorised code.
- A dash of calculus and linear algebra — enough to follow the derivation of gradients and matrix multiplications.
- A CUDA-capable GPU if you intend to train the larger nanoGPT models locally. The README lists memory footprints and runtimes.
If you are short on maths, watch the first Zero-to-Hero episode (the micrograd back-propagation walkthrough) before moving on; it derives every gradient by hand.
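For a flavour of what that first episode builds, here is a tiny scalar-autograd sketch in the spirit of micrograd; it is illustrative only, not the episode's exact code.

```python
class Value:
    """Tiny scalar autograd node, in the spirit of micrograd (illustrative only)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(out)/d(self) = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order, then apply the chain rule node by node
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# y = a*b + a  ->  dy/da = b + 1 = 4, dy/db = a = 2
a, b = Value(2.0), Value(3.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)   # 4.0 2.0
```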
Roadmap: videos, checkpoints, and code
Stage 1 Conceptual warm-up (1 evening)
Watch Intro to Large Language Models (and, optionally, the Stanford CS-25 transformer lecture) to pick up the core vocabulary of tokens, attention, context windows, and scaling laws before you write any code.
Stage 2 Char-level GPT from scratch (weekend project)
Follow the two-hour livestream where Karpathy codes a minimal transformer live in plain PyTorch, with no high-level training frameworks.
Checkpoint after each milestone: bigram model → adding multi-head attention → residual feed-forward blocks → training on tiny-Shakespeare for 10 epochs. Compare your loss curve with the video's; a large mismatch usually means a bug.
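When you reach the attention milestone, it helps to have a reference open. The sketch below is a minimal single-head causal self-attention module in plain PyTorch, close in spirit to what the livestream builds; class and variable names are illustrative, not Karpathy's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                        # (B, T, head_size)
        q = self.query(x)                                      # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # scaled dot-product scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                           # attention weights
        return wei @ self.value(x)                             # (B, T, head_size)
```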
Stage 3 Tokenisation mechanics (half-day)
Re-implement BPE merges for a five-word toy corpus, then feed your trained tokenizer back into the model from Stage 2 and measure perplexity drop.
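A minimal merge loop for that exercise might look like the sketch below, which follows the byte-level BPE idea from the tokenizer livestream; the toy corpus and the number of merges are arbitrary illustrations.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent token-id pairs in a sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Toy corpus: learn a handful of merges on raw UTF-8 bytes.
text = "low lower lowest low low"
ids = list(text.encode("utf-8"))
merges = {}
for step in range(5):                         # 5 merges, purely for illustration
    pair = get_pair_counts(ids).most_common(1)[0][0]
    new_id = 256 + step                       # byte values already occupy 0..255
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
print(merges)                                 # learned merge table
```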
Stage 4 Scaling up with nanoGPT (one week)
Clone the repository (42 k GitHub stars) and read the top-level train.py and model.py—each ≈300 lines.
Karpathy's livestream shows how to reproduce GPT-2 (124 M parameters) on OpenWebText with a single 8×A100 node in four days; you can instead fine-tune the provided 124 M checkpoint on a custom dataset in a few hours on one consumer GPU:
- run the provided `prepare.py` to tokenise your corpus into binary shards;
- adjust the config file for context length and batch size (see the sketch after this list);
- launch `train.py` and monitor the logged loss (nanoGPT can optionally log to Weights & Biases);
- use `sample.py` to spot-check generations as training progresses.
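As a concrete starting point for the fine-tuning route, here is a hypothetical config in the style of nanoGPT's config/ files; the key names mirror config/finetune_shakespeare.py, but defaults drift between versions, so check them against your checkout.

```python
# Hypothetical fine-tuning config in the style of nanoGPT's config/ files.
# Launch with: python train.py config/finetune_mydata.py   (path is illustrative)
out_dir = 'out-mydata'
init_from = 'gpt2'                  # warm-start from the released 124M checkpoint
dataset = 'mydata'                  # expects data/mydata/{train,val}.bin written by prepare.py
batch_size = 4                      # micro-batch per step; keep small on a consumer GPU
gradient_accumulation_steps = 8     # effective batch = 4 * 8 sequences
block_size = 512                    # context length
learning_rate = 3e-5                # much lower than the pre-training rate
decay_lr = False
max_iters = 2000
eval_interval = 250
always_save_checkpoint = True
```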
For a gentler start, the companion build-nanogpt repo keeps each commit atomic so you can diff every change.
Stage 5 System-level understanding (long weekend)
Set aside about 3 h 30 min for the chaptered Deep Dive into LLMs like ChatGPT lecture, which ties tokenizer-level details to alignment, RLHF, tool use, and system prompts.
Create a personal study sheet: pause after each chapter (pre-training, post-training, safety, inference) and summarise the main equations or heuristics; this kind of active recall sticks far better than passive viewing.
Stage 6 Evaluation and fine-tuning practice (ongoing)
- Watch the State of GPT talk to learn typical error categories and benchmarking suites.
- Evaluate your fine-tuned model with `lm-eval-harness` on the `hellaswag` and `arc_easy` tasks (a sketch follows this list).
- Iterate: identify failure patterns (e.g., arithmetic), add synthetic training data, and re-fine-tune for a few epochs.
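If you go through EleutherAI's harness, the sketch below assumes its v0.4-style Python API (lm_eval.simple_evaluate) and a fine-tuned checkpoint exported in Hugging Face format at a hypothetical local path; argument names vary between harness releases, so verify them against the project README.

```python
# Sketch: scoring a fine-tuned checkpoint with lm-eval-harness (v0.4-style API).
# The checkpoint path and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./my-gpt2-finetune",   # hypothetical local HF-format checkpoint
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```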
Auxiliary resources and community
- Karpathy's X thread announcing Let's build GPT contains the original motivation and a link to community Q&A.
- The Discord server linked on the Zero-to-Hero page hosts weekly study sessions and troubleshooting channels.
- Writers on Medium and personal blogs post condensed notes; for example, a February 2025 TL;DW captures the 3-hour deep dive in 15 minutes.
- Stanford's Transformers United playlist houses complementary guest lectures on retrieval and alignment that pair well with Stage 5.
Common stumbling blocks and fixes
- Gradient explosion at higher learning rates when scaling from char-level to BPE; lower `lr` by 10× and enable gradient clipping.
- CUDA out-of-memory on 12 GB GPUs; set `micro_batch_size=4` and accumulate gradients (see the sketch after this list).
- Mismatched tokenizer checkpoints produce gibberish; always save the exact `encoder.json` and `vocab.bpe` used during training.
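The first two fixes combine in a few lines. The sketch below is a generic PyTorch loop, not nanoGPT's actual train.py; it assumes a model whose forward pass returns (logits, loss) in the nanoGPT style.

```python
import torch

def train_with_accumulation(model, optimizer, loader, accum_steps=8, clip_norm=1.0):
    """Gradient accumulation plus clipping: fits a large effective batch on a 12 GB GPU
    and guards against gradient blow-ups at higher learning rates."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):
        logits, loss = model(x, y)            # assumes a nanoGPT-style forward returning (logits, loss)
        (loss / accum_steps).backward()       # scale so gradients average over the accumulation window
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```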
Where to go next
After mastering Karpathy's path:
- Explore retrieval-augmented generation (RAG) to feed documents to your model.
- Implement lightweight alignment via preference ranking on your fine-tuned GPT.
- Re-watch the deep-dive chapters on tool invocation and function-calling once you have an API key for a hosted LLM and can compare behaviours.
By following the sequence above—lecture context, code walkthrough, project build, and evaluation—you will move from first principles to a functioning GPT clone and a clear mental model of modern LLM pipelines, all guided by Karpathy's transparent coding style and without drifting into vendor hype.