Large-language-model newcomers often bounce between scattered tutorials; Andrej Karpathy's video catalogue offers a single, logically layered on-ramp that starts with one-line-of-code neurons and ends with training a GPT-2-class model. The guide below turns that catalogue into an ordered study plan, matching each video to code, required maths, and suggested checkpoints so you can progress from "what's a token?" to "I fine-tuned my own mini-GPT" without corporate marketing or filler.
Getting oriented
Karpathy hosts three distinct learning tracks:
- Neural Networks: Zero to Hero — a seven-part playlist that builds up from scalar back-propagation to a vanilla transformer, culminating in a full GPT implemented live on screen.
- Stand-alone deep-dive talks such as Intro to Large Language Models and the Stanford CS-25 transformer lecture; these supply the conceptual scaffolding (attention, scaling laws, context windows) you'll need before writing much code.
- Project livestreams—Let's build GPT, Let's build the GPT Tokenizer, and the three-hour Deep Dive into LLMs like ChatGPT—where he assembles and interrogates working code bases such as nanoGPT.
Bookmark the course home-page for updated links, slides, and a busy Discord study group.
What you should already know
- Python & basic NumPy — all examples use idiomatic vectorised code.
- A dash of calculus and linear algebra — enough to follow the derivation of gradients and matrix multiplications.
- A CUDA-capable GPU if you intend to train the larger nanoGPT models locally. The README lists memory footprints and runtimes.
If you are short on maths, watch the first Zero-to-Hero episode (the micrograd back-propagation walkthrough) before moving on; it derives every gradient by hand.
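For a flavour of what that first episode builds, here is a tiny scalar-autograd sketch in the spirit of micrograd; it is illustrative only, not the episode's exact code.

```python
class Value:
    """Tiny scalar autograd node, in the spirit of micrograd (illustrative only)."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(out)/d(self) = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order, then apply the chain rule node by node
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# y = a*b + a  ->  dy/da = b + 1 = 4, dy/db = a = 2
a, b = Value(2.0), Value(3.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)   # 4.0 2.0
```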
Roadmap: videos, checkpoints, and code
Stage 1 Conceptual warm-up (1 evening)
Watch Intro to Large Language Models (and, optionally, the Stanford CS-25 transformer lecture) to pick up the core vocabulary of tokens, attention, context windows, and scaling laws before you write any code.
Stage 2 Char-level GPT from scratch (weekend project)
Follow the two-hour livestream where Karpathy codes a minimal transformer live in plain PyTorch, with no high-level training frameworks.
Checkpoint after each milestone: bigram model → adding multi-head attention → residual feed-forward blocks → training on tiny-Shakespeare for 10 epochs. Compare your loss curve with the video's; a large mismatch usually means a bug.
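When you reach the attention milestone, it helps to have a reference open. The sketch below is a minimal single-head causal self-attention module in plain PyTorch, close in spirit to what the livestream builds; class and variable names are illustrative, not Karpathy's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    """One head of masked (causal) self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                        # (B, T, head_size)
        q = self.query(x)                                      # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5    # scaled dot-product scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)                           # attention weights
        return wei @ self.value(x)                             # (B, T, head_size)
```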
Stage 3 Tokenisation mechanics (half-day)
Re-implement BPE merges for a five-word toy corpus, then feed your trained tokenizer back into the model from Stage 2 and measure perplexity drop.
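A minimal merge loop for that exercise might look like the sketch below, which follows the byte-level BPE idea from the tokenizer livestream; the toy corpus and the number of merges are arbitrary illustrations.

```python
from collections import Counter

def get_pair_counts(ids):
    """Count adjacent token-id pairs in a sequence."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Toy corpus: learn a handful of merges on raw UTF-8 bytes.
text = "low lower lowest low low"
ids = list(text.encode("utf-8"))
merges = {}
for step in range(5):                         # 5 merges, purely for illustration
    pair = get_pair_counts(ids).most_common(1)[0][0]
    new_id = 256 + step                       # byte values already occupy 0..255
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
print(merges)                                 # learned merge table
```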
Stage 4 Scaling up with nanoGPT (one week)
Clone the repository (42 k GitHub stars) and read the top-level train.py and model.py—each ≈300 lines.
Karpathy's livestream shows how to reproduce GPT-2 (124 M parameters) on OpenWebText with a single 8×A100 node in four days; you can instead fine-tune the provided 124 M checkpoint on a custom dataset in a few hours on one consumer GPU:
- run the provided `prepare.py` to tokenise your corpus into binary shards;
- adjust the config file for context length and batch size (see the sketch after this list);
- launch `train.py` and monitor the logged loss (nanoGPT can optionally log to Weights & Biases);
- use `sample.py` to spot-check generations as training progresses.
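As a concrete starting point for the fine-tuning route, here is a hypothetical config in the style of nanoGPT's config/ files; the key names mirror config/finetune_shakespeare.py, but defaults drift between versions, so check them against your checkout.

```python
# Hypothetical fine-tuning config in the style of nanoGPT's config/ files.
# Launch with: python train.py config/finetune_mydata.py   (path is illustrative)
out_dir = 'out-mydata'
init_from = 'gpt2'                  # warm-start from the released 124M checkpoint
dataset = 'mydata'                  # expects data/mydata/{train,val}.bin written by prepare.py
batch_size = 4                      # micro-batch per step; keep small on a consumer GPU
gradient_accumulation_steps = 8     # effective batch = 4 * 8 sequences
block_size = 512                    # context length
learning_rate = 3e-5                # much lower than the pre-training rate
decay_lr = False
max_iters = 2000
eval_interval = 250
always_save_checkpoint = True
```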
For a gentler start, the companion build-nanogpt repo keeps each commit atomic so you can diff every change.
Stage 5 System-level understanding (long weekend)
Set aside about 3 h 30 min for the chaptered Deep Dive into LLMs like ChatGPT lecture, which ties tokenizer-level details to alignment, RLHF, tool use, and system prompts.
Create a personal study sheet: pause after each chapter (pre-training, post-training, safety, inference) and summarise the main equations or heuristics; this kind of active recall sticks far better than passive viewing.
Stage 6 Evaluation and fine-tuning practice (ongoing)
- Watch the State of GPT talk to learn typical error categories and benchmarking suites.
- Evaluate your fine-tuned model with `lm-eval-harness` on the `hellaswag` and `arc_easy` tasks (a sketch follows this list).
- Iterate: identify failure patterns (e.g., arithmetic), add synthetic training data, and re-fine-tune for a few epochs.
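If you go through EleutherAI's harness, the sketch below assumes its v0.4-style Python API (lm_eval.simple_evaluate) and a fine-tuned checkpoint exported in Hugging Face format at a hypothetical local path; argument names vary between harness releases, so verify them against the project README.

```python
# Sketch: scoring a fine-tuned checkpoint with lm-eval-harness (v0.4-style API).
# The checkpoint path and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./my-gpt2-finetune",   # hypothetical local HF-format checkpoint
    tasks=["hellaswag", "arc_easy"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```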
Auxiliary resources and community
- Karpathy's X thread announcing Let's build GPT contains the original motivation and a link to community Q&A.
- The Discord server linked on the Zero-to-Hero page hosts weekly study sessions and troubleshooting channels.
- Writers on Medium and personal blogs post condensed notes; for example, a February 2025 TL;DW captures the 3-hour deep dive in 15 minutes.
- Stanford's Transformers United playlist houses complementary guest lectures on retrieval and alignment that pair well with Stage 5.
Common stumbling blocks and fixes
- Gradient explosion at higher learning rates when scaling from char-level to BPE; lower `lr` by 10× and enable gradient clipping.
- CUDA out-of-memory on 12 GB GPUs; set `micro_batch_size=4` and accumulate gradients (see the sketch after this list).
- Mismatched tokenizer checkpoints produce gibberish; always save the exact `encoder.json` and `vocab.bpe` used during training.
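The first two fixes combine in a few lines. The sketch below is a generic PyTorch loop, not nanoGPT's actual train.py; it assumes a model whose forward pass returns (logits, loss) in the nanoGPT style.

```python
import torch

def train_with_accumulation(model, optimizer, loader, accum_steps=8, clip_norm=1.0):
    """Gradient accumulation plus clipping: fits a large effective batch on a 12 GB GPU
    and guards against gradient blow-ups at higher learning rates."""
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):
        logits, loss = model(x, y)            # assumes a nanoGPT-style forward returning (logits, loss)
        (loss / accum_steps).backward()       # scale so gradients average over the accumulation window
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```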
Where to go next
After mastering Karpathy's path:
- Explore retrieval-augmented generation (RAG) to feed documents to your model.
- Implement lightweight alignment via preference ranking on your fine-tuned GPT.
- Re-watch the deep-dive chapters on tool invocation and function-calling once you have an API key for a hosted LLM and can compare behaviours.
By following the sequence above—lecture context, code walkthrough, project build, and evaluation—you will move from first principles to a functioning GPT clone and a clear mental model of modern LLM pipelines, all guided by Karpathy's transparent coding style and without drifting into vendor hype.