Learn LLMs the Karpathy Way

Stephen M. Walker II

A structured study plan that pairs Karpathy's video catalogue with hands-on checkpoints so you can progress from "what's a token?" to training your own mini-GPT.

Large-language-model newcomers often bounce between scattered tutorials; Andrej Karpathy's video catalogue offers a single, logically layered on-ramp that starts with one-line-of-code neurons and ends with training a GPT-2-class model. The guide below turns that catalogue into an ordered study plan, matching each video to code, required maths, and suggested checkpoints so you can progress from "what's a token?" to "I fine-tuned my own mini-GPT" without corporate marketing or filler.

  Getting oriented

Karpathy hosts three distinct learning tracks:

  • Neural Networks: Zero to Hero — a seven-part playlist that builds up from scalar back-propagation to a vanilla transformer, culminating in a full GPT implemented live on screen.
  • Stand-alone deep-dive talks such as Intro to Large Language Models and the Stanford CS-25 transformer lecture; these supply the conceptual scaffolding (attention, scaling laws, context windows) you'll need before writing any serious code.
  • Project livestreams—Let's build GPT, Let's build the GPT Tokenizer, and the three-hour Deep Dive into LLMs like ChatGPT—where he assembles and interrogates working code bases such as nanoGPT.

Bookmark the course home-page for updated links, slides, and a busy Discord study group.

  What you should already know

  1. Python & basic NumPy — all examples use idiomatic vectorised code.
  2. A dash of calculus and linear algebra — enough to follow the derivation of gradients and matrix multiplications.
  3. A CUDA-capable GPU if you intend to train the larger nanoGPT models locally. The README lists memory footprints and runtimes.

If you are short on maths, watch the first Zero-to-Hero episode, "The spelled-out intro to neural networks and backpropagation: building micrograd", which rebuilds backpropagation from scratch with nothing more than Python scalars.

  Roadmap: videos, checkpoints, and code

    Stage 1 Conceptual warm-up (1 evening)

Mini-exercises (a short sketch of both follows):

  • Use tiktoken to tokenise a paragraph and inspect the token IDs.
  • Hand-compute a two-token self-attention example in a notebook.
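
A minimal sketch of both exercises, assuming tiktoken is installed (pip install tiktoken); the toy weight matrices are illustrative, chosen so every number can be checked by hand:

```python
import numpy as np
import tiktoken

# Exercise 1: tokenise a paragraph and inspect the token IDs.
enc = tiktoken.get_encoding("gpt2")
text = "Large language models predict the next token."
ids = enc.encode(text)
print(ids)                               # list of integer token IDs
print([enc.decode([i]) for i in ids])    # how the text was split into tokens

# Exercise 2: two-token self-attention, small enough to verify by hand.
x = np.array([[1.0, 0.0, 1.0, 0.0],      # token 1 embedding (d_model = 4)
              [0.0, 1.0, 0.0, 1.0]])     # token 2 embedding
Wq = Wk = Wv = np.eye(4)                 # identity projections keep the arithmetic simple
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(4)            # scaled dot-product scores, shape (2, 2)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
print(weights)                           # attention weights
print(weights @ V)                       # attended output, shape (2, 4)
```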

    Stage 2 Char-level GPT from scratch (weekend project)

Follow the two-hour livestream where Karpathy codes a minimal transformer live in plain PyTorch, with no high-level training frameworks on top.

Checkpoint after each milestone: bigram model → adding multi-head attention → residual feed-forward blocks → training on tiny-Shakespeare for 10 epochs. Compare your loss curve to the video's; a large mismatch usually means a bug. A minimal sketch of the first milestone follows.
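
A minimal bigram language model, the first Stage 2 milestone, sketched in plain PyTorch (the names are illustrative, not Karpathy's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLM(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token reads the logits for the next token straight out of a lookup table.
        self.token_embedding = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding(idx)                 # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)    # distribution over the next token
            idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
        return idx
```

At initialisation the loss should sit near ln(vocab_size), about 4.17 for the 65-character tiny-Shakespeare vocabulary, which is a quick sanity check before you add attention.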

    Stage 3 Tokenisation mechanics (half-day)

Re-implement BPE merges for a five-word toy corpus, then feed your trained tokenizer back into the model from Stage 2 and measure perplexity drop.
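
A toy version of the BPE merge loop, simplified to operate on whitespace-separated symbols within words (it skips the byte-level details the tokenizer video covers); the five-word corpus and merge count below are arbitrary:

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of the symbol pair with its concatenation."""
    a, b = pair
    pattern = re.compile(r"(?<!\S)" + re.escape(f"{a} {b}") + r"(?!\S)")
    return {pattern.sub(f"{a}{b}", word): freq for word, freq in words.items()}

# Five-word toy corpus: space-separated symbols with word frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3, "l o w e s t": 1}
merges = []
for _ in range(10):                         # learn 10 merges
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)        # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    merges.append(best)
print(merges)                               # the learned merge rules, in order
```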

    Stage 4 Scaling up with nanoGPT (one week)

Clone the repository (42 k GitHub stars) and read the top-level train.py and model.py—each ≈300 lines.

Karpathy's livestream shows how to reproduce GPT-2 (124 M parameters) on OpenWebText with a single 8×A100 node in four days; you can instead fine-tune the provided 124 M checkpoint on a custom dataset in a few hours on one consumer GPU:

  • run the provided prepare.py to turn your corpus into the binary train/val shards train.py expects;
  • adjust the config for context length and batch size; nanoGPT uses plain Python config files, not YAML (a sketch follows this list);
  • launch train.py and watch the loss it prints, or enable the optional Weights & Biases logging;
  • use sample.py to generate text from the saved checkpoints as training progresses.
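
A sketch of such a fine-tuning config; the parameter names mirror nanoGPT's bundled config/finetune_shakespeare.py, but the file name, dataset name, and values here are placeholders, so check them against the version of the repo you cloned:

```python
# config/finetune_my_dataset.py  (hypothetical file name)
# nanoGPT configs are plain Python files whose assignments override the defaults in train.py.
out_dir = 'out-my-finetune'
init_from = 'gpt2'              # start from the released 124M GPT-2 checkpoint
dataset = 'my_dataset'          # expects data/my_dataset/train.bin and val.bin from prepare.py
batch_size = 4                  # per-device micro-batch; keep small on consumer GPUs
gradient_accumulation_steps = 8
learning_rate = 3e-5            # much lower than for pre-training
max_iters = 2000
decay_lr = False                # a constant learning rate is fine for a short fine-tune
```

Launch it with python train.py config/finetune_my_dataset.py, then run python sample.py --out_dir=out-my-finetune to inspect generations from the latest checkpoint.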

For a gentler start, the companion build-nanogpt repo keeps each commit atomic so you can diff every change.

    Stage 5 System-level understanding (long weekend)

Set aside 3 h 30 min for the chaptered lecture that ties tokenizer-level details to alignment, RLHF, tool use, and system prompts.

Create a personal study sheet: pause after each chapter (pre-training, post-training, safety, inference) and summarise the main equations or heuristics; this kind of active recall retains far more than passive viewing.

    Stage 6 Evaluation and fine-tuning practice (ongoing)

  1. Watch the State of GPT talk to learn typical error categories and benchmarking suites.

  2. Evaluate your fine-tuned model with EleutherAI's lm-evaluation-harness on the hellaswag and arc_easy tasks (a sketch follows this list).

  3. Iterate: identify failure patterns (e.g., arithmetic), add synthetic training data, and re-fine-tune for a few epochs.
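
A sketch of step 2, assuming the fine-tuned model has been exported as a Hugging Face-style checkpoint directory and that a recent lm-evaluation-harness release is installed (pip install lm-eval); its API has changed between versions, so confirm against the README of the version you use:

```python
import lm_eval

# The local path is a placeholder for wherever you exported the fine-tuned model.
results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face transformers backend
    model_args="pretrained=./my-finetuned-gpt2",  # hypothetical checkpoint directory
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
)
print(results["results"])                         # per-task accuracy and related metrics
```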

  Auxiliary resources and community

  • Karpathy's X thread announcing Let's build GPT contains the original motivation and a link to community Q&A.
  • The Discord server linked on the Zero-to-Hero page hosts weekly study sessions and troubleshooting channels.
  • Writers on Medium and personal blogs post condensed notes; for example, a February 2025 TL;DW captures the 3-hour deep dive in 15 minutes.
  • Stanford's Transformers United playlist houses complementary guest lectures on retrieval and alignment that pair well with Stage 5.

  Common stumbling blocks and fixes

  • Gradient explosion at higher learning rates when moving from char-level to BPE tokens; lower the learning rate by 10× and enable gradient clipping (see the sketch after this list).
  • CUDA out-of-memory on 12 GB GPUs; shrink the per-device micro-batch (nanoGPT's batch_size) and raise gradient_accumulation_steps to keep the effective batch size constant.
  • Mismatched tokenizer checkpoints produce gibberish; always save the exact encoder.json and vocab.bpe used during training.
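
The first two fixes map onto two standard PyTorch idioms, shown here on a stand-in model rather than nanoGPT's actual training loop (which exposes the same knobs through its grad_clip and gradient_accumulation_steps config options):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                       # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
accum_steps = 8                                 # effective batch = micro-batch x accum_steps

for step in range(32):
    x = torch.randn(4, 16)                      # micro-batch of 4 fits a small GPU
    loss = ((model(x) - x) ** 2).mean()         # stand-in objective
    (loss / accum_steps).backward()             # scale so gradients average over the window
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # cap the gradient norm
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```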

  Where to go next

After mastering Karpathy's path:

  • Explore retrieval-augmented generation (RAG) to feed documents to your model.
  • Implement lightweight alignment via preference ranking on your fine-tuned GPT.
  • Re-watch the deep-dive chapters on tool invocation and function-calling once you have an API key for a hosted LLM and can compare behaviours.

By following the sequence above—lecture context, code walkthrough, project build, and evaluation—you will move from first principles to a functioning GPT clone and a clear mental model of modern LLM pipelines, all guided by Karpathy's transparent coding style and without drifting into vendor hype.