Train Your Own LLM From Scratch

A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.

Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.

This workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it to a ~10M param model that trains on a laptop in under an hour — designed to be completed in a single workshop session.

No black-box libraries. No `model = AutoModel.from_pretrained()`. You build it all.

What You'll Build

A working GPT model trained from scratch on your laptop, capable of generating Shakespeare-like text. You'll write:

- **Tokenizer** — turning text into numbers the model can process
- **Model architecture** — the transformer: embeddings, attention, feed-forward layers
- **Training loop** — forward pass, loss, backprop, optimizer, learning rate scheduling
- **Text generation** — sampling from your trained model
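The training loop is the heart of the pipeline. As a rough sketch in PyTorch (the function name `train_steps` and its signature are illustrative, not the workshop's actual API):

```python
import torch

def train_steps(model, data, targets, steps=3, lr=1e-2):
    """Run a few optimization steps: forward pass, loss, backprop, update."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    for _ in range(steps):
        logits = model(data)                                   # forward pass
        loss = torch.nn.functional.cross_entropy(logits, targets)
        opt.zero_grad(set_to_none=True)                        # clear old gradients
        loss.backward()                                        # backprop
        opt.step()                                             # optimizer update
        losses.append(loss.item())
    return losses
```

The real `train.py` you'll write adds batching, learning rate scheduling, and periodic evaluation on top of this skeleton.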

Prerequisites

- Any laptop or desktop (Mac, Linux, or Windows)
- Python 3.12+
- Comfort reading Python code (you don't need ML experience)

Training automatically uses an Apple Silicon GPU (MPS), an NVIDIA GPU (CUDA), or the CPU. It also works on Google Colab: upload the files and run with `!python train.py`.
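The device autodetection can be a few lines of PyTorch (a sketch; the helper name `pick_device` is illustrative):

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```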

Getting Started

Local (recommended)

Install uv if you don't have it:

```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Then set up the project:

```bash
uv sync
mkdir scratchpad && cd scratchpad
```

Google Colab

If you don't have a local setup, upload the repo to Colab and install dependencies:

```bash
!pip install torch numpy tqdm tiktoken
```

Upload `data/shakespeare.txt` to your Colab files, then write your code in notebook cells, or upload `.py` files and run them with `!python train.py`.

Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you'll have a working `model.py`, `train.py`, and `generate.py` that you wrote yourself.
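The generation step reduces to repeatedly sampling from the model's next-token distribution. A minimal sketch of one sampling step (the function name `sample_next` is illustrative, not the workshop's exact code):

```python
import torch

def sample_next(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sample one token ID from next-token logits.

    Lower temperature sharpens the distribution (more deterministic),
    higher temperature flattens it (more random).
    """
    probs = torch.softmax(logits / temperature, dim=-1)  # logits → probabilities
    return torch.multinomial(probs, num_samples=1).item()
```

Your `generate.py` will loop this: feed the context through the model, sample the next token, append it, and repeat.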

Architecture: GPT at a Glance

```
Input Text
     │
     ▼
┌─────────────────┐
│    Tokenizer    │  "hello" → [20, 43, 50, 50, 53]  (character-level)
└────────┬────────┘
         ▼
┌─────────────────┐
│  Token Embed +  │  token IDs → vectors (n_embd dimensions)
│ Position Embed  │  + positional information
└────────┬────────┘
         ▼
┌─────────────────┐
│  Transformer    │  × n_layer
│  Block:         │
│ ┌────────────┐  │
│ │ LayerNorm  │  │
│ │ Self-Attn  │  │  n_head parallel attention heads
│ │ + Residual │  │
│ ├────────────┤  │
│ │ LayerNorm  │  │
│ │ MLP (FFN)  │  │  expand 4x, GELU, project back
│ │ + Residual │  │
│ └────────────┘  │
└────────┬────────┘
         ▼
┌─────────────────┐
│   LayerNorm     │
│ Linear → logits │  vocab_size outputs (probability over next token)
└─────────────────┘
```
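The repeated transformer block in the diagram can be sketched in PyTorch. In the workshop you write the attention math yourself; here `nn.MultiheadAttention` stands in for that hand-written module, so treat this as a structural sketch rather than the workshop's implementation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One pre-norm transformer block: LayerNorm → self-attention → residual,
    then LayerNorm → MLP → residual."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),  # project back
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True entries are blocked, so each position
        # can only attend to itself and earlier positions.
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                 # residual connection
        x = x + self.mlp(self.ln2(x))    # residual connection
        return x
```

Stacking `n_layer` of these blocks between the embeddings and the final `Linear → logits` head gives the full model.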

Model Configs for This Workshop

| Config | Params | n_layer | n_head | n_embd | Train Time (M3 Pro) |
|---|---|---|---|---|---|
| Tiny | ~0.5M | 2 | 2 | 128 | ~5 min |
| Small | ~4M | 4 | 4 | 256 | ~20 min |
| Medium (default) | ~10M | 6 | 6 | 384 | ~45 min |

All configs use character-level tokenization (vocab_size=65) and block_size=256.

Tokenization: Characters vs BPE

This workshop uses character-level tokenization on Shakespeare. BPE tokenization (GPT-2's ~50k-token vocabulary) works poorly on small datasets: with only ~1MB of text, most of the 50k tokens appear too rarely for the model to learn patterns from.
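A character-level tokenizer fits in a few lines: build the vocabulary from the unique characters in the training text, then map each character to its index and back. A sketch (names like `stoi`/`itos` follow common convention, not necessarily the workshop's exact code):

```python
def build_tokenizer(text: str):
    """Return (encode, decode, vocab_size) for character-level tokenization."""
    chars = sorted(set(text))                      # unique chars, stable order
    stoi = {ch: i for i, ch in enumerate(chars)}   # char → int
    itos = {i: ch for ch, i in stoi.items()}       # int → char
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return encode, decode, len(chars)
```

Built on the full Shakespeare text, this yields the vocab_size=65 used by all the configs above.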

| Tokenizer | Vocab Size | Dataset Size Needed |
|---|---|---|
| Character-level | ~65 | Small (Shakespeare, ~1MB) |
| BPE (tiktoken) | 50,257 | Large (TinyStories+, 100MB+) |

Part 5 covers switching to BPE for larger datasets.