For many developers, the inner workings of Large Language Models (LLMs) can feel like a black box. While powerful, the scale and complexity of production-grade LLMs often obscure their foundational principles. Andrej Karpathy, known for his relentless pursuit of simplification in machine learning, tackles this challenge head-on with MicroGPT—a remarkable "art project" designed to distill the essence of a GPT model into its absolute bare essentials.

MicroGPT is a single, self-contained Python file, spanning a mere 200 lines, with zero external dependencies. This concise script contains the full algorithmic content required to train a GPT-like model and run inference with it. It is the culmination of projects like micrograd, makemore, and nanoGPT, representing a decade-long effort to simplify LLMs for pedagogical clarity. Karpathy describes it as a beautiful realization of what’s truly necessary for an LLM to function; everything else, he posits, is simply about efficiency.

What MicroGPT Encompasses

This compact script provides a complete, runnable example of a generative pre-trained transformer. Its components include:

Dataset: The raw text fuel for the model.
Tokenizer: Converts text into numerical tokens and vice-versa.
Autograd Engine: A custom-built mechanism for automatic differentiation, crucial for training.
GPT-2-like Neural Network Architecture: The core model structure.
Adam Optimizer: The algorithm used to update model parameters during training.
Training Loop: Orchestrates the learning process.
Inference Loop: Generates new text based on the trained model.

Let’s delve into how MicroGPT achieves this astonishing feat of simplification.

The Dataset: Fueling the Model

LLMs learn from vast quantities of text data. While production models might use entire web pages, MicroGPT opts for a simpler, more focused dataset: a collection of approximately 32,000 names, one per line. The model's objective is to learn the statistical patterns within these names—like common letter sequences or structures—and then generate new, plausible-sounding names that adhere to these learned patterns.

For example, after training, MicroGPT can "hallucinate" names such as 'kamon', 'ann', 'karai', or 'jaire', demonstrating its grasp of the input distribution. This exercise beautifully illustrates the core function of an LLM: given a starting point (a prompt), it statistically completes a "document" (the response) in a way that aligns with its training data.

```python
import os
import random

# Let there be an input dataset `docs`: list[str] of documents (e.g. a dataset of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
# One document per line; blank lines are skipped.
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]  # list[str] of documents
random.shuffle(docs)
print(f"num docs: {len(docs)}")
```

The Tokenizer: Bridging Text and Numbers

Neural networks operate on numbers, not raw characters. A tokenizer is the bridge, converting text into sequences of integer token IDs. While production tokenizers such as OpenAI's tiktoken operate on multi-character chunks for efficiency, MicroGPT uses the simplest approach: assigning a unique integer to each unique character found in the dataset.

In this case, the unique characters are the 26 lowercase English letters (a-z). Each character gets an ID corresponding to its index in sorted order. These integer values are arbitrary; they merely label distinct symbols. MicroGPT also introduces a special BOS (Beginning of Sequence) token. This token acts as a delimiter, signaling both the start and the end of a document (e.g., a name), which teaches the model when to begin and when to conclude a generation. With 26 letters and one BOS token, the vocabulary size is 27.
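The character-level scheme described above can be sketched in a few lines. This is an illustrative sketch rather than the script's own code: the three-name dataset, the helper names `stoi`, `encode`, and `decode`, and the choice of giving BOS the next free ID are all assumptions made for the example.

```python
# Tiny stand-in for the names dataset (the real file has roughly 32,000 entries).
docs = ["emma", "olivia", "ava"]

# Each unique character gets the index of its position in sorted order.
uchars = sorted(set("".join(docs)))
stoi = {ch: i for i, ch in enumerate(uchars)}   # hypothetical helper name
BOS = len(uchars)                # assumption: BOS takes the next free ID
vocab_size = len(uchars) + 1     # characters plus the BOS token

def encode(doc):
    """Wrap a document in BOS tokens and map each character to its ID."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    """Map IDs back to characters, dropping the BOS delimiters."""
    return "".join(uchars[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(tokens)          # [7, 1, 4, 4, 0, 7]
print(decode(tokens))  # emma
```

With the full names dataset the same logic would yield 26 character IDs plus BOS, matching the vocabulary size of 27 quoted above.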

Autograd: The Engine of Learning

Training neural networks fundamentally relies on gradients: knowing how much and in which direction to adjust each model parameter to reduce the prediction error (loss). MicroGPT implements its own autograd engine from scratch using a single Value class, mirroring the functionality of libraries like PyTorch but on scalar numbers rather than tensors.

Each Value object wraps a single scalar (data) and tracks its computation history. When mathematical operations (e.g., add, multiply) are performed on Value objects, the result is a new Value that records its _children (the inputs) and _local_grads (the derivatives of the operation's output with respect to each input). For instance, in a * b, the local gradient with respect to a is b, and vice versa.

```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # scalar computed in the forward pass
        self.grad = 0                    # d(loss)/d(this node), filled in by backward()
        self._children = children        # input Values of the operation that produced this node
        self._local_grads = local_grads  # derivative of this node w.r.t. each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    # ... (other operations like __mul__, __pow__, log, exp, relu)

    def backward(self):
        # Build a topological ordering of the graph that ends at this node.
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # Walk the graph in reverse order, applying the chain rule at each node.
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
```

The backward() method is where the magic happens. It traverses the computation graph in reverse topological order, starting from the final loss node (initialized with grad=1). At each step, it applies the chain rule from calculus: if a value v has a child c and a local derivative ∂v/∂c, then the gradient accumulated at c is updated by ∂v/∂c * ∂L/∂v (where ∂L/∂v is the gradient of the loss L with respect to v). Gradients from multiple paths are summed (via +=), accurately reflecting how a single parameter can influence the loss through various computations.

This process provides each Value with a grad attribute, indicating how the final loss changes if that specific value is nudged. For example, if L = a * b + a, with a=2 and b=3, L.backward() would yield a.grad = 4.0 and b.grad = 2.0. This means increasing a by 0.001 would increase L by approximately 0.004, and increasing b by 0.001 would increase L by 0.002. These gradients are then used by an optimizer to iteratively adjust the model's parameters.
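The worked example above (L = a * b + a with a=2 and b=3) can be checked directly. The sketch below repeats the Value class so that it runs on its own; the `__mul__` method is written here from the local-gradient rule stated earlier (the derivative of a * b with respect to a is b) and may differ in detail from the one in the script.

```python
class Value:
    """Scalar autograd node with just enough operations for the example."""
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b and d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

a, b = Value(2.0), Value(3.0)
L = a * b + a
L.backward()
print(L.data, a.grad, b.grad)  # 8.0 4.0 2.0
```

Note that a receives gradient along two paths (through the product and through the direct addition), and the `+=` in backward() is what adds the two contributions (3 and 1) together.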

Parameters: The Model's Knowledge

Model parameters are the learned weights that define the network's behavior. In MicroGPT, these are floating-point numbers, initially randomized (from a Gaussian distribution) and stored in a state_dict (similar to PyTorch). The parameters are organized into matrices for token embeddings (wte), position embeddings (wpe), the attention projections (attn_wq, attn_wk, attn_wv, attn_wo), the Multi-Layer Perceptron (MLP) layers (mlp_fc1, mlp_fc2), and the output projection (lm_head).

For this tiny model, there are 4,192 parameters. This is a stark contrast to modern LLMs, which can boast hundreds of billions of parameters, highlighting MicroGPT's focus on conceptual clarity over scale.
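The parameter count can be reproduced with simple arithmetic over the matrix shapes. The hyperparameters below (an embedding width of 16, a context length of 16, one layer, a 4x MLP expansion, and no bias or normalization parameters) are assumptions not stated in the text above; they are chosen because they happen to reproduce the quoted total.

```python
# Assumed hyperparameters (not stated in the article text).
vocab_size = 27   # 26 letters + BOS, as described in the tokenizer section
n_embd = 16       # assumption: embedding width
block_size = 16   # assumption: maximum context length
n_layer = 1       # assumption: number of transformer layers

# (rows, cols) of every weight matrix; no biases are assumed.
shapes = {
    "wte": (vocab_size, n_embd),
    "wpe": (block_size, n_embd),
    "lm_head": (vocab_size, n_embd),
}
for i in range(n_layer):
    for name in ("attn_wq", "attn_wk", "attn_wv", "attn_wo"):
        shapes[f"layer{i}.{name}"] = (n_embd, n_embd)
    shapes[f"layer{i}.mlp_fc1"] = (4 * n_embd, n_embd)   # assumption: 4x expansion
    shapes[f"layer{i}.mlp_fc2"] = (n_embd, 4 * n_embd)

num_params = sum(rows * cols for rows, cols in shapes.values())
print(num_params)  # 4192
```

Under these assumptions the two MLP matrices account for 2,048 of the 4,192 parameters, the attention projections for 1,024, and the embedding and output tables for the remaining 1,120.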

Architecture: A Simplified GPT-2

The core of MicroGPT is its neural network architecture, a simplified version of GPT-2. It processes one token at a time, considering its position and the context from previous tokens via a KV (Key-Value) cache. The architecture leverages three helper functions:

linear(x, w): Performs a matrix-vector multiplication, a fundamental linear transformation.
softmax(logits): Converts raw scores (logits) into a probability distribution over the vocabulary.
rmsnorm(x): Root Mean Square Normalization, stabilizing activations by rescaling vectors to have unit root-mean-square. It's a simpler alternative to LayerNorm.
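The three helpers can be sketched on plain Python floats. In the script they operate on Value objects so that gradients can flow through them; the versions below are simplified for illustration, and the `eps` value and the max-subtraction in softmax are assumptions of this sketch.

```python
import math

def linear(x, w):
    """Matrix-vector product: each row of w is dotted with the vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x, eps=1e-5):
    """Rescale x so that its root-mean-square is approximately 1."""
    mean_square = sum(xi * xi for xi in x) / len(x)
    scale = (mean_square + eps) ** -0.5
    return [xi * scale for xi in x]

y = linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(y)                      # [1.0, 2.0, 3.0]
probs = softmax(y)            # three probabilities that sum to 1
normed = rmsnorm([3.0, 4.0])  # same direction as the input, unit RMS
```

Unlike LayerNorm, rmsnorm does not subtract the mean and, in this sketch, has no learned scale or shift, which is what makes it the simpler of the two.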

The gpt function combines these elements:

Embeddings: The token_id and pos_id are looked up in their respective embedding tables (wte, wpe). These vectors are summed, creating a joint representation that encodes both the token's identity and its sequence position.
Attention Block: This is where the model determines the relevance of past tokens. The current token is transformed into a query (Q), key (K), and value (V). The keys and values of previous tokens are stored in the KV cache. Each "attention head" calculates dot products between its query and all cached keys, scales them, applies softmax to get attention weights, and then takes a weighted sum of the cached values. This mechanism allows the model to selectively "pay attention" to relevant parts of the input sequence. The outputs from multiple heads are concatenated and projected through attn_wo.
MLP Block: A simple feed-forward neural network that further processes the information from the attention block. It consists of two linear layers (mlp_fc1, mlp_fc2) separated by a ReLU activation function. Residual connections are used throughout to aid gradient flow.
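The attention step for a single head can be sketched as follows. This is a simplified illustration on plain floats: the query, keys, and values are given directly rather than produced by the attn_wq, attn_wk, and attn_wv projections, and the function name `attend` and the toy cache contents are hypothetical.

```python
import math

def attend(q, keys, values):
    """Single-head attention for one query over all cached keys and values."""
    head_dim = len(q)
    # Scaled dot product between the query and every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim) for k in keys]
    # Softmax over the scores gives the attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted sum of the cached values.
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return out, weights

# Hypothetical KV cache: two earlier tokens plus the current one.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
q = [1.0, 1.0]

out, weights = attend(q, keys, values)
```

Because the cache only ever contains the current and earlier positions, each token can attend to the past but never to the future, with no explicit mask needed in this token-at-a-time formulation.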

Finally, the processed vector is passed through a lm_head linear layer to produce logits—raw scores for each possible next token in the vocabulary. These logits are then fed to the softmax function to yield probabilities.
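To show how those probabilities drive generation, here is a minimal sampling loop. The trained model is replaced by a hypothetical stand-in, `toy_logits`, which simply favors vowels and makes BOS more likely at later positions; the function names, the `max_len` limit, and the BOS ID are assumptions of this sketch.

```python
import math
import random

random.seed(0)
uchars = list("abcdefghijklmnopqrstuvwxyz")
BOS = len(uchars)  # assumption: BOS is the last ID, giving 27 tokens in total

def toy_logits(token_id, pos_id):
    """Hypothetical stand-in for the trained model: favors vowels, then BOS."""
    logits = [0.0] * (len(uchars) + 1)
    for ch in "aeiou":
        logits[uchars.index(ch)] = 1.0
    logits[BOS] = 0.5 * pos_id  # ending becomes more likely as the name grows
    return logits

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_name(max_len=16):
    """Start from BOS and sample one token at a time until BOS is drawn again."""
    token_id, chars = BOS, []
    for pos_id in range(max_len):
        probs = softmax(toy_logits(token_id, pos_id))
        token_id = random.choices(range(len(probs)), weights=probs)[0]
        if token_id == BOS:  # BOS doubles as the end-of-document signal
            break
        chars.append(uchars[token_id])
    return "".join(chars)

name = sample_name()
print(name)
```

Swapping `toy_logits` for a call to a trained model is what would turn this random letter generator into something that produces plausible names.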

Practical Takeaways

MicroGPT is a pedagogical masterpiece. It strips away the distributed computing, optimized kernels, and complex data pipelines common in production LLMs, revealing the core algorithms. For developers, it offers an unparalleled opportunity to:

Understand from First Principles: Witness how autograd, a character-level tokenizer, and a simplified transformer architecture come together in a functional LLM.
Demystify Complexity: Realize that even colossal models are built upon these fundamental, albeit scaled-up, building blocks.
Appreciate Efficiency: Because scalar-based autograd and tensor-based PyTorch run the same algorithm, the performance gap between them shows why production systems depend on optimized libraries and specialized hardware.

While not intended for production use, MicroGPT is an invaluable resource for anyone seeking a deep, hands-on understanding of what makes LLMs tick.

FAQ

Q: How does MicroGPT's custom autograd engine differ algorithmically from PyTorch's backward()?

A: Algorithmically, MicroGPT's Value class and its backward() method are identical to PyTorch's automatic differentiation. Both rely on constructing a computation graph and applying the chain rule in reverse topological order. The primary difference is in implementation: MicroGPT's Value objects handle single scalar numbers and their operations, whereas PyTorch's tensors operate on arrays of numbers, leveraging highly optimized C++ backends for vastly superior efficiency.

Q: What is the significance of the BOS (Beginning of Sequence) token in MicroGPT's tokenizer?

A: The BOS token serves as a crucial delimiter, marking the start and end of each document (name) in the dataset. By wrapping each name with BOS tokens (e.g., [BOS, e, m, m, a, BOS]), the model learns that BOS initiates a new sequence and signifies its conclusion. This helps the model generate complete, coherent names rather than just an endless stream of characters, as it learns the statistical likelihood of BOS appearing at certain points.

Q: Why does MicroGPT use RMSNorm instead of the LayerNorm found in the original GPT-2?

A: MicroGPT uses RMSNorm (Root Mean Square Normalization) primarily for simplification. RMSNorm is a less complex variant of LayerNorm, achieving a similar goal of stabilizing neural network activations by rescaling a vector so its values have a unit root-mean-square. It helps prevent activations from exploding or vanishing during training, contributing to a more stable learning process, while being simpler to implement than LayerNorm.
