Building Nanograd: A Deep Dive into Automatic Differentiation
From First Principles to a Minimal Autograd Engine

1. Introduction & Motivation

Nanograd is a minimal automatic differentiation (autograd) engine that computes gradients for scalar-valued functions. Modeled on Andrej Karpathy's micrograd, it demonstrates the core mechanics behind frameworks like PyTorch and TensorFlow. This guide constructs Nanograd from scratch, explaining:

  1. Automatic Differentiation (AD): How gradients are computed via backpropagation.
  2. Computational Graphs: Representing mathematical operations as directed acyclic graphs (DAGs).
  3. Dynamic Graph Construction: Tracking operations during forward passes.

2. Foundations of Automatic Differentiation

2.1 What is a Gradient?

For a function \( f: \mathbb{R}^n \rightarrow \mathbb{R} \), the gradient \( \nabla f \) is:
\[ \nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n} \right]^T \]
Gradients indicate the direction of steepest ascent in \( f \).
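
As a quick worked example (chosen for illustration, not taken from the Nanograd code), consider \( f(x, y) = x^2 y \):
\[ \nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right]^T = \left[ 2xy, \; x^2 \right]^T \]
At \( (x, y) = (1, 3) \) this evaluates to \( [6, 1]^T \): increasing \( x \) raises \( f \) six times faster than increasing \( y \).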

2.2 Forward vs. Reverse Mode AD

  • Forward Mode: Propagates derivatives from inputs to outputs, computing one directional derivative per pass.
    • Cost: one pass per input, so \( O(n) \) passes for the full gradient of \( f: \mathbb{R}^n \rightarrow \mathbb{R}^m \).
  • Reverse Mode (Backpropagation): Propagates gradients from outputs back to inputs.
    • Cost: one pass per output, so \( O(m) \) passes. For a scalar loss (\( m = 1 \)), a single backward pass yields all \( n \) input gradients, which is why deep learning favors reverse mode.

2.3 Chain Rule in Backpropagation

For a composite function \( f(g(x)) \), the chain rule states:
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]
Reverse-mode AD recursively applies this rule through the computational graph.
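
For example, for the composition \( f(x) = \tanh(wx) \) with \( g(x) = wx \) (the kind of expression Nanograd builds later), the rule gives:
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} = \left(1 - \tanh^2(wx)\right) \cdot w \]
Reverse-mode AD evaluates exactly these local factors at each node and multiplies them while walking backward through the graph.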


3. Core Components of Nanograd

3.1 The Value Class

A Value object wraps a scalar and tracks its computational history:

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data          # Scalar value
        self.grad = 0.0           # Gradient (initially 0)
        self._backward = lambda: None  # Backward function
        self._prev = set(_children)    # Child nodes
        self._op = _op            # Operation label (for debugging)

3.2 Computational Graph Representation

Each Value object is a node in a DAG. For an expression \( a + b \), the graph is:

    +
   / \
  a   b
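
A minimal sketch of how this graph looks from code, assuming the __add__ operator from Section 4.1 is already in place:

a = Value(2.0)
b = Value(3.0)
c = a + b                           # builds the '+' node

print(c.data)                       # 5.0
print(c._op)                        # '+'
print(a in c._prev, b in c._prev)   # True True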

3.3 Gradient Initialization

  • All grad attributes are initialized to 0.
  • The output node’s gradient is set to 1.0 before backpropagation (since \( \frac{df}{df} = 1 \)).

4. Mathematical Operations & Their Gradients

4.1 Addition

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')
    
    def _backward():
        self.grad += 1.0 * out.grad  # ∂(a+b)/∂a = 1
        other.grad += 1.0 * out.grad # ∂(a+b)/∂b = 1
    out._backward = _backward
    return out

Gradient Derivation:

\[ \frac{\partial (a + b)}{\partial a} = 1, \quad \frac{\partial (a + b)}{\partial b} = 1 \]
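
A small, hand-driven sketch of this rule in isolation; the output gradient is seeded manually here because the full backward() traversal only arrives in Section 5:

a = Value(2.0)
b = Value(3.0)
c = a + b

c.grad = 1.0           # seed ∂c/∂c = 1 by hand
c._backward()          # run the closure created in __add__

print(a.grad, b.grad)  # 1.0 1.0 — both inputs receive the upstream gradient unchanged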

4.2 Multiplication

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')
    
    def _backward():
        self.grad += other.data * out.grad  # ∂(a*b)/∂a = b
        other.grad += self.data * out.grad  # ∂(a*b)/∂b = a
    out._backward = _backward
    return out

Gradient Derivation:

\[ \frac{\partial (a \cdot b)}{\partial a} = b, \quad \frac{\partial (a \cdot b)}{\partial b} = a \]

4.3 Activation Functions (Tanh)

def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh')
    
    def _backward():
        self.grad += (1 - t**2) * out.grad  # ∂tanh(x)/∂x = 1 - tanh²(x)
    out._backward = _backward
    return out

Gradient Derivation:

\[ \frac{\partial \tanh(x)}{\partial x} = 1 - \tanh^2(x) \]
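
As a sanity check, the analytic gradient can be compared against a finite-difference estimate. This is a standard gradient-checking sketch layered on top of the engine, not part of Nanograd itself:

import math

x = Value(0.7)
y = x.tanh()
y.grad = 1.0
y._backward()                      # analytic gradient via the rule above

h = 1e-6                           # finite-difference step (an arbitrary choice)
numeric = (math.tanh(0.7 + h) - math.tanh(0.7 - h)) / (2 * h)

print(x.grad)   # 1 - tanh²(0.7) ≈ 0.6347
print(numeric)  # ≈ 0.6347, agreeing to several decimal places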


5. Backpropagation Algorithm

5.1 Topological Sorting

To apply the chain rule correctly, nodes must be processed in reverse topological order:

  1. Forward Pass: Build the computational graph.
  2. Backward Pass:
    • Initialize the output node's gradient to 1.0.
    • Sort nodes topologically, so each node appears after every node it depends on (its _prev children).
    • Apply _backward() in reverse order, starting from the output and ending at the inputs.

5.2 Implementing backward()

def backward(self):
    # Build a topological ordering: every node is appended only after
    # all of the nodes it depends on (its _prev children)
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # Seed the output gradient, then apply the chain rule node by node,
    # walking from the output back toward the inputs
    self.grad = 1.0
    for v in reversed(topo):
        v._backward()

6. From Scratch Implementation

6.1 Full Value Class

import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        ...  # as defined in Section 4.1

    def __mul__(self, other):
        ...  # as defined in Section 4.2

    def tanh(self):
        ...  # as defined in Section 4.3

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

6.2 Testing the Engine

Example 1: Simple Expression

a = Value(2.0)
b = Value(3.0)
c = a * b + a.tanh()
c.backward()
print(a.grad)  # ∂c/∂a = b + (1 - tanh²(a)) ≈ 3 + (1 - 0.929) ≈ 3.071

Example 2: Neural Network Neuron

# Inputs
x1 = Value(0.5)
x2 = Value(-1.0)
w1 = Value(1.0)
w2 = Value(1.0)
b = Value(0.1)

# Forward pass
out = (x1 * w1 + x2 * w2 + b).tanh()
out.backward()

print(w1.grad)  # ∂out/∂w1 = (1 - tanh²(z)) * x1 ≈ 0.856 * 0.5 ≈ 0.428

7. Advanced Topics

7.1 Efficiency Considerations

  • Scalar Limitation: Nanograd handles scalars only; real frameworks use tensors so that many operations can be vectorized.
  • Interpreter Overhead: PyTorch dispatches its operator overloads to C++/CUDA kernels; a pure-Python, scalar-at-a-time engine like Nanograd is orders of magnitude slower.

7.2 Extending Nanograd

  1. More Operations: Add exp, log, or ReLU (a ReLU sketch follows this list).
  2. GPU Support: Implement tensor operations with CUDA.
  3. Gradient Checkpointing: Reduce memory usage by recomputing intermediates.
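
As a concrete sketch of the first item, here is a possible relu method for Value, following the same pattern as tanh; this is an illustrative extension, not part of the original engine:

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'ReLU')

        def _backward():
            # ∂ReLU(x)/∂x = 1 for x > 0, else 0 (taken as 0 at x = 0 here)
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out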

7.3 Comparison with PyTorch Autograd
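
At the interface level the two are deliberately close. As a minimal sketch (assuming PyTorch is installed), the neuron from Section 6.2 looks like this with PyTorch's autograd:

import torch

x1 = torch.tensor(0.5)
x2 = torch.tensor(-1.0)
w1 = torch.tensor(1.0, requires_grad=True)
w2 = torch.tensor(1.0, requires_grad=True)
b  = torch.tensor(0.1, requires_grad=True)

out = torch.tanh(x1 * w1 + x2 * w2 + b)
out.backward()

print(w1.grad.item())  # ≈ 0.428, matching the Nanograd result

Both engines build the graph dynamically during the forward pass and expose a .grad attribute plus a backward() call. The main differences are scale and speed: PyTorch operates on multi-dimensional tensors, dispatches to C++/CUDA kernels, and supports GPUs, while Nanograd wraps one Python float per node.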


8. Mathematical Proofs

8.1 Chain Rule for Composite Functions

For \( f(g(x)) \), the derivative is:
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]
Proof:
By definition,
\[ \frac{df}{dx} = \lim_{h \to 0} \frac{f(g(x+h)) - f(g(x))}{h} \]
Multiply and divide by \( g(x+h) - g(x) \) (assuming \( g(x+h) \neq g(x) \) for small \( h \); a fully rigorous proof treats the case where this difference vanishes separately):
\[ \frac{df}{dx} = \lim_{h \to 0} \left[ \frac{f(g(x+h)) - f(g(x))}{g(x+h)-g(x)} \cdot \frac{g(x+h)-g(x)}{h} \right] = \frac{df}{dg} \cdot \frac{dg}{dx} \]


9. Applications

9.1 Training a Neural Network

import random

# A single tanh neuron: the building block of a small network
class Neuron:
    def __init__(self, n_inputs):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # Weighted sum of inputs plus bias, passed through tanh
        act = sum((xi * wi for xi, wi in zip(x, self.w)), self.b)
        return act.tanh()

# Loss function: Mean Squared Error (MSE); uses the __sub__, __pow__, and
# __radd__ operators sketched in Section 6.1
def loss(y_pred, y_true):
    errors = sum((yp - yt)**2 for yp, yt in zip(y_pred, y_true))
    return errors * (1.0 / len(y_true))

9.2 Optimizers (SGD)

def sgd(params, lr=0.01):
    for p in params:
        p.data -= lr * p.grad
        p.grad = 0.0  # Reset gradients
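
A minimal end-to-end training sketch that ties Neuron, loss, and sgd together; the toy data, learning rate, and step count below are arbitrary choices for illustration, and the loss relies on the operator sketch from Section 6.1:

# Toy dataset: two 2-dimensional inputs and their targets
xs = [[Value(0.5), Value(-1.0)],
      [Value(-0.3), Value(0.8)]]
ys = [Value(1.0), Value(-1.0)]

n = Neuron(2)
params = n.w + [n.b]

for step in range(20):
    y_pred = [n(x) for x in xs]      # forward pass (builds a fresh graph)
    total = loss(y_pred, ys)         # MSE over the toy dataset
    total.backward()                 # reverse-mode AD through the graph
    sgd(params, lr=0.05)             # update weights and reset gradients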

10. Limitations & Future Work

10.1 Nanograd Limitations

  • Scalar-Only: No support for tensors or batches.
  • No GPU Acceleration: Pure Python implementation.
  • Basic Ops: Limited to +, *, and tanh.

10.2 Extensions

  1. Tensor Support: Add multi-dimensional arrays.
  2. Autograd for CNNs: Implement convolution and pooling.
  3. Just-In-Time (JIT) Compilation: Optimize with Numba or C++.

11. Conclusion

Nanograd demystifies automatic differentiation by implementing reverse-mode AD in roughly 100 lines of Python. While minimal, it captures the essence of modern autograd systems. Extending it with tensor operations, GPU support, and more layer types is the path toward a full-featured deep learning framework.


References

  1. Karpathy, A. (2020). Micrograd: A tiny scalar-valued autograd engine.
  2. Baydin, A. G. et al. (2018). Automatic Differentiation in Machine Learning: A Survey.
  3. Paszke, A. et al. (2017). Automatic differentiation in PyTorch.

Appendices

  • A1: Full Nanograd Code with Comments.
  • A2: Derivation of Common Activation Gradients (ReLU, Sigmoid).
  • A3: Benchmarking Nanograd vs. PyTorch.
