Building Nanograd: A Deep Dive into Automatic Differentiation
From First Principles to a Minimal Autograd Engine

1. Introduction & Motivation

Nanograd is a minimal automatic differentiation (autograd) engine that computes gradients for scalar-valued functions. Modeled on Andrej Karpathy's micrograd, it demonstrates the core mechanics behind frameworks like PyTorch and TensorFlow. This guide constructs Nanograd from scratch, explaining:

  1. Automatic Differentiation (AD): How gradients are computed via backpropagation.
  2. Computational Graphs: Representing mathematical operations as directed acyclic graphs (DAGs).
  3. Dynamic Graph Construction: Tracking operations during forward passes.

2. Foundations of Automatic Differentiation

2.1 What is a Gradient?

For a function \( f: \mathbb{R}^n \rightarrow \mathbb{R} \), the gradient \( \nabla f \) is:
\[ \nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n} \right]^T \]
Gradients indicate the direction of steepest ascent in \( f \).
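
As a quick worked example (chosen for illustration, not taken from the Nanograd code), consider \( f(x, y) = x^2 y \):
\[ \nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right]^T = \left[ 2xy, \; x^2 \right]^T \]
At \( (x, y) = (1, 3) \) this evaluates to \( [6, 1]^T \): increasing \( x \) raises \( f \) six times faster than increasing \( y \).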

2.2 Forward vs. Reverse Mode AD

  • Forward Mode: Propagates derivatives from inputs to outputs, computing one directional derivative per pass.
    • Cost: one pass per input, so \( O(n) \) passes for the full gradient of \( f: \mathbb{R}^n \rightarrow \mathbb{R}^m \).
  • Reverse Mode (Backpropagation): Propagates gradients from outputs back to inputs.
    • Cost: one pass per output, so \( O(m) \) passes. For a scalar loss (\( m = 1 \)), a single backward pass yields all \( n \) input gradients, which is why deep learning favors reverse mode.

2.3 Chain Rule in Backpropagation

For a composite function \( f(g(x)) \), the chain rule states:
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]
Reverse-mode AD recursively applies this rule through the computational graph.
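
For example, for the composition \( f(x) = \tanh(wx) \) with \( g(x) = wx \) (the kind of expression Nanograd builds later), the rule gives:
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} = \left(1 - \tanh^2(wx)\right) \cdot w \]
Reverse-mode AD evaluates exactly these local factors at each node and multiplies them while walking backward through the graph.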


3. Core Components of Nanograd

3.1 The Value Class

A Value object wraps a scalar and tracks its computational history:

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data          # Scalar value
        self.grad = 0.0           # Gradient (initially 0)
        self._backward = lambda: None  # Backward function
        self._prev = set(_children)    # Child nodes
        self._op = _op            # Operation label (for debugging)

3.2 Computational Graph Representation

Each Value object is a node in a DAG. For an expression \( a + b \), the graph is:

    +
   / \
  a   b
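
A minimal sketch of how this graph looks from code, assuming the __add__ operator from Section 4.1 is already in place:

a = Value(2.0)
b = Value(3.0)
c = a + b                           # builds the '+' node

print(c.data)                       # 5.0
print(c._op)                        # '+'
print(a in c._prev, b in c._prev)   # True True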

3.3 Gradient Initialization

  • All grad attributes are initialized to 0.
  • The output node’s gradient is set to 1.0 before backpropagation (since \( \frac{df}{df} = 1 \)).

4. Mathematical Operations & Their Gradients

4.1 Addition

def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data + other.data, (self, other), '+')
    
    def _backward():
        self.grad += 1.0 * out.grad  # ∂(a+b)/∂a = 1
        other.grad += 1.0 * out.grad # ∂(a+b)/∂b = 1
    out._backward = _backward
    return out

Gradient Derivation:

\[ \frac{\partial (a + b)}{\partial a} = 1, \quad \frac{\partial (a + b)}{\partial b} = 1 \]
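
A small, hand-driven sketch of this rule in isolation; the output gradient is seeded manually here because the full backward() traversal only arrives in Section 5:

a = Value(2.0)
b = Value(3.0)
c = a + b

c.grad = 1.0           # seed ∂c/∂c = 1 by hand
c._backward()          # run the closure created in __add__

print(a.grad, b.grad)  # 1.0 1.0 — both inputs receive the upstream gradient unchanged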

4.2 Multiplication

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')
    
    def _backward():
        self.grad += other.data * out.grad  # ∂(a*b)/∂a = b
        other.grad += self.data * out.grad  # ∂(a*b)/∂b = a
    out._backward = _backward
    return out

Gradient Derivation:

\[ \frac{\partial (a \cdot b)}{\partial a} = b, \quad \frac{\partial (a \cdot b)}{\partial b} = a \]

4.3 Activation Functions (Tanh)

def tanh(self):
    x = self.data
    t = (math.exp(2*x) - 1)/(math.exp(2*x) + 1)
    out = Value(t, (self,), 'tanh')
    
    def _backward():
        self.grad += (1 - t**2) * out.grad  # ∂tanh(x)/∂x = 1 - tanh²(x)
    out._backward = _backward
    return out

Gradient Derivation:

\[ \frac{\partial \tanh(x)}{\partial x} = 1 - \tanh^2(x) \]
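
As a sanity check, the analytic gradient can be compared against a finite-difference estimate. This is a standard gradient-checking sketch layered on top of the engine, not part of Nanograd itself:

import math

x = Value(0.7)
y = x.tanh()
y.grad = 1.0
y._backward()                      # analytic gradient via the rule above

h = 1e-6                           # finite-difference step (an arbitrary choice)
numeric = (math.tanh(0.7 + h) - math.tanh(0.7 - h)) / (2 * h)

print(x.grad)   # 1 - tanh²(0.7) ≈ 0.6347
print(numeric)  # ≈ 0.6347, agreeing to several decimal places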


5. Backpropagation Algorithm

5.1 Topological Sorting

To apply the chain rule correctly, nodes must be processed in reverse topological order:

  1. Forward Pass: Build the computational graph.
  2. Backward Pass:
    • Initialize the output node's gradient to 1.0.
    • Sort nodes topologically, so each node appears after every node it depends on (its _prev children).
    • Apply _backward() in reverse order, starting from the output and ending at the inputs.

5.2 Implementing backward()

def backward(self):
    # Build a topological ordering: every node is appended only after
    # all of the nodes it depends on (its _prev children)
    topo = []
    visited = set()

    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # Seed the output gradient, then apply the chain rule node by node,
    # walking from the output back toward the inputs
    self.grad = 1.0
    for v in reversed(topo):
        v._backward()

6. From Scratch Implementation

6.1 Full Value Class

import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        ...  # as defined in Section 4.1

    def __mul__(self, other):
        ...  # as defined in Section 4.2

    def tanh(self):
        ...  # as defined in Section 4.3

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

6.2 Testing the Engine

Example 1: Simple Expression

a = Value(2.0)
b = Value(3.0)
c = a * b + a.tanh()
c.backward()
print(a.grad)  # ∂c/∂a = b + (1 - tanh²(a)) ≈ 3 + (1 - 0.929) ≈ 3.071

Example 2: Neural Network Neuron

# Inputs
x1 = Value(0.5)
x2 = Value(-1.0)
w1 = Value(1.0)
w2 = Value(1.0)
b = Value(0.1)

# Forward pass
out = (x1 * w1 + x2 * w2 + b).tanh()
out.backward()

print(w1.grad)  # ∂out/∂w1 = (1 - tanh²(z)) * x1 ≈ 0.856 * 0.5 ≈ 0.428

7. Advanced Topics

7.1 Efficiency Considerations

  • Scalar Limitation: Nanograd handles scalars only; real frameworks use tensors so that many operations can be vectorized.
  • Interpreter Overhead: PyTorch dispatches its operator overloads to C++/CUDA kernels; a pure-Python, scalar-at-a-time engine like Nanograd is orders of magnitude slower.

7.2 Extending Nanograd

  1. More Operations: Add exp, log, or ReLU (a ReLU sketch follows this list).
  2. GPU Support: Implement tensor operations with CUDA.
  3. Gradient Checkpointing: Reduce memory usage by recomputing intermediates.
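
As a concrete sketch of the first item, here is a possible relu method for Value, following the same pattern as tanh; this is an illustrative extension, not part of the original engine:

    def relu(self):
        out = Value(max(0.0, self.data), (self,), 'ReLU')

        def _backward():
            # ∂ReLU(x)/∂x = 1 for x > 0, else 0 (taken as 0 at x = 0 here)
            self.grad += (1.0 if self.data > 0 else 0.0) * out.grad
        out._backward = _backward
        return out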

7.3 Comparison with PyTorch Autograd
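
At the interface level the two are deliberately close. As a minimal sketch (assuming PyTorch is installed), the neuron from Section 6.2 looks like this with PyTorch's autograd:

import torch

x1 = torch.tensor(0.5)
x2 = torch.tensor(-1.0)
w1 = torch.tensor(1.0, requires_grad=True)
w2 = torch.tensor(1.0, requires_grad=True)
b  = torch.tensor(0.1, requires_grad=True)

out = torch.tanh(x1 * w1 + x2 * w2 + b)
out.backward()

print(w1.grad.item())  # ≈ 0.428, matching the Nanograd result

Both engines build the graph dynamically during the forward pass and expose a .grad attribute plus a backward() call. The main differences are scale and speed: PyTorch operates on multi-dimensional tensors, dispatches to C++/CUDA kernels, and supports GPUs, while Nanograd wraps one Python float per node.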


8. Mathematical Proofs

8.1 Chain Rule for Composite Functions

For \( f(g(x)) \), the derivative is:
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]
Proof:
By definition,
\[ \frac{df}{dx} = \lim_{h \to 0} \frac{f(g(x+h)) - f(g(x))}{h} \]
Multiply and divide by \( g(x+h) - g(x) \) (assuming \( g(x+h) \neq g(x) \) for small \( h \); a fully rigorous proof treats the case where this difference vanishes separately):
\[ \frac{df}{dx} = \lim_{h \to 0} \left[ \frac{f(g(x+h)) - f(g(x))}{g(x+h)-g(x)} \cdot \frac{g(x+h)-g(x)}{h} \right] = \frac{df}{dg} \cdot \frac{dg}{dx} \]


9. Applications

9.1 Training a Neural Network

import random

# A single tanh neuron: the building block of a small network
class Neuron:
    def __init__(self, n_inputs):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_inputs)]
        self.b = Value(random.uniform(-1, 1))

    def __call__(self, x):
        # Weighted sum of inputs plus bias, passed through tanh
        act = sum((xi * wi for xi, wi in zip(x, self.w)), self.b)
        return act.tanh()

# Loss function: Mean Squared Error (MSE); uses the __sub__, __pow__, and
# __radd__ operators sketched in Section 6.1
def loss(y_pred, y_true):
    errors = sum((yp - yt)**2 for yp, yt in zip(y_pred, y_true))
    return errors * (1.0 / len(y_true))

9.2 Optimizers (SGD)

def sgd(params, lr=0.01):
    for p in params:
        p.data -= lr * p.grad
        p.grad = 0.0  # Reset gradients
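
A minimal end-to-end training sketch that ties Neuron, loss, and sgd together; the toy data, learning rate, and step count below are arbitrary choices for illustration, and the loss relies on the operator sketch from Section 6.1:

# Toy dataset: two 2-dimensional inputs and their targets
xs = [[Value(0.5), Value(-1.0)],
      [Value(-0.3), Value(0.8)]]
ys = [Value(1.0), Value(-1.0)]

n = Neuron(2)
params = n.w + [n.b]

for step in range(20):
    y_pred = [n(x) for x in xs]      # forward pass (builds a fresh graph)
    total = loss(y_pred, ys)         # MSE over the toy dataset
    total.backward()                 # reverse-mode AD through the graph
    sgd(params, lr=0.05)             # update weights and reset gradients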

10. Limitations & Future Work

10.1 Nanograd Limitations

  • Scalar-Only: No support for tensors or batches.
  • No GPU Acceleration: Pure Python implementation.
  • Basic Ops: Limited to +, *, and tanh.

10.2 Extensions

  1. Tensor Support: Add multi-dimensional arrays.
  2. Autograd for CNNs: Implement convolution and pooling.
  3. Just-In-Time (JIT) Compilation: Optimize with Numba or C++.

11. Conclusion

Nanograd demystifies automatic differentiation by implementing reverse-mode AD in roughly 100 lines of Python. While minimal, it captures the essence of modern autograd systems. Extending it with tensor operations, GPU support, and more layer types is the path toward a full-featured deep learning framework.


References

  1. Karpathy, A. (2020). Micrograd: A tiny scalar-valued autograd engine.
  2. Baydin, A. G. et al. (2018). Automatic Differentiation in Machine Learning: A Survey.
  3. Paszke, A. et al. (2017). Automatic differentiation in PyTorch.

Appendices

  • A1: Full Nanograd Code with Comments.
  • A2: Derivation of Common Activation Gradients (ReLU, Sigmoid).
  • A3: Benchmarking Nanograd vs. PyTorch.
