Wrapping Your Head Around Backpropagation

Backpropagation is one of the most important concepts in modern machine learning, especially for training neural networks. In this guide, we’ll break it down step by step, demystify the math, and implement a basic version in Python so you can see it in action.
What is Backpropagation?
Backpropagation is the algorithm used to update the weights of a neural network by propagating the error backward from the output layer to the input layer. The goal is to minimize a loss function (a measure of how far off the model's predictions are from the true values) by computing the gradient of the loss with respect to each weight and adjusting each weight in the direction opposite its gradient.
Think of it as the chain rule of calculus applied repeatedly across all layers of the network.
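Before going further, here’s a tiny standalone illustration of the chain rule at work (the numbers and variable names are made up for illustration and are separate from the network built below). For a composed function like sigmoid(w * x), the chain rule says the derivative with respect to w is sigmoid'(w * x) * x, and a finite-difference estimate confirms it:
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, w = 0.5, 1.2  # illustrative scalars
z = w * x
analytic = sigmoid(z) * (1 - sigmoid(z)) * x  # chain rule: d sigmoid(w*x) / dw
eps = 1e-6
numeric = (sigmoid((w + eps) * x) - sigmoid((w - eps) * x)) / (2 * eps)
print(analytic, numeric)  # the two should agree to several decimal places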
The Building Blocks
To understand backpropagation, you need a grasp of:
- Forward pass: Computing the output from inputs through the network
- Loss function: Measuring the error of the prediction
- Backward pass: Using gradients to update weights
Let’s go through each of these.
1. The Forward Pass
In a basic neural network with one hidden layer:
Input → [Linear Transformation] → Hidden Layer → [Activation] → Output → Loss
For a single data point, forward computation might look like:
import numpy as np
# Sample input and true output
x = np.array([[0.5], [0.1]]) # 2x1 input
y_true = np.array([[1.0]]) # 1x1 true output
# Initialize weights and biases
W1 = np.random.randn(3, 2) # hidden layer weights (3 neurons, 2 inputs)
b1 = np.random.randn(3, 1)
W2 = np.random.randn(1, 3) # output layer weights (1 output, 3 hidden neurons)
b2 = np.random.randn(1, 1)
# Activation function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y_pred = sigmoid(z2)
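A quick sanity check on the dimensions: with a 2x1 input, a 3-neuron hidden layer, and a single output, the intermediate arrays should have these shapes (the values themselves will differ on every run because the weights are random):
print(z1.shape, a1.shape)      # (3, 1) (3, 1)
print(z2.shape, y_pred.shape)  # (1, 1) (1, 1)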
2. The Loss Function
We’ll use Mean Squared Error (MSE) for simplicity:
def mse_loss(y_pred, y_true):
    return 0.5 * np.square(y_pred - y_true).sum()
loss = mse_loss(y_pred, y_true)
The 0.5 is a convenience: it cancels the factor of 2 that comes from differentiating the square, so the gradient of the loss with respect to the prediction is simply y_pred - y_true.
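As a quick worked example (the 0.8 and 1.0 below are made-up numbers, not outputs of the network above), differentiating 0.5 * (y_pred - y_true)**2 with respect to y_pred gives exactly y_pred - y_true, with no stray factor of 2:
y_p, y_t = 0.8, 1.0                    # hypothetical prediction and target
loss_example = 0.5 * (y_p - y_t) ** 2  # 0.5 * 0.04 = 0.02
grad_example = y_p - y_t               # derivative of the loss: -0.2
print(loss_example, grad_example)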
3. The Backward Pass (Backpropagation)
Now we compute gradients and propagate them backward:
# Derivative of sigmoid
def sigmoid_deriv(z):
    return sigmoid(z) * (1 - sigmoid(z))
# Output layer gradients
dL_dy_pred = y_pred - y_true # dL/dŷ
dy_pred_dz2 = sigmoid_deriv(z2) # dŷ/dz2
dz2_dW2 = a1.T # dz2/dW2
dz2_da1 = W2.T # dz2/da1
# Gradients for W2 and b2
dL_dz2 = dL_dy_pred * dy_pred_dz2 # dL/dz2
dL_dW2 = dL_dz2 @ dz2_dW2 # dL/dW2
dL_db2 = dL_dz2 # dL/db2
# Hidden layer gradients
da1_dz1 = sigmoid_deriv(z1)
dL_dz1 = dz2_da1 @ dL_dz2 * da1_dz1 # dL/dz1
dL_dW1 = dL_dz1 @ x.T # dL/dW1
dL_db1 = dL_dz1 # dL/db1
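A good habit when deriving gradients by hand is to compare them against a finite-difference estimate. Here is a minimal check of a single entry of dL_dW2; the helper function and names below are just for this check, not part of the network code:
# Perturb one weight, rerun the forward pass, and compare the slope of the
# loss to the analytic gradient computed above.
def loss_given_W2(W2_candidate):
    a1_ = sigmoid(W1 @ x + b1)
    y_ = sigmoid(W2_candidate @ a1_ + b2)
    return mse_loss(y_, y_true)

eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += eps
W2_minus[0, 0] -= eps
numeric = (loss_given_W2(W2_plus) - loss_given_W2(W2_minus)) / (2 * eps)
print(dL_dW2[0, 0], numeric)  # the two should agree to several decimal places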
4. Updating the Weights
Using a learning rate to adjust the weights:
lr = 0.1
W2 -= lr * dL_dW2
b2 -= lr * dL_db2
W1 -= lr * dL_dW1
b1 -= lr * dL_db1
You’ve just performed one step of training using backpropagation!
Putting It Together in a Training Loop
Here’s a full training loop to see this in action:
for epoch in range(1000):
    # Forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y_pred = sigmoid(z2)
    loss = mse_loss(y_pred, y_true)

    # Backward pass
    dL_dy_pred = y_pred - y_true
    dy_pred_dz2 = sigmoid_deriv(z2)
    dL_dz2 = dL_dy_pred * dy_pred_dz2
    dL_dW2 = dL_dz2 @ a1.T
    dL_db2 = dL_dz2
    dL_dz1 = W2.T @ dL_dz2 * sigmoid_deriv(z1)
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1

    # Update weights
    W2 -= lr * dL_dW2
    b2 -= lr * dL_db2
    W1 -= lr * dL_dW1
    b1 -= lr * dL_db1

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
Wrapping Up
Backpropagation isn't magic — it’s just calculus, matrix multiplication, and bookkeeping. Once you understand the chain rule and how to compute gradients layer by layer, the whole thing becomes a mechanical process. Most libraries like PyTorch and TensorFlow handle it automatically with autograd, but knowing how it works helps you debug, optimize, and truly understand what your model is doing.
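For comparison, here is a minimal sketch of the same forward pass in PyTorch (assuming PyTorch is installed; the tensor names simply mirror the NumPy version above). A single call to backward() computes every gradient we derived by hand:
import torch

x_t = torch.tensor([[0.5], [0.1]])
y_true_t = torch.tensor([[1.0]])
W1_t = torch.randn(3, 2, requires_grad=True)
b1_t = torch.randn(3, 1, requires_grad=True)
W2_t = torch.randn(1, 3, requires_grad=True)
b2_t = torch.randn(1, 1, requires_grad=True)

a1_t = torch.sigmoid(W1_t @ x_t + b1_t)
y_pred_t = torch.sigmoid(W2_t @ a1_t + b2_t)
loss_t = 0.5 * ((y_pred_t - y_true_t) ** 2).sum()
loss_t.backward()   # autograd runs backpropagation for us
print(W2_t.grad)    # the same quantity we computed by hand as dL_dW2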
Next steps:
- Try implementing backpropagation for a network with ReLU instead of sigmoid (see the sketch after this list)
- Extend it to multiple inputs (batch training)
- Build a training loop for classification using cross-entropy loss
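As a starting point for the first item, here is a minimal sketch of ReLU and its derivative in the same style as the sigmoid helpers above. Only the hidden-layer activation and its derivative change; the structure of the backward pass stays the same. A common choice is to keep sigmoid on the output layer so the prediction stays between 0 and 1:
def relu(z):
    return np.maximum(0, z)

def relu_deriv(z):
    # 0 for negative inputs, 1 for positive inputs
    # (the value at exactly 0 is a matter of convention)
    return (z > 0).astype(float)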
You don’t need to memorize the math — you just need to understand the flow.
Happy backpropagating.