Neural Networks from Scratch: Understanding the Fundamentals

Forget TensorFlow and PyTorch for a moment. Let's build a neural network using nothing but Python and NumPy, understanding every single line of mathematics and code that makes intelligence emerge from numbers.

There's something almost magical about the moment a neural network starts learning. You initialise some random numbers, feed in data, and gradually those numbers arrange themselves into patterns that can recognise faces, translate languages, or predict stock prices. But that magic becomes far more profound when you understand exactly how it happens.

Most tutorials hand you TensorFlow or PyTorch and teach you to call high-level functions. That's practical, but it leaves a gap in understanding. When things go wrong, and they will, you're left debugging a black box. Today we're going to build a neural network from absolute fundamentals. By the end, you'll understand not just what each line does, but why it works.

What Is a Neural Network, Really?

Strip away the biological metaphors and the intimidating mathematics, and a neural network is fundamentally a function. It takes some input, transforms it through a series of operations, and produces an output. What makes it special is that the parameters controlling those transformations can be learned from data.

Think of it like this: imagine you had a complicated formula with thousands of adjustable knobs. Each knob changes how the formula behaves slightly. Given enough examples of inputs and their correct outputs, we can slowly adjust those knobs until the formula gives the right answer most of the time. That's neural network training in a nutshell.

The "neurons" are just numbers. The "connections" are just multiplications. The "learning" is just calculus. Once you see it that way, the mystery dissolves into elegant mathematics.

The Building Blocks

Let's start with the core components. A neural network consists of layers, and each layer performs a simple operation: multiply inputs by weights, add biases, and apply a non-linear activation function.

output = activation(inputs × weights + bias)

The weights and biases are the learnable parameters. The activation function introduces non-linearity, which is crucial because without it, stacking multiple layers would just collapse into a single linear transformation. You could have a hundred layers and they'd all simplify to one matrix multiplication. The non-linearity is what gives deep networks their power to learn complex patterns.
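To see that collapse in action, here's a quick standalone check (it uses NumPy ahead of the proper setup below): two stacked linear layers with random weights behave exactly like a single linear layer whose weight matrix is their product, and biases fold in the same way.

Python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # 5 examples, 3 features
W1 = rng.normal(size=(3, 4))   # first "layer"
W2 = rng.normal(size=(4, 2))   # second "layer"

# Two linear layers stacked without an activation...
two_layers = (X @ W1) @ W2

# ...are exactly one linear layer with weights W1 @ W2
one_layer = X @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True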

Let's start coding. First, we need to import NumPy and set up some helper functions.

Python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

def sigmoid(x):
    """Sigmoid activation: squashes values to (0, 1)"""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of sigmoid, used in backpropagation.
    Note: expects x to already be a sigmoid output, since
    d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    return x * (1 - x)

def relu(x):
    """ReLU activation: max(0, x)"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise"""
    return (x > 0).astype(float)

The sigmoid function is historically important and still useful for output layers in binary classification. It squashes any input to a value between 0 and 1, which can be interpreted as a probability. ReLU (Rectified Linear Unit) has become the default for hidden layers because it's computationally efficient and helps avoid the vanishing gradient problem we'll discuss later.
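A quick check with the helpers above makes that behaviour concrete: sigmoid pushes values towards 0 or 1, while ReLU simply clips negatives to zero.

Python
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

print(sigmoid(x))  # ~[0.007, 0.269, 0.5, 0.731, 0.993]: squashed into (0, 1)
print(relu(x))     # [0. 0. 0. 1. 5.]: negatives clipped to zero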

Building the Network Class

Now let's create a proper neural network class. We'll build something flexible that can handle any number of layers with any number of neurons. The key insight here is that each layer is just a matrix of weights connecting every neuron in the previous layer to every neuron in the current layer.

Python
class NeuralNetwork:
    def __init__(self, layer_sizes):
        """
        Initialize neural network with given layer sizes.
        layer_sizes: list of integers, e.g., [784, 128, 64, 10]
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)

        # Initialize weights with He initialization
        # This helps with training deep networks
        self.weights = []
        self.biases = []

        for i in range(self.num_layers - 1):
            # He initialization: scale by sqrt(2/n)
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)

Notice the weight initialisation. I'm using He initialisation, which scales the random weights by the square root of 2/n, where n is the number of neurons feeding into the layer. This seemingly arbitrary choice actually matters enormously. If weights are too large, activations explode through the network. Too small, and gradients vanish to nothing. He initialisation keeps things in a reasonable range, especially when using ReLU activations.
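You can watch this effect in isolation. The sketch below (separate from the class) pushes random data through a stack of ReLU layers and prints the scale of the activations, once with He scaling and once with a fixed small scale.

Python
rng = np.random.default_rng(0)

def deep_relu_pass(scale_fn, width=256, depth=20):
    """Run random data through `depth` ReLU layers and return the activation scale."""
    a = rng.normal(size=(64, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale_fn(width)
        a = np.maximum(0, a @ W)
    return a.std()

print(deep_relu_pass(lambda n: np.sqrt(2.0 / n)))  # He scaling: stays at an order-one scale
print(deep_relu_pass(lambda n: 0.01))              # fixed small scale: shrinks towards zero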

Forward Propagation

Forward propagation is the easy part. We simply pass the input through each layer, storing the activations as we go. We need to store these because backpropagation requires them.

Python
    def forward(self, X):
        """
        Forward pass through the network.
        Returns the output and stores activations for backprop.
        """
        self.activations = [X]
        self.z_values = []  # Pre-activation values

        current = X
        for i in range(self.num_layers - 1):
            # Linear transformation
            z = np.dot(current, self.weights[i]) + self.biases[i]
            self.z_values.append(z)

            # Activation (ReLU for hidden, sigmoid for output)
            if i == self.num_layers - 2:
                current = sigmoid(z)  # Output layer
            else:
                current = relu(z)     # Hidden layers

            self.activations.append(current)

        return current

For each layer, we compute z = xW + b (the weighted sum plus bias, with examples as rows to match the code), then apply the activation function. The final layer uses sigmoid to produce values between 0 and 1, while hidden layers use ReLU. We store both the pre-activation values (z) and the post-activation values; backpropagation needs the activations, and keeping the z values around is standard practice, since some formulations apply the activation derivative to z rather than to the activation.
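A quick shape check (assuming the class as defined above) shows the bookkeeping in action: one row per example flows through every layer, and the stored lists line up with the layer sizes.

Python
nn = NeuralNetwork([2, 4, 1])
X = np.array([[0.0, 1.0], [1.0, 1.0]])    # 2 examples, 2 features each

out = nn.forward(X)

print(out.shape)                           # (2, 1): one sigmoid output per example
print([a.shape for a in nn.activations])   # [(2, 2), (2, 4), (2, 1)]
print([z.shape for z in nn.z_values])      # [(2, 4), (2, 1)]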

The Heart of Learning: Backpropagation

This is where the real magic happens. Backpropagation is simply the chain rule of calculus applied systematically to compute how much each weight contributed to the error. Once we know that, we can adjust weights to reduce the error.

The intuition is this: the output has some error. That error came from the last layer's weights. But those activations came from the previous layer's weights. And so on, back through the network. We propagate the "blame" backwards, hence the name.

Python
    def backward(self, X, y, learning_rate=0.01):
        """
        Backpropagation to compute gradients and update weights.
        """
        m = X.shape[0]  # Number of examples

        # Compute output error
        output = self.activations[-1]
        error = output - y  # Derivative of cross-entropy + sigmoid

        # Backpropagate through layers
        deltas = [error]
        for i in range(self.num_layers - 2, 0, -1):
            # Compute delta for this layer
            delta = np.dot(deltas[-1], self.weights[i].T)
            delta *= relu_derivative(self.activations[i])
            deltas.append(delta)

        # Reverse to get correct order
        deltas = deltas[::-1]

        # Update weights and biases
        for i in range(self.num_layers - 1):
            # Gradient is activation.T dot delta
            weight_gradient = np.dot(self.activations[i].T, deltas[i]) / m
            bias_gradient = np.mean(deltas[i], axis=0, keepdims=True)

            # Gradient descent update
            self.weights[i] -= learning_rate * weight_gradient
            self.biases[i] -= learning_rate * bias_gradient

The key insight in backpropagation is how the delta (error signal) flows backwards. For the output layer, it's simply the difference between prediction and truth. For hidden layers, we take the delta from the layer above, multiply by the weights connecting them (which distributes the blame), then multiply by the derivative of the activation function (which accounts for how much each neuron contributed).
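If you want to convince yourself the deltas really do produce the right gradients, a finite-difference check is the standard trick: nudge one weight up and down, measure how the loss changes, and compare with what backpropagation computes. The sketch below assumes the class and helpers above are in scope; bce_loss and numerical_gradient are names introduced here just for this check.

Python
def bce_loss(net, X, y):
    """Same binary cross-entropy the training loop will use."""
    out = net.forward(X)
    return -np.mean(y * np.log(out + 1e-8) + (1 - y) * np.log(1 - out + 1e-8))

def numerical_gradient(net, X, y, layer, i, j, eps=1e-5):
    """Central-difference estimate of d(loss)/d(weights[layer][i, j])."""
    original = net.weights[layer][i, j]
    net.weights[layer][i, j] = original + eps
    loss_plus = bce_loss(net, X, y)
    net.weights[layer][i, j] = original - eps
    loss_minus = bce_loss(net, X, y)
    net.weights[layer][i, j] = original  # restore
    return (loss_plus - loss_minus) / (2 * eps)

# Compare the two on a tiny dataset, for a single weight
net = NeuralNetwork([2, 4, 1])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

estimate = numerical_gradient(net, X, y, layer=0, i=0, j=0)

# backward() applies w -= lr * grad, so one update lets us recover its gradient
net.forward(X)
before = net.weights[0][0, 0]
net.backward(X, y, learning_rate=1e-3)
analytic = (before - net.weights[0][0, 0]) / 1e-3

print(estimate, analytic)  # the two should agree to several decimal places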

Why the Derivative?

The activation function's derivative tells us how sensitive the output is to changes in input. If the derivative is large, small changes in weights will have big effects. If it's near zero (as happens with sigmoid for very large or small inputs), the gradient "vanishes" and learning stalls. This is the infamous vanishing gradient problem, and it's why ReLU became popular — its derivative is either 0 or 1, never vanishing for positive inputs.
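You can see the difference directly with the helpers defined earlier: for inputs far from zero, the sigmoid gradient is practically gone, while ReLU's is still exactly 1.

Python
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

s = sigmoid(z)
print(s * (1 - s))         # ~[0.000045, 0.105, 0.25, 0.105, 0.000045]: vanishes at the extremes
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]: stays at 1 for positive inputs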

Training Loop

With forward and backward passes implemented, training is straightforward. We repeatedly show the network examples, compute how wrong it is, and adjust weights to be less wrong next time.

Python
    def train(self, X, y, epochs=1000, learning_rate=0.01, verbose=True):
        """
        Train the network on given data.
        """
        history = []

        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)

            # Compute loss (binary cross-entropy)
            loss = -np.mean(y * np.log(output + 1e-8) + (1 - y) * np.log(1 - output + 1e-8))
            history.append(loss)

            # Backward pass
            self.backward(X, y, learning_rate)

            if verbose and epoch % 100 == 0:
                accuracy = np.mean((output > 0.5) == y)
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")

        return history

The loss function measures how wrong we are. Binary cross-entropy is the standard choice for classification because it has nice mathematical properties — it penalises confident wrong answers heavily, and its gradient works perfectly with sigmoid outputs. The small epsilon (1e-8) prevents taking the log of zero, which would give infinity.
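To make "penalises confident wrong answers heavily" concrete, here's what the loss looks like for a single positive example (y = 1) at a few prediction values:

Python
# For a positive example (y = 1), the loss is just -log(prediction)
print(-np.log(0.95))  # ~0.05 : confident and correct, tiny penalty
print(-np.log(0.5))   # ~0.69 : unsure, moderate penalty
print(-np.log(0.01))  # ~4.61 : confident and wrong, heavy penalty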

Putting It All Together

Let's test our network on a classic problem: XOR. This is actually historically significant. In 1969, Minsky and Papert showed that single-layer networks (perceptrons) cannot learn XOR, which nearly killed neural network research. But with hidden layers, it's trivial.

Python
# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Create network: 2 inputs, 4 hidden neurons, 1 output
nn = NeuralNetwork([2, 4, 1])

# Train
history = nn.train(X, y, epochs=5000, learning_rate=0.5)

# Test
predictions = nn.forward(X)
print("\nFinal predictions:")
for i in range(len(X)):
    print(f"{X[i]} -> {predictions[i][0]:.4f} (expected {y[i][0]})")

Run this code and you'll see the network learn XOR perfectly. The loss decreases, accuracy climbs to 100%, and the final predictions are close to the expected values. We've built a working neural network from scratch.

What We've Actually Built

Let's step back and appreciate what just happened. With about 100 lines of code and nothing but basic linear algebra, we've created a system that learns from examples. No rules were programmed — the network discovered how to solve XOR by adjusting its weights based on feedback.

This is the core of all deep learning. Everything else is optimisation and scale. Convolutional networks add structure that's good for images. Recurrent networks add memory for sequences. Transformers add attention mechanisms. But the fundamental loop — forward pass, compute loss, backward pass, update weights — remains the same.

      INPUT             HIDDEN            OUTPUT

      (x₁) ──┐       ┌─→ (h₁) ──┐
             ├───────┤           ├──────→ (y)
      (x₂) ──┘       └─→ (h₂) ──┘

            weights₁        weights₂

Going Further

This implementation is intentionally simple. Production neural networks include many enhancements: batch normalisation to stabilise training, dropout for regularisation, Adam optimiser for better convergence, mini-batch training for efficiency on large datasets. But every one of these is an addition to the foundation we've built, not a replacement for it.

If you want to extend this code, try adding momentum to the gradient updates. Momentum accumulates gradients over time, which helps the optimiser navigate ravines in the loss landscape and speeds up convergence significantly. It's a small change with a big impact.
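Here's one way that could look, sketched below: a minimal take on classical momentum, not part of the class above. It assumes you add two velocity lists in __init__ and call this helper in place of the plain gradient descent step at the end of backward(); momentum_update is a name made up for illustration.

Python
# Sketch only: two additions to the class, not a drop-in file.
# 1) In __init__, create velocity buffers alongside the weights and biases:
#        self.weight_velocity = [np.zeros_like(w) for w in self.weights]
#        self.bias_velocity   = [np.zeros_like(b) for b in self.biases]
# 2) In backward(), replace the plain update for layer i with a call to this:

def momentum_update(self, i, weight_gradient, bias_gradient,
                    learning_rate=0.01, beta=0.9):
    """Classical momentum: step along a decaying running sum of gradients."""
    self.weight_velocity[i] = beta * self.weight_velocity[i] + weight_gradient
    self.bias_velocity[i] = beta * self.bias_velocity[i] + bias_gradient

    self.weights[i] -= learning_rate * self.weight_velocity[i]
    self.biases[i] -= learning_rate * self.bias_velocity[i]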

Or try building a network that recognises handwritten digits using the MNIST dataset. You'll need to handle 784 inputs (28x28 pixel images), but the architecture is identical. The network we've built can scale to this problem without any changes to the core logic.
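As a rough starting point (assuming scikit-learn is installed for the download), the wiring could look like this. One simplification to note: the ten digit classes are treated as ten independent sigmoid outputs trained with the same binary cross-entropy, rather than the more usual softmax output. It still learns, but it isn't the textbook multi-class setup, and the hyperparameters below are illustrative only.

Python
from sklearn.datasets import fetch_openml  # assumes scikit-learn is available

# Download MNIST: 70,000 images of 28x28 pixels, flattened to 784 features
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0                          # scale pixel values to [0, 1]
y_onehot = np.eye(10)[y.astype(int)]   # one output column per digit

# Same class, bigger layers: 784 inputs, two hidden layers, 10 outputs
nn = NeuralNetwork([784, 128, 64, 10])
history = nn.train(X[:10000], y_onehot[:10000], epochs=200, learning_rate=0.5)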

Understanding neural networks at this level changes how you think about AI. The mystique dissolves, replaced by appreciation for elegant mathematics. When you encounter a complex architecture like GPT or Stable Diffusion, you can trace it back to these fundamentals. Forward pass, loss, backpropagation, gradient descent. Everything else is refinement.

The code from this article is a starting point. Experiment with it. Break it. Add to it. That's how understanding deepens — not from reading, but from doing. The mathematics only makes sense when you've watched the numbers dance.
