There's something almost magical about the moment a neural network starts learning. You initialise some random numbers, feed in data, and gradually those numbers arrange themselves into patterns that can recognise faces, translate languages, or predict stock prices. But that magic becomes far more profound when you understand exactly how it happens.
Most tutorials hand you TensorFlow or PyTorch and teach you to call high-level functions. That's practical, but it leaves a gap in understanding. When things go wrong, and they will, you're left debugging a black box. Today we're going to build a neural network from absolute fundamentals. By the end, you'll understand not just what each line does, but why it works.
What Is a Neural Network, Really?
Strip away the biological metaphors and the intimidating mathematics, and a neural network is fundamentally a function. It takes some input, transforms it through a series of operations, and produces an output. What makes it special is that the parameters controlling those transformations can be learned from data.
Think of it like this: imagine you had a complicated formula with thousands of adjustable knobs. Each knob changes how the formula behaves slightly. Given enough examples of inputs and their correct outputs, we can slowly adjust those knobs until the formula gives the right answer most of the time. That's neural network training in a nutshell.
The "neurons" are just numbers. The "connections" are just multiplications. The "learning" is just calculus. Once you see it that way, the mystery dissolves into elegant mathematics.
The Building Blocks
Let's start with the core components. A neural network consists of layers, and each layer performs a simple operation: multiply inputs by weights, add biases, and apply a non-linear activation function.
The weights and biases are the learnable parameters. The activation function introduces non-linearity, which is crucial because without it, stacking multiple layers would just collapse into a single linear transformation. You could have a hundred layers and they'd all simplify to one matrix multiplication. The non-linearity is what gives deep networks their power to learn complex patterns.
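You can see this collapse for yourself. The quick check below (jumping ahead to use NumPy, which we set up properly in a moment) stacks two purely linear layers and compares them with a single layer built from the product of their weight matrices:

```python
import numpy as np  # we'll import this properly in the next section

x = np.random.randn(5, 3)    # 5 examples, 3 features
W1 = np.random.randn(3, 8)   # first "layer" with no activation
W2 = np.random.randn(8, 2)   # second "layer" with no activation

stacked = np.dot(np.dot(x, W1), W2)     # two linear layers, one after the other
collapsed = np.dot(x, np.dot(W1, W2))   # a single layer with the combined weight matrix

print(np.allclose(stacked, collapsed))  # True: the stack was never more than one layer
```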
Let's start coding. First, we need to import NumPy and set up some helper functions.
```python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

def sigmoid(x):
    """Sigmoid activation: squashes values to (0, 1)"""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of sigmoid, used in backpropagation.
    Note: expects the sigmoid output, not the raw input."""
    return x * (1 - x)

def relu(x):
    """ReLU activation: max(0, x)"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise"""
    return (x > 0).astype(float)
```
The sigmoid function is historically important and still useful for output layers in binary classification. It squashes any input to a value between 0 and 1, which can be interpreted as a probability. ReLU (Rectified Linear Unit) has become the default for hidden layers because it's computationally efficient and helps avoid the vanishing gradient problem we'll discuss later.
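As a quick sanity check with the helpers above, here's how the two activations treat a handful of inputs; sigmoid saturates towards 0 and 1 at the extremes, while ReLU zeroes out negatives and passes positives through unchanged:

```python
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print(sigmoid(x))  # roughly [0.0000, 0.2689, 0.5, 0.7311, 1.0000]
print(relu(x))     # [ 0.  0.  0.  1. 10.]
```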
Building the Network Class
Now let's create a proper neural network class. We'll build something flexible that can handle any number of layers with any number of neurons. The key insight here is that each layer is just a matrix of weights connecting every neuron in the previous layer to every neuron in the current layer.
```python
class NeuralNetwork:
    def __init__(self, layer_sizes):
        """
        Initialize neural network with given layer sizes.
        layer_sizes: list of integers, e.g., [784, 128, 64, 10]
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)

        # Initialize weights with He initialization
        # This helps with training deep networks
        self.weights = []
        self.biases = []
        for i in range(self.num_layers - 1):
            # He initialization: scale by sqrt(2/n)
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)
```
Notice the weight initialisation. I'm using He initialisation, which scales the random weights by sqrt(2/n), where n is the number of neurons feeding into the layer. This seemingly arbitrary choice matters enormously. If the weights are too large, activations explode as they pass through the network; too small, and gradients vanish to nothing. He initialisation keeps things in a reasonable range, especially when using ReLU activations.
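A rough way to see this for yourself (the helper below is just for illustration, not part of the class) is to push random data through a stack of ReLU layers and watch what happens to the spread of the activations with and without the sqrt(2/n) scaling:

```python
def spread_after_layers(scale, n_layers=10, width=256):
    """Push random data through stacked ReLU layers and return the final std."""
    a = np.random.randn(100, width)
    for _ in range(n_layers):
        w = np.random.randn(width, width) * scale
        a = np.maximum(0, np.dot(a, w))
    return a.std()

print(spread_after_layers(np.sqrt(2.0 / 256)))  # stays at a stable, order-one scale
print(spread_after_layers(0.01))                # shrinks towards zero: the signal has died
```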
Forward Propagation
Forward propagation is the easy part. We simply pass the input through each layer, storing the activations as we go. We need to store these because backpropagation requires them.
```python
    # Inside the NeuralNetwork class...
    def forward(self, X):
        """
        Forward pass through the network.
        Returns the output and stores activations for backprop.
        """
        self.activations = [X]
        self.z_values = []  # Pre-activation values
        current = X
        for i in range(self.num_layers - 1):
            # Linear transformation
            z = np.dot(current, self.weights[i]) + self.biases[i]
            self.z_values.append(z)
            # Activation (ReLU for hidden, sigmoid for output)
            if i == self.num_layers - 2:
                current = sigmoid(z)  # Output layer
            else:
                current = relu(z)     # Hidden layers
            self.activations.append(current)
        return current
```
For each layer, we compute the weighted sum plus bias (z = XW + b in the row-vector convention the code uses), then apply the activation function. The final layer uses sigmoid to produce values between 0 and 1, while hidden layers use ReLU. We store both the pre-activation values (z) and the post-activation values because we'll need both for computing gradients.
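To make the shapes concrete, here's that single-layer computation on a tiny made-up batch, outside the class; the numbers are arbitrary:

```python
# A tiny batch: 2 examples with 2 input features, feeding a 3-neuron layer
X_batch = np.array([[1.0, 2.0],
                    [3.0, 4.0]])
W = np.array([[0.1, -0.2, 0.3],
              [0.4,  0.5, -0.6]])   # 2 inputs -> 3 neurons
b = np.array([[0.01, 0.02, 0.03]])  # one bias per neuron

z = np.dot(X_batch, W) + b          # weighted sums, shape (2, 3)
a = relu(z)                         # ReLU zeroes out any negative sums
print(z)
print(a)
```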
The Heart of Learning: Backpropagation
This is where the real magic happens. Backpropagation is simply the chain rule of calculus applied systematically to compute how much each weight contributed to the error. Once we know that, we can adjust weights to reduce the error.
The intuition is this: the output has some error. That error came from the last layer's weights. But those activations came from the previous layer's weights. And so on, back through the network. We propagate the "blame" backwards, hence the name.
Pythondef backward(self, X, y, learning_rate=0.01):
"""
Backpropagation to compute gradients and update weights.
"""
m = X.shape[0] # Number of examples
# Compute output error
output = self.activations[-1]
error = output - y # Derivative of cross-entropy + sigmoid
# Backpropagate through layers
deltas = [error]
for i in range(self.num_layers - 2, 0, -1):
# Compute delta for this layer
delta = np.dot(deltas[-1], self.weights[i].T)
delta *= relu_derivative(self.activations[i])
deltas.append(delta)
# Reverse to get correct order
deltas = deltas[::-1]
# Update weights and biases
for i in range(self.num_layers - 1):
# Gradient is activation.T dot delta
weight_gradient = np.dot(self.activations[i].T, deltas[i]) / m
bias_gradient = np.mean(deltas[i], axis=0, keepdims=True)
# Gradient descent update
self.weights[i] -= learning_rate * weight_gradient
self.biases[i] -= learning_rate * bias_gradient
The key insight in backpropagation is how the delta (error signal) flows backwards. For the output layer, it's simply the difference between prediction and truth. For hidden layers, we take the delta from the layer above, multiply by the weights connecting them (which distributes the blame), then multiply by the derivative of the activation function (which accounts for how much each neuron contributed).
The activation function's derivative tells us how sensitive the output is to changes in input. If the derivative is large, small changes in weights will have big effects. If it's near zero (as happens with sigmoid for very large or small inputs), the gradient "vanishes" and learning stalls. This is the infamous vanishing gradient problem, and it's why ReLU became popular — its derivative is either 0 or 1, never vanishing for positive inputs.
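You can check this numerically with the helper functions from earlier; the sigmoid slope here is computed directly as sigmoid(z) * (1 - sigmoid(z)):

```python
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sig = sigmoid(z)
print(sig * (1 - sig))     # sigmoid's slope: at most 0.25 (at z = 0), nearly zero at the extremes
print(relu_derivative(z))  # [0. 0. 0. 1. 1.] -- exactly 1 for every positive input
```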
Training Loop
With forward and backward passes implemented, training is straightforward. We repeatedly show the network examples, compute how wrong it is, and adjust weights to be less wrong next time.
```python
    # Inside the NeuralNetwork class...
    def train(self, X, y, epochs=1000, learning_rate=0.01, verbose=True):
        """
        Train the network on given data.
        """
        history = []
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)

            # Compute loss (binary cross-entropy)
            loss = -np.mean(y * np.log(output + 1e-8) +
                            (1 - y) * np.log(1 - output + 1e-8))
            history.append(loss)

            # Backward pass
            self.backward(X, y, learning_rate)

            if verbose and epoch % 100 == 0:
                accuracy = np.mean((output > 0.5) == y)
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")
        return history
```
The loss function measures how wrong we are. Binary cross-entropy is the standard choice for classification because it has nice mathematical properties — it penalises confident wrong answers heavily, and its gradient works perfectly with sigmoid outputs. The small epsilon (1e-8) prevents taking the log of zero, which would give infinity.
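To see the "penalises confident wrong answers" behaviour in numbers, here's the same formula unrolled for a few predictions against a true label of 1:

```python
y_true = 1.0
for p in [0.9, 0.6, 0.1, 0.01]:
    loss = -(y_true * np.log(p + 1e-8) + (1 - y_true) * np.log(1 - p + 1e-8))
    print(f"prediction {p:.2f} -> loss {loss:.3f}")
# A confident correct answer (0.9) costs ~0.105; a confident wrong one (0.01) costs ~4.6
```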
Putting It All Together
Let's test our network on a classic problem: XOR. This is actually historically significant. In 1969, Minsky and Papert showed that single-layer networks (perceptrons) cannot learn XOR, which nearly killed neural network research. But with hidden layers, it's trivial.
```python
# XOR dataset
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([[0],
              [1],
              [1],
              [0]])

# Create network: 2 inputs, 4 hidden neurons, 1 output
nn = NeuralNetwork([2, 4, 1])

# Train
history = nn.train(X, y, epochs=5000, learning_rate=0.5)

# Test
predictions = nn.forward(X)
print("\nFinal predictions:")
for i in range(len(X)):
    print(f"{X[i]} -> {predictions[i][0]:.4f} (expected {y[i][0]})")
```
Run this code and you'll see the network learn XOR perfectly. The loss decreases, accuracy climbs to 100%, and the final predictions are close to the expected values. We've built a working neural network from scratch.
What We've Actually Built
Let's step back and appreciate what just happened. With about 100 lines of code and nothing but basic linear algebra, we've created a system that learns from examples. No rules were programmed — the network discovered how to solve XOR by adjusting its weights based on feedback.
This is the core of all deep learning. Everything else is optimisation and scale. Convolutional networks add structure that's good for images. Recurrent networks add memory for sequences. Transformers add attention mechanisms. But the fundamental loop — forward pass, compute loss, backward pass, update weights — remains the same.
Here's the data flow in a miniature version of such a network: every input feeds every hidden neuron, and every hidden neuron feeds the output.

```
   INPUT            HIDDEN          OUTPUT

   (x₁) ────┬─────→ (h₁) ────┬─────→ (y)
            │         ↑      │         ↑
            ├─→───────┘      │         │
            │                │         │
   (x₂) ────┴─────→ (h₂) ────┴─────────┘
         weights₁         weights₂
```
Going Further
This implementation is intentionally simple. Production neural networks include many enhancements: batch normalisation to stabilise training, dropout for regularisation, Adam optimiser for better convergence, mini-batch training for efficiency on large datasets. But every one of these is an addition to the foundation we've built, not a replacement for it.
If you want to extend this code, try adding momentum to the gradient updates. Momentum accumulates gradients over time, which helps the optimiser navigate ravines in the loss landscape and speeds up convergence significantly. It's a small change with a big impact.
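Here's a minimal sketch of the idea on a toy one-dimensional problem rather than on the network itself; the function and variable names are mine, and 0.9 is just a common default for the momentum coefficient:

```python
# Plain gradient descent vs. momentum on f(w) = w**2, starting from w = 10
# with a deliberately small learning rate.
def minimise(use_momentum, steps=50, lr=0.01, beta=0.9):
    w, velocity = 10.0, 0.0
    for _ in range(steps):
        grad = 2 * w                        # derivative of w**2
        if use_momentum:
            velocity = beta * velocity + grad
            w -= lr * velocity
        else:
            w -= lr * grad
    return w

print(minimise(use_momentum=False))  # still a few units away from the minimum at 0
print(minimise(use_momentum=True))   # noticeably closer to 0 after the same number of steps
```

The velocity term keeps pushing in directions where successive gradients agree, which is exactly what helps in the long, narrow valleys of real loss landscapes.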
Or try building a network that recognises handwritten digits using the MNIST dataset. You'll need to handle 784 inputs (28x28 pixel images), but the architecture is identical. The network we've built can scale to this problem without any changes to the core logic.
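A sketch of what that could look like, assuming you've already loaded and flattened the images into placeholder arrays X_train and X_test (pixel values scaled to 0 to 1) and one-hot encoded the labels as y_train and y_test. Note that the sigmoid-plus-cross-entropy output treats each digit as its own yes/no question, which works for this sketch, though softmax is the more standard choice for multi-class problems:

```python
# Placeholder names: X_train/X_test are (n, 784) arrays of scaled pixels,
# y_train/y_test are (n, 10) one-hot encoded digit labels.
nn = NeuralNetwork([784, 128, 64, 10])
history = nn.train(X_train, y_train, epochs=200, learning_rate=0.1)  # tune these

# Pick the digit with the highest output score for each test image
predictions = nn.forward(X_test)
predicted_digits = np.argmax(predictions, axis=1)
accuracy = np.mean(predicted_digits == np.argmax(y_test, axis=1))
print(f"Test accuracy: {accuracy:.4f}")
```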
Understanding neural networks at this level changes how you think about AI. The mystique dissolves, replaced by appreciation for elegant mathematics. When you encounter a complex architecture like GPT or Stable Diffusion, you can trace it back to these fundamentals. Forward pass, loss, backpropagation, gradient descent. Everything else is refinement.
The code from this article is a starting point. Experiment with it. Break it. Add to it. That's how understanding deepens — not from reading, but from doing. The mathematics only makes sense when you've watched the numbers dance.