Neural Networks from Scratch: Understanding the Fundamentals

Forget TensorFlow and PyTorch for a moment. Let's build a neural network using nothing but Python and NumPy, understanding every single line of mathematics and code that makes intelligence emerge from numbers.

There's something almost magical about the moment a neural network starts learning. You initialise some random numbers, feed in data, and gradually those numbers arrange themselves into patterns that can recognise faces, translate languages, or predict stock prices. But that magic becomes far more profound when you understand exactly how it happens.

Most tutorials hand you TensorFlow or PyTorch and teach you to call high-level functions. That's practical, but it leaves a gap in understanding. When things go wrong, and they will, you're left debugging a black box. Today we're going to build a neural network from absolute fundamentals. By the end, you'll understand not just what each line does, but why it works.

What Is a Neural Network, Really?

Strip away the biological metaphors and the intimidating mathematics, and a neural network is fundamentally a function. It takes some input, transforms it through a series of operations, and produces an output. What makes it special is that the parameters controlling those transformations can be learned from data.

Think of it like this: imagine you had a complicated formula with thousands of adjustable knobs. Each knob changes how the formula behaves slightly. Given enough examples of inputs and their correct outputs, we can slowly adjust those knobs until the formula gives the right answer most of the time. That's neural network training in a nutshell.

The "neurons" are just numbers. The "connections" are just multiplications. The "learning" is just calculus. Once you see it that way, the mystery dissolves into elegant mathematics.

The Building Blocks

Let's start with the core components. A neural network consists of layers, and each layer performs a simple operation: multiply inputs by weights, add biases, and apply a non-linear activation function.

output = activation(inputs × weights + bias)

The weights and biases are the learnable parameters. The activation function introduces non-linearity, which is crucial because without it, stacking multiple layers would just collapse into a single linear transformation. You could have a hundred layers and they'd all simplify to one matrix multiplication. The non-linearity is what gives deep networks their power to learn complex patterns.
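To see that collapse in action, here's a quick standalone check (it uses NumPy ahead of the proper setup below): two stacked linear layers with random weights behave exactly like a single linear layer whose weight matrix is their product, and biases fold in the same way.

Python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # 5 examples, 3 features
W1 = rng.normal(size=(3, 4))   # first "layer"
W2 = rng.normal(size=(4, 2))   # second "layer"

# Two linear layers stacked without an activation...
two_layers = (X @ W1) @ W2

# ...are exactly one linear layer with weights W1 @ W2
one_layer = X @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True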

Let's start coding. First, we need to import NumPy and set up some helper functions.

Python
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

def sigmoid(x):
    """Sigmoid activation: squashes values to (0, 1)"""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of sigmoid, used in backpropagation.
    Note: expects x to already be a sigmoid output, since
    d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    return x * (1 - x)

def relu(x):
    """ReLU activation: max(0, x)"""
    return np.maximum(0, x)

def relu_derivative(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise"""
    return (x > 0).astype(float)

The sigmoid function is historically important and still useful for output layers in binary classification. It squashes any input to a value between 0 and 1, which can be interpreted as a probability. ReLU (Rectified Linear Unit) has become the default for hidden layers because it's computationally efficient and helps avoid the vanishing gradient problem we'll discuss later.
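A quick check with the helpers above makes that behaviour concrete: sigmoid pushes values towards 0 or 1, while ReLU simply clips negatives to zero.

Python
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

print(sigmoid(x))  # ~[0.007, 0.269, 0.5, 0.731, 0.993]: squashed into (0, 1)
print(relu(x))     # [0. 0. 0. 1. 5.]: negatives clipped to zero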

Building the Network Class

Now let's create a proper neural network class. We'll build something flexible that can handle any number of layers with any number of neurons. The key insight here is that each layer is just a matrix of weights connecting every neuron in the previous layer to every neuron in the current layer.

Python
class NeuralNetwork:
    def __init__(self, layer_sizes):
        """
        Initialize neural network with given layer sizes.
        layer_sizes: list of integers, e.g., [784, 128, 64, 10]
        """
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes)

        # Initialize weights with He initialization
        # This helps with training deep networks
        self.weights = []
        self.biases = []

        for i in range(self.num_layers - 1):
            # He initialization: scale by sqrt(2/n)
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)

Notice the weight initialisation. I'm using He initialisation, which scales the random weights by the square root of 2/n, where n is the number of neurons feeding into the layer. This seemingly arbitrary choice actually matters enormously. If weights are too large, activations explode through the network. Too small, and gradients vanish to nothing. He initialisation keeps things in a reasonable range, especially when using ReLU activations.
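You can watch this effect in isolation. The sketch below (separate from the class) pushes random data through a stack of ReLU layers and prints the scale of the activations, once with He scaling and once with a fixed small scale.

Python
rng = np.random.default_rng(0)

def deep_relu_pass(scale_fn, width=256, depth=20):
    """Run random data through `depth` ReLU layers and return the activation scale."""
    a = rng.normal(size=(64, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale_fn(width)
        a = np.maximum(0, a @ W)
    return a.std()

print(deep_relu_pass(lambda n: np.sqrt(2.0 / n)))  # He scaling: stays at an order-one scale
print(deep_relu_pass(lambda n: 0.01))              # fixed small scale: shrinks towards zero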

Forward Propagation

Forward propagation is the easy part. We simply pass the input through each layer, storing the activations as we go. We need to store these because backpropagation requires them.

Python
    def forward(self, X):
        """
        Forward pass through the network.
        Returns the output and stores activations for backprop.
        """
        self.activations = [X]
        self.z_values = []  # Pre-activation values

        current = X
        for i in range(self.num_layers - 1):
            # Linear transformation
            z = np.dot(current, self.weights[i]) + self.biases[i]
            self.z_values.append(z)

            # Activation (ReLU for hidden, sigmoid for output)
            if i == self.num_layers - 2:
                current = sigmoid(z)  # Output layer
            else:
                current = relu(z)     # Hidden layers

            self.activations.append(current)

        return current

For each layer, we compute z = xW + b (the weighted sum plus bias, with examples as rows to match the code), then apply the activation function. The final layer uses sigmoid to produce values between 0 and 1, while hidden layers use ReLU. We store both the pre-activation values (z) and the post-activation values; backpropagation needs the activations, and keeping the z values around is standard practice, since some formulations apply the activation derivative to z rather than to the activation.
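A quick shape check (assuming the class as defined above) shows the bookkeeping in action: one row per example flows through every layer, and the stored lists line up with the layer sizes.

Python
nn = NeuralNetwork([2, 4, 1])
X = np.array([[0.0, 1.0], [1.0, 1.0]])    # 2 examples, 2 features each

out = nn.forward(X)

print(out.shape)                           # (2, 1): one sigmoid output per example
print([a.shape for a in nn.activations])   # [(2, 2), (2, 4), (2, 1)]
print([z.shape for z in nn.z_values])      # [(2, 4), (2, 1)]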

The Heart of Learning: Backpropagation

This is where the real magic happens. Backpropagation is simply the chain rule of calculus applied systematically to compute how much each weight contributed to the error. Once we know that, we can adjust weights to reduce the error.

The intuition is this: the output has some error. That error came from the last layer's weights. But those activations came from the previous layer's weights. And so on, back through the network. We propagate the "blame" backwards, hence the name.

Python
    def backward(self, X, y, learning_rate=0.01):
        """
        Backpropagation to compute gradients and update weights.
        """
        m = X.shape[0]  # Number of examples

        # Compute output error
        output = self.activations[-1]
        error = output - y  # Derivative of cross-entropy + sigmoid

        # Backpropagate through layers
        deltas = [error]
        for i in range(self.num_layers - 2, 0, -1):
            # Compute delta for this layer
            delta = np.dot(deltas[-1], self.weights[i].T)
            delta *= relu_derivative(self.activations[i])
            deltas.append(delta)

        # Reverse to get correct order
        deltas = deltas[::-1]

        # Update weights and biases
        for i in range(self.num_layers - 1):
            # Gradient is activation.T dot delta
            weight_gradient = np.dot(self.activations[i].T, deltas[i]) / m
            bias_gradient = np.mean(deltas[i], axis=0, keepdims=True)

            # Gradient descent update
            self.weights[i] -= learning_rate * weight_gradient
            self.biases[i] -= learning_rate * bias_gradient

The key insight in backpropagation is how the delta (error signal) flows backwards. For the output layer, it's simply the difference between prediction and truth. For hidden layers, we take the delta from the layer above, multiply by the weights connecting them (which distributes the blame), then multiply by the derivative of the activation function (which accounts for how much each neuron contributed).
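If you want to convince yourself the deltas really do produce the right gradients, a finite-difference check is the standard trick: nudge one weight up and down, measure how the loss changes, and compare with what backpropagation computes. The sketch below assumes the class and helpers above are in scope; bce_loss and numerical_gradient are names introduced here just for this check.

Python
def bce_loss(net, X, y):
    """Same binary cross-entropy the training loop will use."""
    out = net.forward(X)
    return -np.mean(y * np.log(out + 1e-8) + (1 - y) * np.log(1 - out + 1e-8))

def numerical_gradient(net, X, y, layer, i, j, eps=1e-5):
    """Central-difference estimate of d(loss)/d(weights[layer][i, j])."""
    original = net.weights[layer][i, j]
    net.weights[layer][i, j] = original + eps
    loss_plus = bce_loss(net, X, y)
    net.weights[layer][i, j] = original - eps
    loss_minus = bce_loss(net, X, y)
    net.weights[layer][i, j] = original  # restore
    return (loss_plus - loss_minus) / (2 * eps)

# Compare the two on a tiny dataset, for a single weight
net = NeuralNetwork([2, 4, 1])
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

estimate = numerical_gradient(net, X, y, layer=0, i=0, j=0)

# backward() applies w -= lr * grad, so one update lets us recover its gradient
net.forward(X)
before = net.weights[0][0, 0]
net.backward(X, y, learning_rate=1e-3)
analytic = (before - net.weights[0][0, 0]) / 1e-3

print(estimate, analytic)  # the two should agree to several decimal places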

Why the Derivative?

The activation function's derivative tells us how sensitive the output is to changes in input. If the derivative is large, small changes in weights will have big effects. If it's near zero (as happens with sigmoid for very large or small inputs), the gradient "vanishes" and learning stalls. This is the infamous vanishing gradient problem, and it's why ReLU became popular — its derivative is either 0 or 1, never vanishing for positive inputs.
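You can see the difference directly with the helpers defined earlier: for inputs far from zero, the sigmoid gradient is practically gone, while ReLU's is still exactly 1.

Python
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

s = sigmoid(z)
print(s * (1 - s))         # ~[0.000045, 0.105, 0.25, 0.105, 0.000045]: vanishes at the extremes
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]: stays at 1 for positive inputs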

Training Loop

With forward and backward passes implemented, training is straightforward. We repeatedly show the network examples, compute how wrong it is, and adjust weights to be less wrong next time.

Python
    def train(self, X, y, epochs=1000, learning_rate=0.01, verbose=True):
        """
        Train the network on given data.
        """
        history = []

        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)

            # Compute loss (binary cross-entropy)
            loss = -np.mean(y * np.log(output + 1e-8) + (1 - y) * np.log(1 - output + 1e-8))
            history.append(loss)

            # Backward pass
            self.backward(X, y, learning_rate)

            if verbose and epoch % 100 == 0:
                accuracy = np.mean((output > 0.5) == y)
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")

        return history

The loss function measures how wrong we are. Binary cross-entropy is the standard choice for classification because it has nice mathematical properties — it penalises confident wrong answers heavily, and its gradient works perfectly with sigmoid outputs. The small epsilon (1e-8) prevents taking the log of zero, which would give infinity.
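To make "penalises confident wrong answers heavily" concrete, here's what the loss looks like for a single positive example (y = 1) at a few prediction values:

Python
# For a positive example (y = 1), the loss is just -log(prediction)
print(-np.log(0.95))  # ~0.05 : confident and correct, tiny penalty
print(-np.log(0.5))   # ~0.69 : unsure, moderate penalty
print(-np.log(0.01))  # ~4.61 : confident and wrong, heavy penalty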

Putting It All Together

Let's test our network on a classic problem: XOR. This is actually historically significant. In 1969, Minsky and Papert showed that single-layer networks (perceptrons) cannot learn XOR, which nearly killed neural network research. But with hidden layers, it's trivial.

Python
# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Create network: 2 inputs, 4 hidden neurons, 1 output
nn = NeuralNetwork([2, 4, 1])

# Train
history = nn.train(X, y, epochs=5000, learning_rate=0.5)

# Test
predictions = nn.forward(X)
print("\nFinal predictions:")
for i in range(len(X)):
    print(f"{X[i]} -> {predictions[i][0]:.4f} (expected {y[i][0]})")

Run this code and you'll see the network learn XOR perfectly. The loss decreases, accuracy climbs to 100%, and the final predictions are close to the expected values. We've built a working neural network from scratch.

What We've Actually Built

Let's step back and appreciate what just happened. With about 100 lines of code and nothing but basic linear algebra, we've created a system that learns from examples. No rules were programmed — the network discovered how to solve XOR by adjusting its weights based on feedback.

This is the core of all deep learning. Everything else is optimisation and scale. Convolutional networks add structure that's good for images. Recurrent networks add memory for sequences. Transformers add attention mechanisms. But the fundamental loop — forward pass, compute loss, backward pass, update weights — remains the same.

      INPUT             HIDDEN            OUTPUT

      (x₁) ──┐       ┌─→ (h₁) ──┐
             ├───────┤           ├──────→ (y)
      (x₂) ──┘       └─→ (h₂) ──┘

            weights₁        weights₂

Going Further

This implementation is intentionally simple. Production neural networks include many enhancements: batch normalisation to stabilise training, dropout for regularisation, Adam optimiser for better convergence, mini-batch training for efficiency on large datasets. But every one of these is an addition to the foundation we've built, not a replacement for it.

If you want to extend this code, try adding momentum to the gradient updates. Momentum accumulates gradients over time, which helps the optimiser navigate ravines in the loss landscape and speeds up convergence significantly. It's a small change with a big impact.
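Here's one way that could look, sketched below: a minimal take on classical momentum, not part of the class above. It assumes you add two velocity lists in __init__ and call this helper in place of the plain gradient descent step at the end of backward(); momentum_update is a name made up for illustration.

Python
# Sketch only: two additions to the class, not a drop-in file.
# 1) In __init__, create velocity buffers alongside the weights and biases:
#        self.weight_velocity = [np.zeros_like(w) for w in self.weights]
#        self.bias_velocity   = [np.zeros_like(b) for b in self.biases]
# 2) In backward(), replace the plain update for layer i with a call to this:

def momentum_update(self, i, weight_gradient, bias_gradient,
                    learning_rate=0.01, beta=0.9):
    """Classical momentum: step along a decaying running sum of gradients."""
    self.weight_velocity[i] = beta * self.weight_velocity[i] + weight_gradient
    self.bias_velocity[i] = beta * self.bias_velocity[i] + bias_gradient

    self.weights[i] -= learning_rate * self.weight_velocity[i]
    self.biases[i] -= learning_rate * self.bias_velocity[i]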

Or try building a network that recognises handwritten digits using the MNIST dataset. You'll need to handle 784 inputs (28x28 pixel images), but the architecture is identical. The network we've built can scale to this problem without any changes to the core logic.
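As a rough starting point (assuming scikit-learn is installed for the download), the wiring could look like this. One simplification to note: the ten digit classes are treated as ten independent sigmoid outputs trained with the same binary cross-entropy, rather than the more usual softmax output. It still learns, but it isn't the textbook multi-class setup, and the hyperparameters below are illustrative only.

Python
from sklearn.datasets import fetch_openml  # assumes scikit-learn is available

# Download MNIST: 70,000 images of 28x28 pixels, flattened to 784 features
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0                          # scale pixel values to [0, 1]
y_onehot = np.eye(10)[y.astype(int)]   # one output column per digit

# Same class, bigger layers: 784 inputs, two hidden layers, 10 outputs
nn = NeuralNetwork([784, 128, 64, 10])
history = nn.train(X[:10000], y_onehot[:10000], epochs=200, learning_rate=0.5)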

Understanding neural networks at this level changes how you think about AI. The mystique dissolves, replaced by appreciation for elegant mathematics. When you encounter a complex architecture like GPT or Stable Diffusion, you can trace it back to these fundamentals. Forward pass, loss, backpropagation, gradient descent. Everything else is refinement.

The code from this article is a starting point. Experiment with it. Break it. Add to it. That's how understanding deepens — not from reading, but from doing. The mathematics only makes sense when you've watched the numbers dance.
