Imagine you're blindfolded on a hilly landscape, and your goal is to find the lowest valley. You can't see anything, but you can feel the slope beneath your feet. The sensible strategy is simple: take a step in whichever direction goes downhill. Repeat until you can't go down anymore.
That's gradient descent. The landscape is the loss function — a mathematical surface where height represents how wrong your model is. The valley is the optimal solution where error is minimised. And the slope you feel is the gradient, telling you which direction leads downward.
This deceptively simple idea powers virtually all modern machine learning. From spam filters to self-driving cars, from language models to drug discovery algorithms, gradient descent is the common thread. Let's understand it properly.
The Mathematics of Downhill
A gradient is a vector of partial derivatives. If that sounds intimidating, think of it this way: you have a function with multiple inputs, and the gradient tells you how much the output changes when you tweak each input slightly. It's a generalisation of the ordinary derivative you might remember from calculus.
For a function f(x, y), the gradient is the vector of its partial derivatives:

∇f(x, y) = (∂f/∂x, ∂f/∂y)
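One way to build intuition is to approximate a gradient numerically: nudge each input a little, one at a time, and see how the output responds. The sketch below (an illustration added here, not part of the article's running example) does exactly that with central differences; the step size h is an arbitrary small value.

import numpy as np

def numerical_gradient(f, point, h=1e-5):
    """Estimate each partial derivative by nudging one input at a time."""
    point = np.array(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(len(point)):
        step = np.zeros_like(point)
        step[i] = h
        grad[i] = (f(point + step) - f(point - step)) / (2 * h)
    return grad

# For f(x, y) = x² + y², the partial derivatives are 2x and 2y
print(numerical_gradient(lambda p: p[0]**2 + p[1]**2, [3.0, 4.0]))  # ≈ [6. 8.]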
The crucial property of the gradient is that it points in the direction of steepest increase. So if we want to decrease a function (our loss), we move in the opposite direction: the negative gradient. This gives us the fundamental update rule:

θ = θ - α × ∇L(θ)
Here, θ represents our model's parameters (weights and biases), α is the learning rate (how big a step we take), and ∇L(θ) is the gradient of the loss with respect to those parameters. Each iteration, we move our parameters a little bit in the direction that reduces the loss.
Implementing Basic Gradient Descent
Let's start with a concrete example. We'll find the minimum of a simple function: f(x) = x² + 5x + 6. We know from algebra that the minimum is at x = -2.5, but let's have gradient descent discover this.
import numpy as np

def f(x):
    """Our function to minimize"""
    return x**2 + 5*x + 6

def gradient_f(x):
    """Derivative of f: d/dx(x² + 5x + 6) = 2x + 5"""
    return 2*x + 5

def gradient_descent(starting_point, learning_rate, num_iterations):
    x = starting_point
    history = [x]
    for i in range(num_iterations):
        grad = gradient_f(x)
        x = x - learning_rate * grad
        history.append(x)
        if i % 10 == 0:
            print(f"Iteration {i}: x = {x:.6f}, f(x) = {f(x):.6f}")
    return x, history

# Run gradient descent
minimum, history = gradient_descent(
    starting_point=10.0,
    learning_rate=0.1,
    num_iterations=50
)

print(f"\nFinal result: x = {minimum:.6f}")
print(f"Expected minimum: x = -2.5")
Running this code, you'll watch x rapidly converge from 10 to approximately -2.5. The gradient at x = 10 is 2(10) + 5 = 25, which is large and positive, so we move strongly in the negative direction. As we approach the minimum, the gradient shrinks, and our steps become smaller. Eventually, we settle at the bottom.
The Learning Rate Dilemma
The learning rate α is perhaps the most critical hyperparameter in all of machine learning. Too large, and you'll overshoot the minimum, bouncing around chaotically or even diverging to infinity. Too small, and training takes forever, potentially getting stuck in shallow local minima.
# Demonstrating different learning rates
print("Learning rate too small (0.001):")
x, _ = gradient_descent(10.0, 0.001, 50)
print(f"After 50 iterations: x = {x:.4f}\n")
print("Learning rate good (0.1):")
x, _ = gradient_descent(10.0, 0.1, 50)
print(f"After 50 iterations: x = {x:.4f}\n")
print("Learning rate too large (1.1):")
x, _ = gradient_descent(10.0, 1.1, 50)
print(f"After 50 iterations: x = {x:.4f}")
With a learning rate of 0.001, progress is agonisingly slow: after 50 iterations, x has barely moved. With 0.1, convergence is rapid and stable. With 1.1, every step overshoots the minimum, the oscillations grow, and x ends up further from the minimum than where it started.
Start with a learning rate around 0.001 to 0.01 for most problems. If training is unstable (loss jumping around), reduce it. If training is too slow (loss barely changing), increase it. Many practitioners use learning rate schedules that start high and decrease over time, getting the best of both worlds.
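As a rough illustration of such a schedule, here is a minimal sketch of exponential decay; the starting rate and decay factor below are example values, not recommendations.

def exponential_decay(initial_lr, decay_rate, step):
    """Learning rate after `step` updates: shrinks by a factor of decay_rate each step."""
    return initial_lr * (decay_rate ** step)

for step in [0, 10, 50, 100]:
    print(f"Step {step:3d}: lr = {exponential_decay(0.1, 0.97, step):.5f}")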
Stochastic and Mini-Batch Gradient Descent
In the examples above, we computed the gradient using the entire function at once. But real machine learning involves millions of training examples. Computing the gradient over all of them for every single step would be prohibitively slow.
The solution is stochastic gradient descent (SGD). Instead of computing the exact gradient, we estimate it using a random subset (batch) of the training data. This estimate is noisy, but on average it points in the right direction. The noise actually helps — it can bounce us out of shallow local minima that would trap batch gradient descent.
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=100, batch_size=32):
    """
    Train linear regression using mini-batch SGD.
    """
    n_samples, n_features = X.shape
    # Initialize weights randomly
    w = np.random.randn(n_features)
    b = 0.0
    losses = []
    n_batches = int(np.ceil(n_samples / batch_size))
    for epoch in range(epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        epoch_loss = 0
        # Process mini-batches
        for i in range(0, n_samples, batch_size):
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]
            # Forward pass: predictions
            predictions = np.dot(X_batch, w) + b
            # Compute loss (MSE)
            loss = np.mean((predictions - y_batch) ** 2)
            epoch_loss += loss
            # Compute gradients
            error = predictions - y_batch
            grad_w = (2 / len(X_batch)) * np.dot(X_batch.T, error)
            grad_b = (2 / len(X_batch)) * np.sum(error)
            # Update weights
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        # Average loss per mini-batch for this epoch
        losses.append(epoch_loss / n_batches)
    return w, b, losses

# Example usage
np.random.seed(42)
X = np.random.randn(1000, 5)
true_w = np.array([2, -1, 0.5, 3, -0.5])
y = np.dot(X, true_w) + 0.1 * np.random.randn(1000)

w, b, losses = sgd_linear_regression(X, y, learning_rate=0.1, epochs=100)
print(f"Learned weights: {w}")
print(f"True weights: {true_w}")
The mini-batch approach strikes a balance. Pure SGD (batch size of 1) is maximally noisy but allows for very frequent updates. Full-batch gradient descent is stable but slow and memory-intensive. Mini-batches of 32 to 256 samples typically work well, benefiting from vectorised computation while maintaining enough stochasticity to escape local minima.
Momentum: Learning from Physics
Standard gradient descent treats each step independently. But imagine a ball rolling down a hill — it builds up speed, using its momentum to push through small bumps and flat regions. We can add this physics-inspired behaviour to our optimisation.
Momentum accumulates gradients from previous steps, creating a "velocity" that smooths out the optimisation trajectory:
v = β × v + ∇L(θ)
θ = θ - α × v
Here, β is typically 0.9, meaning we retain 90% of our previous velocity. This helps in two ways: it accelerates convergence in consistent directions, and it dampens oscillations in directions where the gradient keeps changing sign.
def gradient_descent_momentum(starting_point, learning_rate, momentum, num_iterations):
    x = starting_point
    v = 0  # Initial velocity
    history = [x]
    for i in range(num_iterations):
        grad = gradient_f(x)
        # Update velocity
        v = momentum * v + grad
        # Update position using velocity
        x = x - learning_rate * v
        history.append(x)
    return x, history

# Compare with and without momentum
x_no_momentum, hist1 = gradient_descent(10.0, 0.05, 30)
x_with_momentum, hist2 = gradient_descent_momentum(10.0, 0.05, 0.9, 30)

print(f"Without momentum: {x_no_momentum:.6f}")
print(f"With momentum: {x_with_momentum:.6f}")
Adam: The Modern Standard
Adam (Adaptive Moment Estimation) combines momentum with another powerful idea: adaptive learning rates for each parameter. Some parameters might need bigger updates, others smaller. Adam tracks both the first moment (mean of gradients, like momentum) and the second moment (variance of gradients) to adjust appropriately.
def adam(gradient_func, x_init, learning_rate=0.001, beta1=0.9, beta2=0.999,
         epsilon=1e-8, num_iterations=100):
    """
    Adam optimizer implementation.
    """
    x = x_init
    m = 0  # First moment
    v = 0  # Second moment
    for t in range(1, num_iterations + 1):
        grad = gradient_func(x)
        # Update biased first moment estimate
        m = beta1 * m + (1 - beta1) * grad
        # Update biased second moment estimate
        v = beta2 * v + (1 - beta2) * (grad ** 2)
        # Bias correction
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Update parameters
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x

result = adam(gradient_f, 10.0, learning_rate=0.5, num_iterations=100)
print(f"Adam result: {result:.6f}")
Adam has become the default optimizer for most deep learning tasks. The default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸) work well across a remarkably wide range of problems. When in doubt, use Adam.
The Loss Landscape
Understanding gradient descent also means understanding what it's navigating. The loss landscape of a neural network is incredibly complex — a surface in potentially millions of dimensions, full of valleys, ridges, saddle points, and plateaus.
A common fear is getting stuck in local minima: valleys that aren't the lowest point overall. Interestingly, research has shown this is less of a problem than once thought. In high-dimensional spaces, most critical points are saddle points (like a horse's saddle — going up in some directions, down in others) rather than true local minima. SGD's stochasticity helps escape these.
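As a toy illustration of that point (a sketch added here, not part of the article's running example), consider f(x, y) = x² - y², which has a saddle at the origin. Plain gradient descent started exactly on the ridge y = 0 stalls at the saddle, while a small amount of gradient noise, standing in for SGD's stochasticity, nudges it off the ridge so it can keep descending.

import numpy as np

def saddle_gradient(p):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle point at (0, 0)."""
    x, y = p
    return np.array([2 * x, -2 * y])

def descend(p, learning_rate=0.1, steps=60, noise=0.0):
    p = np.array(p, dtype=float)
    for _ in range(steps):
        grad = saddle_gradient(p) + noise * np.random.randn(2)  # noisy gradient mimics SGD
        p = p - learning_rate * grad
    return p

np.random.seed(0)
print(descend([1.0, 0.0]))              # y stays exactly 0: stuck at the saddle
print(descend([1.0, 0.0], noise=0.01))  # noise pushes y off the ridge and descent continues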
The bigger practical concern is often flat regions where gradients are tiny. Training can stall for thousands of iterations before finding a direction that leads down. Momentum and Adam help here by maintaining velocity through flat regions.
[Figure: loss plotted against the parameters, showing a local minimum, a saddle point, and the global minimum.]
Practical Considerations
Gradient descent is elegant in theory but requires care in practice. The gradient assumes small steps — take too large a step and the linear approximation breaks down. Exploding gradients can send parameters to infinity; gradient clipping caps the maximum gradient magnitude to prevent this.
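Here is a minimal sketch of clipping by global norm; the threshold of 5.0 is just an example value.

import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale the gradient so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])           # norm 50: an "exploding" gradient
print(clip_by_norm(g, max_norm=5.0))  # rescaled to norm 5: [ 3. -4.]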
Vanishing gradients are the opposite problem: in deep networks, gradients can shrink exponentially as they backpropagate, leaving early layers unable to learn. Careful initialisation, normalisation techniques like batch norm, and architectures with skip connections all help mitigate this.
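As one concrete example of careful initialisation, here is a sketch of He initialisation, which scales the random weights by the layer's fan-in; the layer sizes below are arbitrary.

import numpy as np

def he_init(n_in, n_out):
    """Draw weights with variance 2 / n_in so signal magnitudes stay roughly
    constant through ReLU layers rather than vanishing or exploding."""
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

W = he_init(256, 128)
print(W.std())  # roughly sqrt(2 / 256) ≈ 0.088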
Regularisation adds terms to the loss that penalise large weights, preventing overfitting. This changes the loss landscape, typically making it smoother and easier to optimise.
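For example, adding an L2 penalty λ × Σw² to the mean squared error only changes the weight gradient by an extra 2λw term. The sketch below reuses the mini-batch gradient from the earlier SGD code; λ (written lam) is an illustrative hyperparameter.

import numpy as np

def regularised_grad_w(X_batch, error, w, lam=0.01):
    """Gradient of MSE + lam * sum(w**2) with respect to the weights."""
    mse_grad = (2 / len(X_batch)) * np.dot(X_batch.T, error)
    return mse_grad + 2 * lam * w  # the penalty term pulls every weight towards zero

# Inside the earlier training loop, the update would become:
# w -= learning_rate * regularised_grad_w(X_batch, error, w)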
Gradient descent is beautiful in its simplicity: follow the slope downward, adjust how big your steps are, and eventually you'll find a valley. Yet from this simple principle emerges the capacity to learn almost anything from data.
When you train a neural network, remember that beneath all the abstractions, parameters are simply sliding down a mathematical surface, one tiny step at a time. Millions of numbers, adjusting themselves to predict just a little better than before. That's the engine of machine learning, and understanding it gives you the power to build, debug, and improve it.