Reinforcement Learning Fundamentals

Teaching machines to learn from experience. How agents explore environments, receive rewards, and gradually discover optimal strategies through trial and error.

There's something uniquely satisfying about watching a reinforcement learning agent learn. Unlike supervised learning, where you show the model exactly what to do, reinforcement learning starts from ignorance. The agent stumbles around, makes mistakes, occasionally gets lucky, and gradually, through thousands of trials, discovers strategies that no human programmed.

I remember the first time I trained an agent to play a simple game. In the beginning, it moved randomly, achieving nothing. An hour later, it had figured out basic tactics. By the next morning, it was playing better than I could. Nobody told it how to play. It learned from pure experience, guided only by the reward signal telling it when it did well and when it did poorly.

This paradigm, learning from interaction rather than instruction, is fundamentally different from other machine learning approaches. And it's behind some of the most impressive AI achievements: AlphaGo defeating world champions, robots learning to walk, and agents mastering complex video games. Let's understand how it works.

The Reinforcement Learning Framework

Reinforcement learning models learning as an interaction between an agent and an environment. The agent observes the current state of the environment, takes an action, and receives two things in return: a new state and a reward signal.

The goal is to learn a policy, a strategy for choosing actions that maximises the total reward over time. Not just immediate reward, but cumulative reward across many steps. This is crucial: sometimes you need to sacrifice immediate gain for long-term benefit.
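
To make "cumulative reward" precise, RL usually works with the discounted return: the sum of future rewards, each weighted by a discount factor γ between 0 and 1 (γ reappears in the Q-learning update later). Here is a tiny illustrative sketch; the reward numbers are invented purely to show the idea.

Discounted return (illustrative)
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by how far in the future it arrives."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# A greedy path grabs +1 immediately and nothing afterwards;
# a patient path pays small step costs to reach a +10 goal.
greedy_path = [1.0, 0.0, 0.0, 0.0, 0.0]
patient_path = [-0.1, -0.1, -0.1, -0.1, 10.0]

print(discounted_return(greedy_path))   # 1.0
print(discounted_return(patient_path))  # about 9.2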

Basic RL interaction loop
import numpy as np

class Environment:
    """Base class for RL environments."""

    def reset(self):
        """Reset environment to initial state."""
        raise NotImplementedError

    def step(self, action):
        """Take action, return (new_state, reward, done)."""
        raise NotImplementedError


class Agent:
    """Base class for RL agents."""

    def choose_action(self, state):
        """Select action given current state."""
        raise NotImplementedError

    def learn(self, state, action, reward, next_state, done):
        """Update policy based on experience."""
        raise NotImplementedError


def training_loop(env, agent, episodes=1000):
    """Standard RL training loop."""
    rewards_history = []

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False

        while not done:
            # Agent chooses action
            action = agent.choose_action(state)

            # Environment responds
            next_state, reward, done = env.step(action)

            # Agent learns from experience
            agent.learn(state, action, reward, next_state, done)

            state = next_state
            total_reward += reward

        rewards_history.append(total_reward)

        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode + 1}, Avg Reward: {avg_reward:.2f}")

    return rewards_history

This loop is deceptively simple but captures the essence of reinforcement learning. The agent tries things, observes outcomes, and adjusts its behaviour. Over many episodes, it converges on effective strategies.

A Concrete Environment: GridWorld

To make this concrete, let's build a simple environment where an agent navigates a grid to reach a goal while avoiding obstacles.

GridWorld environment
class GridWorld(Environment):
    def __init__(self, size=5):
        """
        Simple grid world environment.

        Agent starts at (0, 0), goal at (size-1, size-1).
        Reward: -0.1 per step (encourages efficiency), +10 for reaching the goal.
        """
        self.size = size
        self.goal = (size - 1, size - 1)
        self.obstacles = {(1, 1), (2, 2), (3, 1)}  # Some walls
        self.actions = ['up', 'down', 'left', 'right']

    def reset(self):
        """Reset agent to starting position."""
        self.agent_pos = (0, 0)
        return self.agent_pos

    def step(self, action):
        """Take action, return new state and reward."""
        x, y = self.agent_pos

        # Compute new position
        if action == 'up':
            new_pos = (x, min(y + 1, self.size - 1))
        elif action == 'down':
            new_pos = (x, max(y - 1, 0))
        elif action == 'right':
            new_pos = (min(x + 1, self.size - 1), y)
        elif action == 'left':
            new_pos = (max(x - 1, 0), y)
        else:
            new_pos = (x, y)

        # Check for obstacles
        if new_pos not in self.obstacles:
            self.agent_pos = new_pos

        # Determine reward and done
        if self.agent_pos == self.goal:
            return self.agent_pos, 10.0, True  # Goal reached
        else:
            return self.agent_pos, -0.1, False  # Small penalty per step

    def render(self):
        """Visualise the grid."""
        grid = [['.' for _ in range(self.size)] for _ in range(self.size)]

        for ox, oy in self.obstacles:
            grid[self.size - 1 - oy][ox] = '#'

        gx, gy = self.goal
        grid[self.size - 1 - gy][gx] = 'G'

        ax, ay = self.agent_pos
        grid[self.size - 1 - ay][ax] = 'A'

        for row in grid:
            print(' '.join(row))
        print()

This environment has a state space of 25 grid positions (a 5x5 grid, three of which are blocked by obstacles), an action space of 4 moves, and a clear goal structure. The small negative step reward creates pressure to find efficient paths rather than wandering indefinitely.
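
Before adding a learning agent, it helps to poke at the environment by hand and watch the (state, reward, done) triples come back. The action sequence below is arbitrary, chosen only to exercise the API.

Trying the environment manually
env = GridWorld(size=5)
state = env.reset()
env.render()  # 'A' bottom-left, 'G' top-right, '#' for obstacles

# Note: the second 'up' runs into the obstacle at (1, 1),
# so the agent stays put and still pays the step penalty.
for action in ['right', 'up', 'right', 'up']:
    state, reward, done = env.step(action)
    print(action, '->', state, reward, done)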

Q-Learning: Learning Action Values

Q-learning is one of the most fundamental reinforcement learning algorithms. It learns a Q-function that estimates the value of taking each action in each state: the expected total future reward if you take that action and then follow the optimal policy thereafter.

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]

This update rule is the heart of Q-learning. α is the learning rate, γ is the discount factor (how much we value future vs immediate rewards), r is the immediate reward, and max Q(s', a') is the estimated value of the best action in the next state.

Q-learning agent
class QLearningAgent(Agent):
    def __init__(self, actions, learning_rate=0.1, discount=0.99, epsilon=0.1):
        """
        Q-Learning agent.

        Args:
            actions: List of possible actions
            learning_rate: How much to update Q-values (alpha)
            discount: Future reward discount factor (gamma)
            epsilon: Exploration rate for epsilon-greedy policy
        """
        self.actions = actions
        self.lr = learning_rate
        self.gamma = discount
        self.epsilon = epsilon

        # Q-table: maps (state, action) -> value
        self.q_table = {}

    def get_q_value(self, state, action):
        """Get Q-value for state-action pair."""
        return self.q_table.get((state, action), 0.0)

    def choose_action(self, state):
        """Choose action using epsilon-greedy policy."""
        if np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.choice(self.actions)
        else:
            # Exploit: best known action
            q_values = [self.get_q_value(state, a) for a in self.actions]
            max_q = max(q_values)

            # Break ties randomly
            best_actions = [a for a, q in zip(self.actions, q_values)
                          if q == max_q]
            return np.random.choice(best_actions)

    def learn(self, state, action, reward, next_state, done):
        """Update Q-value based on experience."""
        current_q = self.get_q_value(state, action)

        if done:
            target = reward
        else:
            # Maximum Q-value for next state
            next_q_values = [self.get_q_value(next_state, a)
                            for a in self.actions]
            target = reward + self.gamma * max(next_q_values)

        # Q-learning update
        self.q_table[(state, action)] = current_q + self.lr * (target - current_q)

    def get_policy(self):
        """Extract the learned policy."""
        policy = {}
        states = set(s for (s, a) in self.q_table.keys())

        for state in states:
            q_values = {a: self.get_q_value(state, a) for a in self.actions}
            policy[state] = max(q_values, key=q_values.get)

        return policy

Let's train this agent on our GridWorld.

Training the agent
# Create environment and agent
env = GridWorld(size=5)
agent = QLearningAgent(
    actions=env.actions,
    learning_rate=0.1,
    discount=0.99,
    epsilon=0.1
)

# Train
rewards = training_loop(env, agent, episodes=1000)

# Show learned policy
print("\nLearned Policy:")
policy = agent.get_policy()
for y in range(env.size - 1, -1, -1):
    row = []
    for x in range(env.size):
        if (x, y) in env.obstacles:
            row.append('#')
        elif (x, y) == env.goal:
            row.append('G')
        elif (x, y) in policy:
            arrows = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
            row.append(arrows[policy[(x, y)]])
        else:
            row.append('?')
    print(' '.join(row))

After training, the agent has learned to navigate around obstacles and reach the goal efficiently. Each cell shows the best action to take from that position. The agent discovered this strategy purely through trial and error, without any prior knowledge of the environment's structure.
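
A quick way to check this, assuming the env and agent from the training snippet are still in scope, is to switch off exploration and run a single greedy episode.

Evaluating the learned policy
# Run one greedy episode (no exploration) and count the steps taken
agent.epsilon = 0.0
state = env.reset()
steps, done = 0, False
while not done and steps < 50:
    state, reward, done = env.step(agent.choose_action(state))
    steps += 1
print(f"Reached goal: {done}, in {steps} steps")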

The Exploration-Exploitation Dilemma

One of the core challenges in reinforcement learning is balancing exploration and exploitation. Should the agent try new actions to discover potentially better strategies (explore)? Or should it stick with actions it already knows work well (exploit)?

Pure exploitation gets stuck on suboptimal strategies: if the first action you tried worked okay, you'll never discover the better alternatives. Pure exploration never capitalises on what you've learned.

The epsilon-greedy strategy we used is simple: with probability epsilon, take a random action; otherwise, take the best known action. This ensures some exploration while mostly exploiting current knowledge.

Exploration strategies
class ExplorationStrategies:
    """Different approaches to exploration."""

    @staticmethod
    def epsilon_greedy(q_values, epsilon=0.1):
        """Simple epsilon-greedy."""
        if np.random.random() < epsilon:
            return np.random.randint(len(q_values))
        return np.argmax(q_values)

    @staticmethod
    def epsilon_decay(q_values, episode, initial_eps=1.0, min_eps=0.01, decay=0.995):
        """Epsilon that decays over time."""
        epsilon = max(min_eps, initial_eps * (decay ** episode))
        if np.random.random() < epsilon:
            return np.random.randint(len(q_values))
        return np.argmax(q_values)

    @staticmethod
    def softmax(q_values, temperature=1.0):
        """
        Softmax exploration: action probabilities based on Q-values.

        Higher temperature = more random.
        Lower temperature = more greedy.
        """
        q_values = np.asarray(q_values, dtype=float)
        # Subtract the max before exponentiating for numerical stability
        exp_q = np.exp((q_values - q_values.max()) / temperature)
        probs = exp_q / np.sum(exp_q)
        return np.random.choice(len(q_values), p=probs)

    @staticmethod
    def ucb(q_values, action_counts, t, c=2.0):
        """
        Upper Confidence Bound: balances value and uncertainty.

        Actions taken less often get an exploration bonus.
        """
        q_values = np.asarray(q_values, dtype=float)
        action_counts = np.asarray(action_counts, dtype=float)
        # The +1 terms avoid log(0) and division by zero for untried actions
        ucb_values = q_values + c * np.sqrt(np.log(t + 1) / (action_counts + 1))
        return np.argmax(ucb_values)

Each strategy has trade-offs. Epsilon-decay explores a lot early, when we know nothing, then exploits more as we learn. Softmax gives every action a non-zero probability, weighted exponentially by its estimated value (with the temperature controlling how sharply). UCB (Upper Confidence Bound) explicitly tracks how often each action has been tried and gives rarely-tried actions an exploration bonus, balancing estimated value against uncertainty.
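
One simple way to combine these ideas with the tabular agent from earlier is to decay its epsilon attribute from the training loop rather than inside the agent. A minimal sketch; the decay rate and floor are arbitrary choices, not tuned values.

Decaying exploration during training
env = GridWorld(size=5)
agent = QLearningAgent(actions=env.actions, epsilon=1.0)  # start fully exploratory

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        state = next_state

    # Decay towards a small floor so some exploration always remains
    agent.epsilon = max(0.01, agent.epsilon * 0.995)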

Deep Q-Networks: Scaling Up

Q-learning with tables works for small state spaces, but what about environments with millions of states, or continuous state spaces? We can't maintain a table entry for every possible state.

Deep Q-Networks (DQN) solve this by using a neural network to approximate the Q-function. Instead of storing Q(s, a) for each state-action pair, we train a network to predict Q-values given any state.

Deep Q-Network
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import random

class DQN(nn.Module):
    """Neural network for Q-value approximation."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, state):
        return self.network(state)


class ReplayBuffer:
    """Experience replay buffer for stable training."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))

    def __len__(self):
        return len(self.buffer)


class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=0.001, gamma=0.99,
                 epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = gamma

        # Exploration parameters
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay

        # Neural networks
        self.policy_net = DQN(state_dim, action_dim)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.policy_net.state_dict())

        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=lr)
        self.buffer = ReplayBuffer()

    def choose_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.action_dim)

        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():
            q_values = self.policy_net(state_tensor)
        return q_values.argmax().item()

    def learn(self, batch_size=32):
        if len(self.buffer) < batch_size:
            return

        # Sample batch from replay buffer
        states, actions, rewards, next_states, dones = self.buffer.sample(batch_size)

        # Convert to tensors
        states = torch.FloatTensor(states)
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(next_states)
        dones = torch.FloatTensor(dones)

        # Current Q-values
        current_q = self.policy_net(states).gather(1, actions.unsqueeze(1))

        # Target Q-values
        with torch.no_grad():
            next_q = self.target_net(next_states).max(1)[0]
            target_q = rewards + self.gamma * next_q * (1 - dones)

        # Loss and update
        loss = nn.MSELoss()(current_q.squeeze(), target_q)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Decay epsilon
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

    def update_target_network(self):
        """Copy weights from policy network to target network."""
        self.target_net.load_state_dict(self.policy_net.state_dict())

DQN introduces two crucial innovations. Experience replay stores past experiences and samples from them randomly during training, breaking correlations between consecutive updates. The target network provides stable targets for the Q-learning updates, preventing the moving-target problem that destabilises naive neural Q-learning.
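
The classes above aren't wired together yet, and the DQN agent learns from sampled batches rather than single transitions, so its loop differs slightly from the earlier training_loop. Here's a rough sketch, assuming an environment that follows the same Environment interface as before but returns numeric (array-valued) states; the environment itself is a placeholder, not something defined in this article.

DQN training loop (sketch)
def train_dqn(env, agent, episodes=500, batch_size=32, target_update_every=10):
    """Assumes env.step returns (next_state, reward, done) with array-valued states."""
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        done = False

        while not done:
            action = agent.choose_action(state)
            next_state, reward, done = env.step(action)

            # Store the transition, then learn from a random batch of past ones
            agent.buffer.push(state, action, reward, next_state, done)
            agent.learn(batch_size)

            state = next_state
            total_reward += reward

        # Periodically refresh the target network to keep learning targets stable
        if (episode + 1) % target_update_every == 0:
            agent.update_target_network()

        if (episode + 1) % 50 == 0:
            print(f"Episode {episode + 1}, reward: {total_reward:.1f}, epsilon: {agent.epsilon:.2f}")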

Policy Gradient Methods

Q-learning learns a value function and derives a policy from it. Policy gradient methods take a different approach: directly optimise the policy without explicitly computing values.

The key insight is that we can estimate the gradient of expected reward with respect to policy parameters, then follow that gradient uphill.
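
In the same plain notation as the Q-learning update, the REINFORCE estimate of that gradient is:

∇ J(θ) ≈ Σ_t ∇ log π_θ(a_t | s_t) · G_t

Here π_θ(a_t | s_t) is the probability the policy assigned to the action actually taken at step t, and G_t is the discounted return collected from that step onward. The loss in the code below is the negative of this quantity, since optimisers minimise rather than maximise.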

Simple policy gradient (REINFORCE)
class PolicyNetwork(nn.Module):
    """Neural network that outputs action probabilities."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.network(state)


class REINFORCEAgent:
    """Policy gradient agent using REINFORCE algorithm."""

    def __init__(self, state_dim, action_dim, lr=0.01, gamma=0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)

        # Episode memory
        self.log_probs = []
        self.rewards = []

    def choose_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        probs = self.policy(state_tensor)

        # Sample action from probability distribution
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()

        # Store log probability for learning
        self.log_probs.append(dist.log_prob(action))

        return action.item()

    def store_reward(self, reward):
        self.rewards.append(reward)

    def learn(self):
        """Update policy after episode ends."""
        # Calculate discounted returns
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        # Policy gradient loss
        loss = 0
        for log_prob, G in zip(self.log_probs, returns):
            loss -= log_prob * G  # Negative because we maximize

        # Update
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Clear episode memory
        self.log_probs = []
        self.rewards = []

REINFORCE is elegant in its simplicity. After each episode, it increases the probability of actions that led to high returns and decreases the probability of actions that led to low returns. The policy improves without ever computing explicit Q-values.
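
Because REINFORCE updates only once per episode, it doesn't slot into the earlier per-step training_loop directly. Here is a minimal sketch of its episode loop, again assuming a placeholder environment whose states are numeric vectors and whose actions are integer indices.

REINFORCE training loop (sketch)
def train_reinforce(env, agent, episodes=1000):
    """Assumes env.step returns (next_state, reward, done) with array-valued states."""
    for episode in range(episodes):
        state = env.reset()
        done = False

        # Collect a full episode before updating
        while not done:
            action = agent.choose_action(state)
            state, reward, done = env.step(action)
            agent.store_reward(reward)

        # One policy update from the whole episode's returns
        agent.learn()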

The Power and Promise of RL

What makes reinforcement learning special is its generality. The same algorithms that learn to navigate grid worlds can learn to play Atari games, control robots, manage power grids, or optimise data centre cooling. The agent doesn't need to know what problem it's solving. It just needs an environment that provides states and rewards.

This generality comes with challenges. RL is notoriously sample-inefficient: agents often need millions of experiences to learn what a human could learn from a few examples. Reward design is tricky: poorly specified rewards lead to unexpected and sometimes amusing behaviours. And stability remains an issue: training can be finicky, with performance sometimes collapsing after seeming to converge.

But the potential is immense. We're teaching machines to learn from interaction, to discover strategies through experience, to improve continuously as they encounter new situations. This is closer to how biological intelligence works than any other machine learning paradigm.

The fundamentals we've covered here (the RL framework, Q-learning, exploration strategies, deep RL, policy gradients) form the foundation for understanding more advanced methods. From here, you can explore actor-critic methods that combine value and policy learning, model-based RL that learns to predict environment dynamics, or multi-agent settings where multiple learners interact.

The field is moving fast. But these fundamentals will serve you well no matter where it goes next.
