There's something genuinely thrilling about the moment a machine learning model makes its first correct prediction. You've fed it data, watched it learn patterns you couldn't have programmed explicitly, and now it's making decisions on information it's never seen before. It feels like teaching something to think.
But getting to that moment can feel overwhelming. The machine learning ecosystem is vast, the terminology is dense, and every tutorial seems to assume you already know something you don't. This guide is different. We're going to build a complete, working machine learning model from scratch, and I'll explain every step along the way.
We'll work with a classic problem: predicting whether a passenger survived the Titanic disaster based on their characteristics. It's a morbid dataset, perhaps, but it's rich enough to teach real techniques while being small enough to understand completely.
Setting Up Your Environment
Before we write any code, let's make sure we have the right tools. You'll need Python and a few key libraries. If you're working in a Jupyter notebook or Google Colab, most of these come pre-installed.
Installing dependencies
# Install the core libraries we'll need
pip install pandas numpy scikit-learn matplotlib seaborn
Each library serves a specific purpose in our pipeline. Pandas handles data manipulation, NumPy provides efficient numerical operations, scikit-learn gives us machine learning algorithms, and matplotlib with seaborn help us visualise what's happening at each step.
Importing libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Set display options for cleaner output
pd.set_option('display.max_columns', None)
np.random.seed(42) # For reproducibility
Setting a random seed might seem like a minor detail, but it's crucial for reproducibility. Machine learning involves randomness at many stages, and without a fixed seed, you'll get slightly different results every time you run your code. When debugging or comparing approaches, consistent results are essential.
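To make that concrete, here is a small, self-contained sketch using made-up toy data. It shows the two places randomness gets pinned down: NumPy's global generator via np.random.seed, and scikit-learn estimators via their own random_state argument, which is the more reliable of the two for model code.
A quick reproducibility check
# Toy data - the values themselves don't matter here
X_toy = np.random.rand(20, 3)
y_toy = (X_toy[:, 0] > 0.5).astype(int)
# Two forests built with the same random_state learn identical models
clf_a = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)
clf_b = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_toy, y_toy)
print((clf_a.predict(X_toy) == clf_b.predict(X_toy)).all())  # True: identical predictions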
Loading and Understanding the Data
The Titanic dataset is available from many sources, including Kaggle. For this tutorial, we'll load it directly from a URL so you can follow along without downloading anything.
Loading the dataset
# Load the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
# First look at our data
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()
When I first load any dataset, I want to understand its shape and structure. The Titanic dataset has 891 rows and 12 columns. Each row represents a passenger, and each column represents a characteristic: their class, name, sex, age, how many siblings or spouses and how many parents or children they had aboard, their fare, and, crucially, whether they survived.
But numbers alone don't tell the story. Let's look at what we're actually working with.
Exploratory data analysis
# Get a statistical summary
df.describe()
# Check for missing values - this is critical
print("Missing values per column:")
print(df.isnull().sum())
# Look at our target variable distribution
print(f"\nSurvival distribution:")
print(df['Survived'].value_counts(normalize=True))
This reveals something important: about 38% of passengers survived, 62% did not. This imbalance matters because a model could achieve 62% accuracy by simply predicting "did not survive" for everyone. That's our baseline to beat.
We also discover that the Age column has 177 missing values, Cabin has 687, and Embarked has two. These gaps need addressing before we can train our model.
The Art of Data Cleaning
Real-world data is messy. Missing values, inconsistent formats, outliers that don't make sense. Data cleaning isn't the glamorous part of machine learning, but it's where you'll spend most of your time, and it's where the real skill lies.
Handling missing values
# Create a copy to preserve the original
df_clean = df.copy()
# Age: fill with the median (more robust to outliers than the mean)
median_age = df_clean['Age'].median()
df_clean['Age'] = df_clean['Age'].fillna(median_age)
# Embarked: fill with the mode (most common value)
mode_embarked = df_clean['Embarked'].mode()[0]
df_clean['Embarked'] = df_clean['Embarked'].fillna(mode_embarked)
# Cabin: too many missing values, let's create a binary feature instead
df_clean['HasCabin'] = df_clean['Cabin'].notna().astype(int)
# Verify no missing values in columns we'll use
print("Missing values after cleaning:")
print(df_clean[['Age', 'Embarked', 'HasCabin']].isnull().sum())
Notice the thinking here. For Age, we use the median rather than the mean because age distributions often have outliers, and the median is more robust to these. For Cabin, instead of trying to impute 687 missing values, we transform the problem: maybe having a cabin recorded at all is informative, perhaps indicating higher-class passengers.
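To see why the median is the safer default, consider a tiny made-up example: a single elderly passenger is enough to drag the mean well above the typical age, while the median barely moves.
Median versus mean on a toy sample
ages = pd.Series([22, 25, 28, 30, 80])
print(f"Mean: {ages.mean():.1f}")      # 37.0 - pulled upward by the single 80-year-old
print(f"Median: {ages.median():.1f}")  # 28.0 - unaffected by the outlier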
This kind of domain reasoning, asking "what might this missing data actually mean?", is what separates good data scientists from those who just run algorithms blindly.
Feature Engineering
Feature engineering is where creativity meets data science. We're not just cleaning data now; we're creating new information that might help our model learn patterns more effectively.
Creating new features
# Family size: siblings/spouses + parents/children + self
df_clean['FamilySize'] = df_clean['SibSp'] + df_clean['Parch'] + 1
# Is the person traveling alone?
df_clean['IsAlone'] = (df_clean['FamilySize'] == 1).astype(int)
# Extract title from name - this encodes social status
df_clean['Title'] = df_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Simplify rare titles
title_mapping = {
'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
'Dr': 'Rare', 'Rev': 'Rare', 'Col': 'Rare', 'Major': 'Rare',
'Mlle': 'Miss', 'Countess': 'Rare', 'Ms': 'Miss',
'Lady': 'Rare', 'Jonkheer': 'Rare', 'Don': 'Rare',
'Dona': 'Rare', 'Mme': 'Mrs', 'Capt': 'Rare', 'Sir': 'Rare'
}
df_clean['Title'] = df_clean['Title'].map(title_mapping)
df_clean['Title'] = df_clean['Title'].fillna('Rare')
# Age groups - sometimes categories work better than continuous values
df_clean['AgeGroup'] = pd.cut(df_clean['Age'],
bins=[0, 12, 20, 40, 60, 100],
labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])
print("New features created:")
print(df_clean[['FamilySize', 'IsAlone', 'Title', 'AgeGroup']].head(10))
The title extraction is particularly clever. Hidden in each passenger's name is their social title, which encodes information about their gender, age, and social class all at once. "Master" was used for young boys. "Miss" and "Mrs" distinguish unmarried from married women. Titles like "Dr" or "Sir" indicate status.
By extracting this information, we're giving our model richer data to work with than the original columns alone could provide.
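If you want to check what the mapping above is actually collapsing, one option is to re-extract the raw titles from the Name column and count them; the infrequent ones are exactly the entries we fold into 'Rare'.
Inspecting the raw titles
# Re-extract directly from Name, since df_clean['Title'] has already been simplified
raw_titles = df_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(raw_titles.value_counts())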
Preparing Features for Training
Machine learning algorithms work with numbers. Our categorical features like "Sex" and "Title" need to be converted into numerical form. This process is called encoding.
Encoding categorical variables
# Select features for our model
features = ['Pclass', 'Sex', 'Age', 'Fare', 'Embarked',
'FamilySize', 'IsAlone', 'HasCabin', 'Title']
# Create a fresh dataframe with only what we need
df_model = df_clean[features + ['Survived']].copy()
# Convert categorical columns to numeric using one-hot encoding
df_model = pd.get_dummies(df_model, columns=['Sex', 'Embarked', 'Title'],
drop_first=True)
print(f"Final feature set: {df_model.shape[1] - 1} features")
print("Columns:", list(df_model.columns))
One-hot encoding transforms categorical variables into binary columns. "Sex" becomes "Sex_male" (1 if male, 0 if female). "Embarked" splits into "Embarked_Q" and "Embarked_S", with Cherbourg (C) as the implicit reference category.
The drop_first=True parameter prevents a subtle issue called the "dummy variable trap", where perfectly correlated columns can confuse certain algorithms.
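A toy example with made-up port values shows what drop_first actually removes: with all three port columns present, any one of them is fully determined by the other two, which is the redundancy the "dummy variable trap" refers to.
The dummy variable trap in miniature
ports = pd.DataFrame({'Embarked': ['C', 'Q', 'S', 'C']})
print(pd.get_dummies(ports))                   # three columns, one of them redundant
print(pd.get_dummies(ports, drop_first=True))  # 'C' becomes the implicit reference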
The Train-Test Split
Here's a fundamental principle of machine learning: you must evaluate your model on data it has never seen during training. This simulates how the model will perform in the real world, where it encounters new, unfamiliar data.
Splitting the data
# Separate features (X) from target (y)
X = df_model.drop('Survived', axis=1)
y = df_model['Survived']
# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
# Verify the split preserved class distribution
print(f"\nSurvival rate in training: {y_train.mean():.3f}")
print(f"Survival rate in test: {y_test.mean():.3f}")
The stratify=y parameter is important. It ensures that both the training and test sets have the same proportion of survivors and non-survivors as the original dataset. Without stratification, you might accidentally create a test set that's mostly survivors, which would give you misleading performance metrics.
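Here is a deliberately exaggerated, hypothetical example of what stratification protects against: on a tiny, heavily imbalanced dataset, an unstratified split can land on a test set whose class balance drifts noticeably from the original, while the stratified split holds it fixed.
Stratified versus unstratified splits on toy data
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 90 + [1] * 10)  # only 10% positives
_, _, _, y_plain = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)
_, _, _, y_strat = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y)
print(f"Positive rate without stratify: {y_plain.mean():.2f}")
print(f"Positive rate with stratify:    {y_strat.mean():.2f}")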
Scaling Features
Some algorithms are sensitive to the scale of their input features. Age ranges from 0 to 80, while Fare runs from 0 to just over 500. Without scaling, a scale-sensitive algorithm might give more weight to Fare simply because its numbers are bigger. (Tree-based models like the Random Forest we use below are largely indifferent to scale, but scaling does no harm and keeps the pipeline reusable with algorithms that do care.)
Feature scaling
# Initialize the scaler
scaler = StandardScaler()
# Fit on training data only, then transform both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)
print("Scaled feature statistics:")
print(X_train_scaled.describe().round(2))
Notice that we fit the scaler on the training data only, then apply the same transformation to the test data. This prevents "data leakage", where information from the test set influences the training process. Even something as innocent as calculating the mean and standard deviation across all data can subtly inflate your results.
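If you'd rather not enforce that rule by hand, scikit-learn's Pipeline can do it for you: every step is fitted only on the data you pass to fit, so leakage from the test set becomes structurally impossible. A minimal sketch follows (it previews the Random Forest we train in the next section).
A leak-free pipeline sketch
from sklearn.pipeline import Pipeline
leak_free = Pipeline([
    ('scale', StandardScaler()),
    ('forest', RandomForestClassifier(n_estimators=100, random_state=42)),
])
leak_free.fit(X_train, y_train)  # the scaler inside is fitted on X_train only
print(f"Pipeline test accuracy: {leak_free.score(X_test, y_test):.4f}")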
Training the Model
We're finally ready to train. We'll use a Random Forest classifier, which builds multiple decision trees and combines their predictions. It's a robust algorithm that works well out of the box for many problems.
Training a Random Forest
# Initialize the model with sensible defaults
model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=10, # Prevent overfitting
min_samples_split=5, # Minimum samples to split a node
random_state=42
)
# Train the model
model.fit(X_train_scaled, y_train)
print("Model trained successfully!")
print(f"Number of trees: {model.n_estimators}")
print(f"Features used: {model.n_features_in_}")
The fit method is where all the learning happens. The algorithm examines the training data, builds decision trees that capture patterns in the features, and stores these patterns internally. This process takes a few seconds at most for a dataset this size.
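Before touching the held-out test set, an optional sanity check (cheap, and not strictly required) is to cross-validate on the training data. It gives a feel for how stable the model's performance is across different slices of the data.
A quick cross-validation check
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")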
Evaluating Performance
Training is done, but how good is our model? This is where the test set we held back becomes essential.
Making predictions and evaluating
# Make predictions on the test set
y_pred = model.predict(X_test_scaled)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Baseline (always predict death): 0.6200")
print(f"Improvement over baseline: {((accuracy - 0.62) / 0.62) * 100:.1f}%")
# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
target_names=['Did not survive', 'Survived']))
On this dataset, you should see accuracy around 82-84%, a solid improvement over the roughly 62% baseline. The classification report breaks this down further, showing precision and recall for each class.
Precision tells you: of all the passengers the model predicted would survive, how many actually did? Recall tells you: of all the passengers who actually survived, how many did the model correctly identify?
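If those definitions feel abstract, a confusion matrix makes them concrete. It's a small addition to the evaluation above: rows are the true classes, columns the predicted ones, and precision and recall are just ratios of its cells.
Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(pd.DataFrame(cm,
                   index=['Actual: did not survive', 'Actual: survived'],
                   columns=['Predicted: did not survive', 'Predicted: survived']))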
Understanding What the Model Learned
One advantage of Random Forests is interpretability. We can examine which features the model found most useful for making predictions.
Feature importance analysis
# Get feature importances
importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("Feature Importances:")
print(importance_df.to_string(index=False))
# Visualize
plt.figure(figsize=(10, 6))
plt.barh(importance_df['feature'], importance_df['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance in Survival Prediction')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
The results align with historical knowledge about the Titanic. Being male dramatically decreased survival chances, reflecting the "women and children first" protocol. Passenger class mattered because first-class passengers had cabins closer to the lifeboats. Age played a role, with children being prioritised for rescue.
This interpretability is valuable. When your model's reasoning aligns with domain knowledge, you can trust it more. When it doesn't, you've either discovered something interesting or found a bug.
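One caveat worth knowing: impurity-based importances like these can overstate features with many distinct values. A common cross-check, and an extra step beyond this walkthrough, is permutation importance, which measures how much test accuracy drops when each feature is shuffled.
Permutation importance as a cross-check
from sklearn.inspection import permutation_importance
perm = permutation_importance(model, X_test_scaled, y_test, n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    'feature': X_test_scaled.columns,
    'importance': perm.importances_mean
}).sort_values('importance', ascending=False)
print(perm_df.to_string(index=False))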
Making Predictions on New Data
Let's see how to use our trained model to predict survival for a hypothetical new passenger.
Predicting for new passengers
# Create a hypothetical passenger
new_passenger = {
'Pclass': 1, # First class
'Age': 28, # 28 years old
'Fare': 100, # Expensive ticket
'FamilySize': 2, # Traveling with one other
'IsAlone': 0, # Not alone
'HasCabin': 1, # Has a cabin
'Sex_male': 0, # Female
'Embarked_Q': 0, # Not Queenstown
'Embarked_S': 0, # Not Southampton (so Cherbourg, the reference category)
'Title_Miss': 1, # Miss
'Title_Mr': 0,
'Title_Mrs': 0,
'Title_Rare': 0
}
# Convert to DataFrame and scale (keeping column names so scikit-learn
# sees the same feature names it was trained with)
new_df = pd.DataFrame([new_passenger])
new_scaled = pd.DataFrame(scaler.transform(new_df), columns=new_df.columns)
# Predict
prediction = model.predict(new_scaled)[0]
probability = model.predict_proba(new_scaled)[0]
print(f"Prediction: {'Survived' if prediction == 1 else 'Did not survive'}")
print(f"Confidence: {max(probability):.1%}")
print(f"Probability distribution: Death={probability[0]:.1%}, Survival={probability[1]:.1%}")
For this hypothetical first-class woman, the model predicts survival with high confidence. Change "Sex_male" to 1 and "Title_Miss" to 0, "Title_Mr" to 1, and watch the prediction flip. The model has learned the historical patterns.
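To see that flip without editing the dictionary by hand, here is the same passenger re-scored as a first-class man; only the sex- and title-related columns change.
Re-scoring the same passenger as a man
male_passenger = dict(new_passenger, Sex_male=1, Title_Miss=0, Title_Mr=1)
male_df = pd.DataFrame([male_passenger])
male_scaled = pd.DataFrame(scaler.transform(male_df), columns=male_df.columns)
male_probability = model.predict_proba(male_scaled)[0]
print(f"Survival probability as a man: {male_probability[1]:.1%}")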
Where to Go from Here
We've built a complete machine learning pipeline: loading data, cleaning it, engineering features, training a model, evaluating its performance, and making predictions. This same pattern applies whether you're classifying emails as spam, predicting house prices, or detecting fraudulent transactions.
There's always more to learn. You could experiment with different algorithms, perhaps a gradient boosting model like XGBoost. You could tune hyperparameters more systematically using cross-validation and grid search. You could try more sophisticated feature engineering, perhaps extracting information from the Cabin codes or creating interaction features.
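As a taste of that systematic tuning, here is a minimal grid search sketch; the parameter grid is illustrative rather than a recommendation, and it runs quickly on a dataset this small.
A minimal grid search sketch
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
search.fit(X_train_scaled, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")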
But the fundamentals you've learned here will serve you well regardless of how deep you go. Understanding your data matters more than choosing the fanciest algorithm. Clean, well-engineered features often matter more than complex models. And always, always evaluate on data your model hasn't seen.
The Titanic dataset taught you something real. Now go find a problem you actually care about, and build something that matters.