There's a saying in machine learning that data scientists spend 80% of their time on data preprocessing and 20% complaining about it. The percentages might be exaggerated, but the sentiment is accurate. Raw data is messy, inconsistent, and full of traps that can sabotage even the most sophisticated model.
I've lost count of how many hours I've spent debugging a model that turned out to have perfect architecture but was being fed garbage data. The model wasn't learning because the data wasn't learnable in the form I'd given it. Preprocessing transforms raw data into something a model can actually use. It's not glamorous work, but it's where projects succeed or fail.
This guide covers the essential preprocessing techniques that every data scientist needs to master. We'll work through each one with code you can actually use.
Understanding Your Data First
Before you preprocess anything, you need to understand what you're working with. Rushing into transformations without exploration is how you make mistakes that don't surface until months later.
Initial data exploration

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def explore_dataset(df):
    """Comprehensive initial exploration of a dataset."""
    print("=" * 50)
    print("DATASET OVERVIEW")
    print("=" * 50)
    print(f"\nShape: {df.shape[0]} rows, {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print("\nColumn types:")
    print(df.dtypes.value_counts())

    print("\n" + "=" * 50)
    print("MISSING VALUES")
    print("=" * 50)
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing': missing,
        'Percent': missing_pct
    })
    print(missing_df[missing_df['Missing'] > 0].sort_values('Percent', ascending=False))

    print("\n" + "=" * 50)
    print("NUMERIC COLUMNS SUMMARY")
    print("=" * 50)
    print(df.describe())

    print("\n" + "=" * 50)
    print("CATEGORICAL COLUMNS")
    print("=" * 50)
    categorical = df.select_dtypes(include=['object', 'category'])
    for col in categorical.columns:
        print(f"\n{col}: {df[col].nunique()} unique values")
        print(df[col].value_counts().head(5))

# Usage example with sample data
df = pd.read_csv('your_dataset.csv')
explore_dataset(df)
This exploration reveals the shape of your problem. How many missing values are you dealing with? What are the distributions of your numeric columns? Are there categorical variables with hundreds of unique values that might need special handling?
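If you want to act on that last question straight away, a quick follow-up check like the one below will list the high-cardinality offenders. The threshold of 50 is an arbitrary choice of mine, not a rule; tune it to your data.

# Flag categorical columns with many unique values; these may need
# frequency or target encoding rather than one-hot encoding later on.
high_cardinality = [
    col for col in df.select_dtypes(include=['object', 'category']).columns
    if df[col].nunique() > 50
]
print(high_cardinality)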
Handling Missing Values
Missing data is inevitable. Sensors fail, users skip form fields, records get corrupted. How you handle missing values can dramatically affect your model's performance.
The simplest approach is deletion: remove any row with missing values. This works when missing data is rare and random, but it's wasteful when you have limited data, and it can introduce bias if the missingness is systematic.
Strategies for missing values

from sklearn.impute import SimpleImputer, KNNImputer

def handle_missing_values(df, strategy='auto'):
    """
    Handle missing values with appropriate strategies per column.

    Args:
        df: DataFrame with potential missing values
        strategy: 'auto', 'drop', or 'knn'

    Returns:
        DataFrame with missing values handled
    """
    df_clean = df.copy()

    if strategy == 'drop':
        return df_clean.dropna()

    # Separate numeric and categorical columns
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    categorical_cols = df_clean.select_dtypes(include=['object', 'category']).columns

    if strategy == 'auto':
        # Numeric: use median (robust to outliers)
        for col in numeric_cols:
            if df_clean[col].isnull().sum() > 0:
                df_clean[col] = df_clean[col].fillna(df_clean[col].median())
        # Categorical: use mode
        for col in categorical_cols:
            if df_clean[col].isnull().sum() > 0:
                df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])
    elif strategy == 'knn':
        # KNN imputation for numeric columns (more sophisticated)
        imputer = KNNImputer(n_neighbors=5)
        df_clean[numeric_cols] = imputer.fit_transform(df_clean[numeric_cols])
        # Still use mode for categorical
        for col in categorical_cols:
            if df_clean[col].isnull().sum() > 0:
                df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])

    return df_clean

# Sometimes missing data is informative
def add_missing_indicators(df, columns=None):
    """Add binary columns indicating where data was missing."""
    df_new = df.copy()
    cols_to_check = columns if columns else df.columns
    for col in cols_to_check:
        if df[col].isnull().sum() > 0:
            df_new[f'{col}_was_missing'] = df[col].isnull().astype(int)
    return df_new
KNN imputation is particularly clever. It finds the k most similar rows (based on other features) and uses their values to fill in the missing one. This preserves relationships between features that simple mean imputation would destroy.
But sometimes missingness itself is informative. If a field for "annual income" is missing, that might indicate something about the person. Adding a binary indicator column captures this signal.
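Here's a rough sketch of how I'd combine the two helpers above: build the indicators first, then impute, so the flags are created before the gaps disappear. The 'annual_income' column name is purely illustrative.

import pandas as pd

# Flag missingness before filling it in (column name is illustrative)
df = pd.read_csv('your_dataset.csv')
df = add_missing_indicators(df, columns=['annual_income'])
df = handle_missing_values(df, strategy='auto')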
Scaling and Normalisation
Many machine learning algorithms are sensitive to the scale of features. If one feature ranges from 0 to 1 and another from 0 to 1,000,000, the algorithm might give disproportionate weight to the larger one simply because its numbers are bigger.
Feature scaling techniques

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def scale_features(train_df, test_df, method='standard', columns=None):
    """
    Scale numeric features using various methods.

    Args:
        train_df: Training data
        test_df: Test data
        method: 'standard', 'minmax', or 'robust'
        columns: Columns to scale (None = all numeric)

    Returns:
        Scaled train and test DataFrames, fitted scaler
    """
    scalers = {
        'standard': StandardScaler(),  # Mean=0, Std=1
        'minmax': MinMaxScaler(),      # Range [0, 1]
        'robust': RobustScaler()       # Uses median/IQR, robust to outliers
    }
    scaler = scalers[method]

    if columns is None:
        columns = train_df.select_dtypes(include=[np.number]).columns

    # Fit on training data only
    scaler.fit(train_df[columns])

    # Transform both sets
    train_scaled = train_df.copy()
    test_scaled = test_df.copy()
    train_scaled[columns] = scaler.transform(train_df[columns])
    test_scaled[columns] = scaler.transform(test_df[columns])

    return train_scaled, test_scaled, scaler

# Demonstration of the three methods
def compare_scaling_methods(data):
    """Visualise different scaling approaches."""
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    methods = [('Original', data)]
    for name, Scaler in [('Standard', StandardScaler),
                         ('MinMax', MinMaxScaler),
                         ('Robust', RobustScaler)]:
        scaled = Scaler().fit_transform(data.reshape(-1, 1))
        methods.append((name, scaled.flatten()))
    for ax, (name, values) in zip(axes, methods):
        ax.hist(values, bins=30, edgecolor='black')
        ax.set_title(f'{name}\nMin: {values.min():.2f}, Max: {values.max():.2f}')
    plt.tight_layout()
    plt.show()
The choice of scaler matters. StandardScaler assumes your data is roughly normally distributed. MinMaxScaler preserves zero entries (important for sparse data). RobustScaler uses the median and interquartile range, making it resistant to outliers that would skew the other methods.
One critical rule: always fit your scaler on training data only, then apply the same transformation to test data. If you fit on the entire dataset, you've leaked information from the test set into your training process, and your performance estimates will be optimistic.
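To make that rule concrete, here's a minimal sketch of the correct order using scikit-learn directly; df and numeric_features are placeholders for your own data.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then fit the scaler on the training portion only
X_train, X_test = train_test_split(df[numeric_features], test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the training rows
X_test_scaled = scaler.transform(X_test)        # reuses the training statistics

# Fitting on the full DataFrame before splitting would leak test-set
# statistics into training and inflate your performance estimates.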
Encoding Categorical Variables
Machine learning algorithms work with numbers, but much real-world data is categorical. Cities, product categories, days of the week: all of these need to be converted to numeric form.
Categorical encoding strategies

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import category_encoders as ce

def encode_categoricals(train_df, test_df, columns, method='onehot', target=None):
    """
    Encode categorical variables.

    Args:
        train_df: Training DataFrame
        test_df: Test DataFrame
        columns: Categorical columns to encode
        method: 'onehot', 'label', 'target', 'frequency'
        target: Target column (required for target encoding)

    Returns:
        Encoded train and test DataFrames
    """
    train_encoded = train_df.copy()
    test_encoded = test_df.copy()

    if method == 'onehot':
        # Creates binary columns for each category
        train_encoded = pd.get_dummies(train_encoded, columns=columns,
                                       drop_first=True)
        test_encoded = pd.get_dummies(test_encoded, columns=columns,
                                      drop_first=True)
        # Align columns (test might have categories not in train)
        train_cols = set(train_encoded.columns)
        test_cols = set(test_encoded.columns)
        for col in train_cols - test_cols:
            test_encoded[col] = 0
        for col in test_cols - train_cols:
            test_encoded.drop(col, axis=1, inplace=True)
        test_encoded = test_encoded[train_encoded.columns]
    elif method == 'label':
        # Simple integer encoding
        for col in columns:
            le = LabelEncoder()
            train_encoded[col] = le.fit_transform(train_encoded[col].astype(str))
            # Handle unseen categories in test
            test_values = test_encoded[col].astype(str)
            test_encoded[col] = test_values.apply(
                lambda x: le.transform([x])[0] if x in le.classes_ else -1
            )
    elif method == 'target':
        # Encode based on target mean (powerful but risk of leakage)
        encoder = ce.TargetEncoder(cols=columns)
        train_encoded = encoder.fit_transform(train_encoded, train_df[target])
        test_encoded = encoder.transform(test_encoded)
    elif method == 'frequency':
        # Encode based on frequency of each category
        for col in columns:
            freq = train_df[col].value_counts(normalize=True)
            train_encoded[col] = train_df[col].map(freq)
            test_encoded[col] = test_df[col].map(freq).fillna(0)

    return train_encoded, test_encoded
One-hot encoding is the safest default. It creates a binary column for each category, making no assumptions about ordering. The downside is explosion: a column with 1000 unique values becomes 999 new columns.
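One mitigation I often reach for (it's not built into the encoder above) is collapsing rare categories into a single 'Other' level before one-hot encoding. A minimal sketch, with an arbitrary frequency threshold:

import pandas as pd

def group_rare_categories(series, min_freq=0.01):
    """Replace categories rarer than min_freq with a single 'Other' label."""
    freq = series.value_counts(normalize=True)
    rare = freq[freq < min_freq].index
    return series.where(~series.isin(rare), 'Other')

# e.g. df['city'] = group_rare_categories(df['city'], min_freq=0.005)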
Target encoding replaces each category with the mean of the target variable for that category. It's powerful for high-cardinality features but dangerous: it can cause severe overfitting if not done carefully with cross-validation.
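One guard against that leakage is to compute the encoding out-of-fold, so no row is ever encoded using its own target value. A rough sketch of the idea, assuming a numeric (or binary) target column:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(train_df, col, target, n_splits=5):
    """Out-of-fold target encoding: each row is encoded using the other folds."""
    encoded = pd.Series(np.nan, index=train_df.index)
    global_mean = train_df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fit_idx, enc_idx in kf.split(train_df):
        # Category means computed on the fitting folds only
        fold_means = train_df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = train_df.iloc[enc_idx][col].map(fold_means).to_numpy()
    # Categories unseen in a fitting fold fall back to the global mean
    return encoded.fillna(global_mean)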
Detecting and Handling Outliers
Outliers can devastate your model's training. A single data entry error, a value of 999999 where someone meant 99, can shift your entire model's behaviour.
Outlier detection and treatment

def detect_outliers_iqr(df, columns=None, multiplier=1.5):
    """
    Detect outliers using the IQR method.

    Args:
        df: DataFrame to check
        columns: Columns to check (None = all numeric)
        multiplier: IQR multiplier (1.5 is standard, 3.0 for extreme)

    Returns:
        Dictionary of per-column outlier information
    """
    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns

    outlier_info = {}
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - multiplier * IQR
        upper_bound = Q3 + multiplier * IQR
        outliers = (df[col] < lower_bound) | (df[col] > upper_bound)
        outlier_info[col] = {
            'lower_bound': lower_bound,
            'upper_bound': upper_bound,
            'n_outliers': outliers.sum(),
            'pct_outliers': (outliers.sum() / len(df)) * 100,
            'outlier_mask': outliers
        }
    return outlier_info

def handle_outliers(df, columns, method='clip', multiplier=1.5):
    """
    Handle outliers using various methods.

    Args:
        df: DataFrame to process
        columns: Columns to process
        method: 'clip', 'remove', or 'transform'
        multiplier: IQR multiplier for bounds

    Returns:
        Processed DataFrame
    """
    df_clean = df.copy()
    for col in columns:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - multiplier * IQR
        upper = Q3 + multiplier * IQR

        if method == 'clip':
            # Cap values at bounds
            df_clean[col] = df_clean[col].clip(lower=lower, upper=upper)
        elif method == 'remove':
            # Remove outlier rows
            mask = (df_clean[col] >= lower) & (df_clean[col] <= upper)
            df_clean = df_clean[mask]
        elif method == 'transform':
            # Log transform (good for right-skewed data)
            df_clean[col] = np.log1p(df_clean[col])

    return df_clean
Clipping is conservative: outliers are capped at reasonable bounds rather than removed. This preserves all your data points while limiting extreme values' influence. Removal is more aggressive but loses information. Log transformation is elegant for naturally skewed distributions like income or population, but it requires all values to be positive.
Handling Skewed Distributions
Many real-world distributions are skewed. Income, for instance, has a long right tail: most people earn modest amounts, but a few earn enormously. Algorithms that assume normal distributions struggle with such data.
Transforming skewed distributions

from scipy import stats

def fix_skewness(df, columns=None, threshold=0.5):
    """
    Apply appropriate transformations to reduce skewness.

    Args:
        df: DataFrame to process
        columns: Columns to check (None = all numeric)
        threshold: Skewness threshold triggering transformation

    Returns:
        Transformed DataFrame, dictionary of applied transformations
    """
    df_transformed = df.copy()
    transformations = {}

    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns

    for col in columns:
        skewness = df_transformed[col].skew()
        if abs(skewness) > threshold:
            # Try log transform first (for positive values)
            if (df_transformed[col] > 0).all():
                log_skew = np.log1p(df_transformed[col]).skew()
                if abs(log_skew) < abs(skewness):
                    df_transformed[col] = np.log1p(df_transformed[col])
                    transformations[col] = 'log1p'
                    continue
            # Try Box-Cox (requires positive values)
            if (df_transformed[col] > 0).all():
                transformed, lambda_param = stats.boxcox(df_transformed[col])
                bc_skew = pd.Series(transformed).skew()
                if abs(bc_skew) < abs(skewness):
                    df_transformed[col] = transformed
                    transformations[col] = f'boxcox (lambda={lambda_param:.3f})'
                    continue
            # Try square root for moderate right skew
            if skewness > 0 and (df_transformed[col] >= 0).all():
                sqrt_skew = np.sqrt(df_transformed[col]).skew()
                if abs(sqrt_skew) < abs(skewness):
                    df_transformed[col] = np.sqrt(df_transformed[col])
                    transformations[col] = 'sqrt'

    print("Applied transformations:")
    for col, transform in transformations.items():
        print(f"  {col}: {transform}")

    return df_transformed, transformations
The Box-Cox transformation is particularly powerful because it automatically finds the optimal transformation parameter. A lambda of 0 corresponds to log transform, 0.5 to square root, 1 to no transformation. The algorithm searches this continuum for the value that best normalises your distribution.
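A quick way to see what that search returns is to run it on synthetic right-skewed data; the lognormal sample below is made up purely for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=1, size=1000)  # strictly positive, right-skewed

transformed, lam = stats.boxcox(income)
print(f"Optimal lambda: {lam:.3f}")  # a value near 0 behaves like a log transform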
A Complete Preprocessing Pipeline
Individual techniques are useful, but real preprocessing involves combining them into a coherent pipeline. Here's a template that ties everything together.
Complete preprocessing pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

def create_preprocessing_pipeline(numeric_features, categorical_features):
    """
    Create a scikit-learn preprocessing pipeline.

    Args:
        numeric_features: List of numeric column names
        categorical_features: List of categorical column names

    Returns:
        Configured ColumnTransformer
    """
    from sklearn.impute import SimpleImputer

    # Numeric preprocessing: impute, then scale
    numeric_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', RobustScaler())
    ])

    # Categorical preprocessing: impute, then encode
    # (sparse_output replaces the older sparse argument in scikit-learn >= 1.2)
    categorical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])

    # Combine into a single transformer
    preprocessor = ColumnTransformer([
        ('numeric', numeric_pipeline, numeric_features),
        ('categorical', categorical_pipeline, categorical_features)
    ])
    return preprocessor

# Usage
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['city', 'product_type', 'channel']

preprocessor = create_preprocessing_pipeline(numeric_features, categorical_features)

# Fit on training data
X_train_processed = preprocessor.fit_transform(X_train)

# Transform test data (using parameters from training)
X_test_processed = preprocessor.transform(X_test)
The scikit-learn Pipeline and ColumnTransformer classes are enormously useful. They ensure that each transformation is fit only on training data, they handle the bookkeeping of tracking which transformations apply to which columns, and they can be serialised for deployment.
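For that last point, joblib is the usual route. A minimal sketch, assuming the fitted preprocessor from above and an arbitrary file path:

import joblib

# Persist the fitted preprocessor alongside your model artefacts
joblib.dump(preprocessor, 'preprocessor.joblib')

# At inference time, load it and apply the exact same transformations
preprocessor = joblib.load('preprocessor.joblib')
X_new_processed = preprocessor.transform(X_new)  # X_new: incoming raw features (placeholder)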
The Philosophy of Preprocessing
I want to leave you with a broader perspective. Preprocessing isn't just about getting data into the right format for an algorithm. It's about understanding your data deeply enough to help your model learn.
Every preprocessing decision embeds assumptions. Using mean imputation assumes missing values are random. Using one-hot encoding assumes no ordinal relationship between categories. Using standard scaling assumes something like normality. These assumptions might be right or wrong for your specific problem.
The best practitioners treat preprocessing as a source of hypotheses to test. Does log-transforming this feature improve model performance? Does target encoding help or hurt generalisation? The answers vary by problem, and finding them requires experimentation.
Preprocessing is also where domain knowledge pays the biggest dividends. A data scientist who understands finance will preprocess financial data differently than someone who doesn't. They'll know which missing values are meaningful, which outliers are real, which transformations preserve important relationships.
Master these techniques, but more importantly, master the habit of looking deeply at your data before you transform it. The patterns you notice will guide your preprocessing choices, and those choices will shape everything that follows.