The Art of Feature Engineering

Where domain knowledge meets machine learning. The creative process of transforming raw data into features that unlock model performance.

A senior data scientist once told me that the difference between a good model and a great one rarely comes down to the algorithm. It comes down to the features. I didn't fully understand this until I spent weeks wrestling with a problem where a simple model with carefully crafted features outperformed a complex neural network fed raw data.

Feature engineering is the process of using domain knowledge to create input features that make machine learning algorithms work better. It's part science, part art, and often the highest-leverage activity in a data science project. A single well-designed feature can be worth more than hours of hyperparameter tuning.

This guide covers the core techniques of feature engineering with practical code examples. But more importantly, it teaches the mindset: how to think about your data in ways that help models learn.

The Philosophy of Features

Every feature answers a question about your data. When predicting whether a customer will churn, "days since last login" answers "how engaged are they recently?" while "total purchases" answers "how invested are they overall?" The art lies in finding the right questions.

Good features have several properties. They're predictive: they actually correlate with what you're trying to predict. They're independent: they capture different aspects of the data rather than duplicating information. They're interpretable: you can explain what they represent. And they're robust: they work across different data samples, not just the one you trained on.

Let's start with a dataset and build features systematically.

Setting up our example dataset
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Simulated e-commerce transaction data
np.random.seed(42)
n_customers = 1000

# Generate sample data
data = {
    'customer_id': range(n_customers),
    'signup_date': [datetime(2023, 1, 1) + timedelta(days=np.random.randint(0, 365))
                    for _ in range(n_customers)],
    'last_purchase_date': [datetime(2024, 1, 1) - timedelta(days=np.random.randint(0, 180))
                           for _ in range(n_customers)],
    'total_purchases': np.random.exponential(5, n_customers).astype(int) + 1,
    'total_spend': np.random.exponential(500, n_customers).round(2),
    'returns': np.random.poisson(1, n_customers),
    'support_tickets': np.random.poisson(0.5, n_customers),
    'email_opens': np.random.randint(0, 50, n_customers),
    'emails_sent': np.random.randint(10, 60, n_customers),
    'age': np.random.normal(35, 12, n_customers).clip(18, 80).astype(int),
    'country': np.random.choice(['UK', 'US', 'DE', 'FR'], n_customers),
}

df = pd.DataFrame(data)
print(df.head())

Temporal Features

Time is one of the richest sources of features. When something happened can matter as much as what happened. Customer behaviour yesterday predicts behaviour tomorrow better than behaviour from a year ago.

Engineering temporal features
def engineer_temporal_features(df, reference_date=None):
    """Create features from date columns."""
    if reference_date is None:
        reference_date = datetime(2024, 1, 1)

    df = df.copy()

    # Days since signup (customer tenure)
    df['tenure_days'] = (reference_date - df['signup_date']).dt.days

    # Days since last purchase (recency)
    df['days_since_purchase'] = (reference_date - df['last_purchase_date']).dt.days

    # Signup month and day of week (seasonal patterns)
    df['signup_month'] = df['signup_date'].dt.month
    df['signup_dayofweek'] = df['signup_date'].dt.dayofweek

    # Is the customer relatively new? (binary threshold)
    df['is_new_customer'] = (df['tenure_days'] < 90).astype(int)

    # Recency buckets (categorical from continuous)
    df['recency_bucket'] = pd.cut(
        df['days_since_purchase'],
        bins=[0, 7, 30, 90, 180, float('inf')],
        labels=['week', 'month', 'quarter', 'half_year', 'inactive'],
        include_lowest=True  # so a same-day purchase (0 days) lands in 'week' rather than NaN
    )

    return df

df = engineer_temporal_features(df)
print("Temporal features:")
print(df[['tenure_days', 'days_since_purchase', 'signup_month', 'recency_bucket']].head())

Notice how we've created multiple views of the same temporal information. Raw days since purchase is useful for precise calculations, but the bucketed version captures the intuition that there's a meaningful difference between "bought last week" and "bought last month", while the difference between "bought 90 days ago" and "91 days ago" is negligible.

Ratio and Interaction Features

Often the relationship between two features is more informative than either feature alone. A customer who has made 10 purchases and 5 returns tells a different story than one with 100 purchases and 5 returns, even though both have the same return count.

Creating ratio and interaction features
def engineer_ratio_features(df):
    """Create features from ratios and interactions."""
    df = df.copy()

    # Average order value
    df['avg_order_value'] = df['total_spend'] / df['total_purchases']

    # Return rate
    df['return_rate'] = df['returns'] / df['total_purchases']

    # Purchase frequency (purchases per month of tenure)
    df['purchase_frequency'] = df['total_purchases'] / (df['tenure_days'] / 30).clip(lower=1)

    # Email engagement rate
    df['email_engagement'] = df['email_opens'] / df['emails_sent'].clip(lower=1)

    # Support burden (tickets per purchase)
    df['support_ratio'] = df['support_tickets'] / df['total_purchases']

    # Spend velocity (spend per day of tenure)
    df['spend_velocity'] = df['total_spend'] / df['tenure_days'].clip(lower=1)

    return df

df = engineer_ratio_features(df)
print("Ratio features:")
print(df[['avg_order_value', 'return_rate', 'purchase_frequency', 'email_engagement']].describe())

The key insight here is normalisation. Raw totals conflate volume with behaviour. Return rate tells you about customer satisfaction regardless of how many purchases they've made. Spend velocity compares customers fairly regardless of how long they've been around.

Aggregation Features

When you have transactional data (multiple rows per entity), aggregating to create entity-level features is essential. You might compute means, sums, counts, or more exotic aggregations like the time between events.

Aggregation from transactional data
def aggregate_transactions(transactions_df, customer_id_col='customer_id'):
    """Aggregate transaction-level data to customer level."""

    agg_features = transactions_df.groupby(customer_id_col).agg({
        # Monetary features
        'transaction_amount': ['sum', 'mean', 'std', 'min', 'max'],

        # Frequency features
        'transaction_id': 'count',

        # Category diversity
        'product_category': 'nunique',

        # Time features
        'transaction_date': ['min', 'max']
    })

    # Flatten column names
    agg_features.columns = ['_'.join(col).strip() for col in agg_features.columns]
    agg_features = agg_features.reset_index()

    # Rename for clarity
    agg_features = agg_features.rename(columns={
        'transaction_amount_sum': 'total_spend',
        'transaction_amount_mean': 'avg_transaction',
        'transaction_amount_std': 'transaction_std',
        'transaction_id_count': 'transaction_count',
        'product_category_nunique': 'category_diversity',
    })

    # Fill NaN std for customers with single transaction
    agg_features['transaction_std'] = agg_features['transaction_std'].fillna(0)

    return agg_features


# Time-based aggregations
def time_based_aggregations(transactions_df, reference_date, windows=(7, 30, 90)):
    """Create features for different time windows."""

    all_features = []

    for window in windows:
        cutoff = reference_date - timedelta(days=window)
        recent = transactions_df[transactions_df['transaction_date'] >= cutoff]

        features = recent.groupby('customer_id').agg({
            'transaction_amount': ['sum', 'count']
        })

        features.columns = [f'spend_last_{window}d', f'purchases_last_{window}d']
        all_features.append(features)

    # Customers with no activity in a window get 0 rather than NaN
    return pd.concat(all_features, axis=1).fillna(0).reset_index()

Time-windowed features are particularly powerful for capturing recent behaviour. A customer's spend in the last 7 days is often more predictive of immediate future behaviour than their lifetime spend. Using multiple windows (7, 30, 90 days) lets the model learn which timeframe matters most.
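
One aggregation the code above doesn't show is the "time between events" mentioned earlier. Gaps between consecutive purchases often flag disengagement before lifetime totals do. Here is a minimal sketch, assuming a hypothetical transactions_df with customer_id and transaction_date (datetime) columns:

Inter-purchase gap features (illustrative sketch)
def inter_purchase_gaps(transactions_df):
    """Average and maximum days between consecutive purchases per customer."""
    ordered = transactions_df.sort_values(['customer_id', 'transaction_date']).copy()

    # Gap to the previous purchase by the same customer (NaT for their first purchase)
    ordered['gap_days'] = (
        ordered.groupby('customer_id')['transaction_date'].diff().dt.days
    )

    return (ordered.groupby('customer_id')['gap_days']
                   .agg(avg_gap_days='mean', max_gap_days='max')
                   .reset_index())

A long or growing average gap is a behavioural signal that lifetime spend and purchase counts simply can't capture.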

Categorical Feature Engineering

Categorical variables require special treatment. Beyond basic encoding, there are creative ways to extract signal from categories.

Engineering categorical features
def engineer_categorical_features(df, target_col=None):
    """Extract features from categorical columns."""
    df = df.copy()

    # Frequency encoding: replace category with its frequency
    country_freq = df['country'].value_counts(normalize=True)
    df['country_frequency'] = df['country'].map(country_freq)

    # Target encoding (careful: requires cross-validation to avoid leakage)
    if target_col and target_col in df.columns:
        target_means = df.groupby('country')[target_col].mean()
        df['country_target_mean'] = df['country'].map(target_means)

    # Binary flags for specific categories of interest
    df['is_uk'] = (df['country'] == 'UK').astype(int)
    df['is_us'] = (df['country'] == 'US').astype(int)

    # Aggregate statistics by category
    country_spend = df.groupby('country')['total_spend'].mean()
    df['country_avg_spend'] = df['country'].map(country_spend)

    # Deviation from category mean
    df['spend_vs_country_avg'] = df['total_spend'] - df['country_avg_spend']

    return df

df = engineer_categorical_features(df)
print("Categorical features:")
print(df[['country', 'country_frequency', 'country_avg_spend', 'spend_vs_country_avg']].head(10))

The "deviation from category mean" feature is particularly clever. It captures whether a customer spends more or less than typical for their country, which might be more predictive than raw spend. Context matters.

Polynomial and Nonlinear Features

Sometimes relationships between features and targets aren't linear. Creating polynomial features allows linear models to capture nonlinear patterns.

Polynomial features
from sklearn.preprocessing import PolynomialFeatures

def create_polynomial_features(df, columns, degree=2, interaction_only=False):
    """Create polynomial features from selected columns."""

    poly = PolynomialFeatures(degree=degree, interaction_only=interaction_only,
                              include_bias=False)

    # Fit and transform
    poly_features = poly.fit_transform(df[columns])

    # Get feature names
    feature_names = poly.get_feature_names_out(columns)

    # Create DataFrame
    poly_df = pd.DataFrame(poly_features, columns=feature_names, index=df.index)

    # Remove original columns (they're duplicated in poly output)
    poly_df = poly_df.drop(columns=columns)

    return pd.concat([df, poly_df], axis=1)


# Selective polynomial features
numeric_cols = ['total_spend', 'total_purchases', 'tenure_days']
df_poly = create_polynomial_features(df, numeric_cols, degree=2, interaction_only=True)

print("Interaction features:")
print([col for col in df_poly.columns if ' ' in col])

Setting interaction_only=True gives you products of features (like total_spend × total_purchases) without squares, which often captures the most useful nonlinearities while keeping the feature count manageable.

Domain-Specific Features

The most powerful features often come from domain knowledge. These are features that make sense in the context of your specific problem and wouldn't be discovered by automated feature generation.

RFM features for e-commerce
def create_rfm_features(df):
    """
    Create Recency-Frequency-Monetary (RFM) features.

    RFM is a classic customer segmentation framework from marketing.
    """
    df = df.copy()

    # Score each dimension from 1-5 using quintiles

    # Recency: lower is better (more recent)
    df['R_score'] = pd.qcut(df['days_since_purchase'], q=5,
                            labels=[5, 4, 3, 2, 1], duplicates='drop')

    # Frequency: higher is better
    df['F_score'] = pd.qcut(df['total_purchases'].rank(method='first'), q=5,
                            labels=[1, 2, 3, 4, 5], duplicates='drop')

    # Monetary: higher is better
    df['M_score'] = pd.qcut(df['total_spend'].rank(method='first'), q=5,
                            labels=[1, 2, 3, 4, 5], duplicates='drop')

    # Combined RFM score
    df['RFM_score'] = (df['R_score'].astype(int) +
                       df['F_score'].astype(int) +
                       df['M_score'].astype(int))

    # RFM segment string
    df['RFM_segment'] = (df['R_score'].astype(str) +
                         df['F_score'].astype(str) +
                         df['M_score'].astype(str))

    # Named segments based on RFM patterns
    def segment_name(row):
        r, f, m = int(row['R_score']), int(row['F_score']), int(row['M_score'])

        if r >= 4 and f >= 4:
            return 'champions'
        elif r >= 3 and f >= 3 and m >= 3:
            return 'loyal_customers'
        elif r >= 4 and f <= 2:
            return 'new_customers'
        elif r <= 2 and f >= 4:
            return 'at_risk'
        elif r <= 2 and f <= 2:
            return 'hibernating'
        else:
            return 'potential_loyalist'

    df['customer_segment'] = df.apply(segment_name, axis=1)

    return df

df = create_rfm_features(df)
print("RFM segments:")
print(df['customer_segment'].value_counts())

RFM segmentation has been used in marketing for decades because it works. The features capture real patterns in customer behaviour. Domain knowledge tells us that recent, frequent, high-value customers behave differently from dormant, infrequent, low-value ones. We encode that knowledge directly into features.

Feature Selection

More features isn't always better. Too many features can cause overfitting, slow training, and obscure which factors actually matter. After engineering features, selecting the best ones is crucial.

Feature selection techniques
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, method='importance', n_features=20):
    """Select top features using various methods."""

    if method == 'f_score':
        # ANOVA F-value for classification
        selector = SelectKBest(f_classif, k=n_features)
        selector.fit(X, y)
        scores = pd.Series(selector.scores_, index=X.columns)

    elif method == 'mutual_info':
        # Mutual information (captures nonlinear relationships)
        selector = SelectKBest(mutual_info_classif, k=n_features)
        selector.fit(X, y)
        scores = pd.Series(selector.scores_, index=X.columns)

    elif method == 'importance':
        # Random forest feature importance
        rf = RandomForestClassifier(n_estimators=100, random_state=42)
        rf.fit(X, y)
        scores = pd.Series(rf.feature_importances_, index=X.columns)

    # Return ranked features
    return scores.sort_values(ascending=False)


# Correlation-based removal
def remove_correlated_features(df, threshold=0.95):
    """Remove features with correlation above threshold."""
    corr_matrix = df.corr(numeric_only=True).abs()  # ignore any non-numeric columns

    # Upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # Find features with correlation greater than threshold
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]

    print(f"Removing {len(to_drop)} highly correlated features: {to_drop}")

    return df.drop(columns=to_drop)

Different selection methods capture different aspects. F-scores measure linear association with the target. Mutual information captures any statistical dependency, including nonlinear ones. Feature importance from tree models reflects which features actually helped make good splits during training.
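
Putting the two helpers together: the sketch below assumes a hypothetical binary 'churned' target (not created by the earlier code) and restricts to numeric columns, first pruning near-duplicate features and then ranking what remains.

Combining correlation removal with importance ranking (illustrative sketch)
# Hypothetical target: the earlier code never creates a 'churned' column
feature_cols = df.select_dtypes(include=[np.number]).columns.drop(
    ['customer_id', 'churned'], errors='ignore'
)

X = df[feature_cols]
y = df['churned']

# 1. Drop near-duplicate features
X_reduced = remove_correlated_features(X, threshold=0.95)

# 2. Rank the survivors by random forest importance
ranking = select_features(X_reduced, y, method='importance')
print(ranking.head(20))

# 3. Keep the top-ranked features for modelling
top_features = ranking.head(20).index.tolist()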

The Mindset of Feature Engineering

I've given you techniques, but the deeper lesson is about how to think. When you look at a dataset, ask yourself: what story does each feature tell? What stories are missing?

Every feature should answer a question relevant to your prediction task. If you're predicting customer churn, ask: what signals dissatisfaction? What indicates engagement? What patterns precede people leaving? Then engineer features that capture those signals.

Don't be afraid to create many features and then select the best ones. Feature engineering is exploratory. Some ideas won't pan out. Others will surprise you with their predictive power. The only way to know is to try.

And remember: domain knowledge is your secret weapon. The best features often come not from clever mathematics but from deep understanding of the problem domain. Talk to experts. Understand the business. Let that knowledge guide your feature creation.

The difference between a good model and a great one is often just a handful of well-crafted features that capture something real about the world. Finding those features is the art of feature engineering.
