Machine Learning Data Preprocessing: The Mistakes That Break Models Before Training
The model isn't the problem. Nine times out of ten, when a machine learning project falls apart — bad predictions, overfitting that training metrics didn't catch, inexplicable behavior on new data — the real failure happened two steps earlier, in preprocessing.
Poor preprocessing breaks models before they ever see a training loop.
🌐 This is a cross-post from mathisimple.com, where this guide is part of an interactive ML course covering the full supervised learning workflow.
Why Preprocessing Gets Skipped
The short version: it's unglamorous. Training a neural network feels like science. Imputing missing values feels like janitorial work.
But think about what raw data actually looks like before preprocessing. Missing entries filled with NaN. Age is stored as a string because someone once typed "35 years." The "churn" column uses 0/1 in some rows and True/False in others. One feature ranges from 0–1, the next from 0–1,000,000. You scraped the test labels before splitting.
These aren't edge cases. They're normal. And any of them can silently corrupt your model.
Step 1: Inspect Before Touching Anything
The first rule of preprocessing: don't change anything until you understand what you have.
Run a basic inspection on every new dataset:
import pandas as pd
df = pd.read_csv("customer_churn.csv")
print(df.shape) # rows × columns
print(df.dtypes) # data types (spot object columns that should be numeric)
print(df.isnull().sum()) # missing count per column
print(df.describe()) # distribution of numeric features
print(df["churn"].value_counts(normalize=True)) # class balance
That last line matters. If 97% of your "churn" labels are 0, you're working with a highly imbalanced dataset. Accuracy will be misleading. You'll need to adjust your evaluation metrics before you build anything.
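To see why accuracy misleads here, consider a minimal sketch with made-up labels (97 negatives, 3 positives) and a "model" that always predicts the majority class. The arrays are illustrative assumptions, not data from the churn file.

```python
import numpy as np

# Hypothetical labels: 97% class 0, 3% class 1
y_true = np.array([0] * 97 + [1] * 3)

# A do-nothing baseline that always predicts "no churn"
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall on the positive class: share of real churners the model caught
true_positives = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_positives / (y_true == 1).sum()

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The baseline scores high on accuracy while catching none of the churners, which is why metrics such as recall, precision, or F1 are usually a better fit for imbalanced labels.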
Common things to look for in the inspection:
| Issue | Symptom | Example |
|---|---|---|
| Wrong types | Column is object, should be float | "$1,200" instead of 1200 |
| Hidden missing values | Non-NaN strings that mean "unknown" | "N/A", "none", "-" |
| Outliers | Mean far from median in describe() | Age = 847 |
| Duplicates | Row count doesn't match business logic | Same order_id appears twice |
| Class imbalance | One label dominates | 98% negative, 2% positive |
Don't rush this. The 20 minutes spent here will save you hours of debugging later.
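Two of the issues in the table, wrong types and hidden missing values, can often be repaired with a few lines of pandas. The sketch below uses a small made-up frame; the column names and the list of placeholder strings are assumptions you would adapt to your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw columns that should be numeric but arrived as text
raw = pd.DataFrame({
    "price": ["$1,200", "950", "N/A", "-"],
    "age": ["35 years", "41", "none", "29"],
})

# Turn placeholder strings into real missing values
placeholders = ["N/A", "none", "-"]
clean = raw.replace(placeholders, np.nan)

# Strip currency symbols and thousands separators, then convert
clean["price"] = pd.to_numeric(
    clean["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Keep only the leading digits ("35 years" becomes 35)
clean["age"] = pd.to_numeric(
    clean["age"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

print(clean.dtypes)
print(clean.isnull().sum())
```

Using errors="coerce" turns anything that still fails to parse into NaN, so re-check the missing counts afterwards to see how many values were lost.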
Step 2: Handle Missing Values — Carefully
There are three honest options when a value is missing: drop the row, drop the column, or impute.
Drop the row if: less than 1% of your data is affected and the missingness is random (not systematic).
Drop the column if: more than 30–40% of values are missing and the feature isn't critical. A mostly-empty column contributes more noise than signal.
Impute if: the column is important and missingness is moderate. The method depends on the data type:
from sklearn.impute import SimpleImputer
# Numeric: use median (more robust than mean for skewed distributions)
imputer_num = SimpleImputer(strategy="median")
df["income"] = imputer_num.fit_transform(df[["income"]])
# Categorical: use most frequent value
imputer_cat = SimpleImputer(strategy="most_frequent")
df["job_type"] = imputer_cat.fit_transform(df[["job_type"]])
The mistake to avoid: using the mean for numeric imputation when the distribution is skewed. If your income column has a long right tail (most people earn $40–80K, a few earn $1M+), the mean might be $120K while the median sits closer to $55K. Imputing with the mean will inject unrealistic values and distort the feature distribution.
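A quick way to check this on your own column is to compare the two statistics directly. The incomes below are invented for illustration, with one very high earner standing in for the long right tail.

```python
import statistics

# Hypothetical incomes: eight typical earners and one outlier
incomes = [40_000, 48_000, 52_000, 55_000, 58_000,
           65_000, 72_000, 80_000, 1_200_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

One outlier pulls the mean far above what any typical person in the list earns, while the median stays in the middle of the ordinary values.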
A subtler mistake: imputing with statistics computed across the full dataset before splitting into train and test. That's data leakage — the test set has influenced the imputation. Always fit your imputer on training data only, then apply it to test data.
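A minimal sketch of the leak-free version, assuming the data has already been split. The tiny arrays are placeholders; the point is that fit_transform is called on the training rows and only transform on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical income column, already split into train and test
X_train = np.array([[40_000.0], [55_000.0], [np.nan], [70_000.0]])
X_test = np.array([[np.nan], [1_000_000.0]])

imputer = SimpleImputer(strategy="median")

# Learn the median from training rows only
X_train_imputed = imputer.fit_transform(X_train)

# Reuse the training median for the test rows; no refitting
X_test_imputed = imputer.transform(X_test)

print(imputer.statistics_)   # median learned from the training data
print(X_test_imputed)
```

The extreme value in the test set never influences the fill value, because the imputer has not seen it during fitting.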
Step 3: Encode Categorical Features
Machine learning models work with numbers. The encoding method you choose affects your model more than most people realize.
One-Hot Encoding
Convert each category into a binary column. Best for features with no ordinal relationship — categories that are just labels.
df = pd.get_dummies(df, columns=["job_type", "city"], drop_first=True)
# drop_first=True removes one dummy column per feature (avoids multicollinearity)
If job_type has 5 unique values, one-hot encoding with drop_first=True creates 4 new binary columns (5 without it). That's fine for 5 categories. It becomes a problem at 500, where you're adding 499 sparse columns. High-cardinality features need different treatment.
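The column arithmetic is easy to verify on a toy frame. The job titles here are made up; only the count of distinct values matters.

```python
import pandas as pd

# Hypothetical feature with 5 distinct categories
jobs = pd.DataFrame(
    {"job_type": ["engineer", "teacher", "nurse", "sales", "manager"]}
)

with_drop = pd.get_dummies(jobs, columns=["job_type"], drop_first=True)
without_drop = pd.get_dummies(jobs, columns=["job_type"])

print(with_drop.shape[1], "columns with drop_first=True")
print(without_drop.shape[1], "columns without it")
print(list(with_drop.columns))
```

The dropped column is the first category in sorted order, and it becomes the implicit baseline: a row with all zeros belongs to it.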
Label Encoding (Ordinal)
Assign integers to categories. Only use this when there's a genuine order that the model should respect.
from sklearn.preprocessing import OrdinalEncoder
# Education level genuinely increases: high school < bachelor's < master's < PhD
enc = OrdinalEncoder(categories=[["high school", "bachelors", "masters", "phd"]])
df["education"] = enc.fit_transform(df[["education"]])
The mistake to avoid: applying label encoding to nominal categories. If city becomes 0=Austin, 1=Boston, 2=Chicago, linear and distance-based models will treat those codes as magnitudes: Chicago counts as "twice Boston," and Austin sits twice as far from Chicago as it does from Boston. That's meaningless. Use one-hot encoding for nominal features, not label encoding.
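The distortion shows up as soon as you measure distances, which is what a model like kNN does. This sketch compares pairwise Euclidean distances between the three cities under both encodings; the helper function name is an illustrative choice.

```python
import itertools
import numpy as np

cities = ["Austin", "Boston", "Chicago"]

# Label encoding: one integer per city
label_codes = {city: np.array([float(i)]) for i, city in enumerate(cities)}

# One-hot encoding: one indicator vector per city
one_hot = {city: np.eye(len(cities))[i] for i, city in enumerate(cities)}

def pairwise_distances(encoding):
    """Euclidean distance between every pair of cities."""
    return {
        (a, b): float(np.linalg.norm(encoding[a] - encoding[b]))
        for a, b in itertools.combinations(cities, 2)
    }

label_dist = pairwise_distances(label_codes)
one_hot_dist = pairwise_distances(one_hot)

print("label:  ", label_dist)
print("one-hot:", one_hot_dist)
```

With integer labels, Austin ends up twice as far from Chicago as from Boston, purely because of alphabetical order. With one-hot vectors every pair of cities is the same distance apart, which is the honest statement for a nominal feature.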
Target Encoding (High Cardinality)
For features with hundreds of categories (zip codes, user IDs, product names), encoding each as binary is impractical. Target encoding replaces each category with the mean of the target variable for that category, learned from training data. It's powerful but requires careful regularization to avoid leakage.
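One common form of regularization is smoothing: blend each category's mean with the global mean, weighted by how many rows the category has. The sketch below is a hand-rolled version under that assumption, not a library API; the function name, the toy data, and the smoothing value of 5.0 are all illustrative.

```python
import pandas as pd

# Hypothetical training and test data
train = pd.DataFrame({
    "zip_code": ["10001", "10001", "10001", "94105", "94105", "60601"],
    "churn":    [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"zip_code": ["10001", "60601", "73301"]})

def fit_target_encoding(df, column, target, smoothing=5.0):
    """Learn a smoothed per-category target mean from training rows only."""
    global_mean = df[target].mean()
    stats = df.groupby(column)[target].agg(["mean", "count"])
    # Rare categories are pulled toward the global mean
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed, global_mean

mapping, global_mean = fit_target_encoding(train, "zip_code", "churn")

# Categories never seen in training fall back to the global mean
test["zip_code_te"] = test["zip_code"].map(mapping).fillna(global_mean)
print(test)
```

The zip code with a single training row lands close to the global mean instead of jumping to 1.0. Note that encoding the training rows with this same mapping still lets each row see its own label, so training-set encodings are usually computed out-of-fold.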
Step 4: Scale Numeric Features
Some models are scale-invariant (decision trees, random forests — they split on thresholds, not distances). Others are emphatically not. kNN, SVM, PCA, neural networks, and regularized regression all compute distances or magnitudes. Unscaled features will distort their behavior.
Standardization (z-score scaling): subtract mean, divide by standard deviation. The result has mean 0 and standard deviation 1.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[["age", "income", "credit_score"]] = scaler.fit_transform(df[["age", "income", "credit_score"]])
Min-max scaling: compress to [0, 1] range. Sensitive to outliers (one extreme value will crowd everything else toward 0 or 1).
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[["pixel_value"]] = scaler.fit_transform(df[["pixel_value"]])
Which to use: standardization is the default for most tabular ML. Min-max makes sense when you need a bounded range (image pixel values, neural network inputs with bounded activation functions).
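Both formulas are short enough to compute by hand, which makes the outlier sensitivity of min-max scaling easy to see. The values below are made up, with one extreme entry.

```python
import numpy as np

# Hypothetical feature with a single extreme value
values = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])

# Standardization: subtract the mean, divide by the standard deviation.
# np.std defaults to the population formula, which StandardScaler also uses.
z_scores = (values - values.mean()) / values.std()

# Min-max scaling: map the minimum to 0 and the maximum to 1
min_max = (values - values.min()) / (values.max() - values.min())

print("z-scores:", np.round(z_scores, 3))
print("min-max: ", np.round(min_max, 4))
```

After min-max scaling, the four ordinary values are squeezed into a sliver near 0 and become almost indistinguishable. Standardization is also affected by the outlier, but the result is not forced into a fixed range.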
The critical rule: fit the scaler on training data only. Apply it to test/validation data using the training statistics. If you compute the mean and standard deviation across all data before splitting, you've leaked test information into your features.
# CORRECT
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # uses training mean/std
# WRONG — this leaks test statistics into your training preprocessing
scaler.fit(X_all)
Step 5: Split the Data — the Right Way
The train-test split is where most data leakage originates. The rule is simple: no information from the test set should influence anything in your training pipeline. It's violated constantly.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # maintains class balance in both splits
)
stratify=y is worth remembering. Without it, random splitting on an imbalanced dataset can put most of your rare class examples into training, leaving you with an unrepresentative test set.
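You can confirm what stratification does by checking the positive rate in each split. This sketch uses synthetic labels (90 negatives, 10 positives) as a stand-in for a real target column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 10% positive class
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # keep the 90/10 ratio in both splits
)

print("train positive rate:", y_train.mean())
print("test positive rate: ", y_test.mean())
```

Both splits keep the same share of positives as the full dataset. Without stratify, the number of positives in a 20-row test set would vary from seed to seed.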
The Preprocessing Order That Prevents Leakage
1. Load raw data
2. Split into train/test
3. Inspect and explore training set only
4. Fit all transformers (scalers, imputers, encoders) on training set
5. Apply fitted transformers to test set
6. Train model on training set
7. Evaluate on test set
If you fit an imputer, scaler, or encoder, or engineer any feature from dataset-wide statistics, before step 2, you've already leaked.
Step 6: Feature Engineering (the Part That Often Matters Most)
Standard preprocessing is about cleaning and transforming existing features. Feature engineering creates new ones — and it's often where the real model performance gains come from.
A few patterns worth knowing:
Interaction features: multiply, divide, or otherwise combine existing columns to capture relationships that no single column expresses. If you're predicting loan default, income / debt can be more predictive than income and debt used separately.
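Log transform: right-skewed distributions (income, housing prices, clicks) can hurt linear models significantly. Taking log(x+1), which is np.log1p(x) in NumPy, often makes them far more symmetric. It is meant for non-negative values such as counts and amounts.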
Date decomposition: extract year, month, day-of-week, hour, days until a deadline from timestamp columns. Raw timestamps are almost never useful as-is.
Binning continuous features: convert age into [18-25, 26-35, 36-50, 51+] bins. Useful when you suspect the relationship isn't linear or when the feature has lots of noise.
None of these are always correct. They're hypotheses about what structure might be useful. You test them through cross-validation.
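The four patterns above can be sketched in a few lines of pandas. The frame, column names, and bin edges below are illustrative assumptions; each new column is a candidate to test, not a guaranteed improvement.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data
df = pd.DataFrame({
    "income": [40_000, 85_000, 1_200_000],
    "debt": [10_000, 50_000, 300_000],
    "age": [22, 41, 67],
    "signup_time": pd.to_datetime(
        ["2024-01-15 09:30", "2024-06-01 18:45", "2024-12-24 23:10"]
    ),
})

# Interaction feature (guard against zero debt in real data)
df["income_to_debt"] = df["income"] / df["debt"]

# Log transform for a right-skewed column
df["log_income"] = np.log1p(df["income"])

# Date decomposition
df["signup_month"] = df["signup_time"].dt.month
df["signup_dayofweek"] = df["signup_time"].dt.dayofweek  # Monday = 0
df["signup_hour"] = df["signup_time"].dt.hour

# Binning: intervals are (17, 25], (25, 35], (35, 50], (50, 120]
df["age_bin"] = pd.cut(
    df["age"],
    bins=[17, 25, 35, 50, 120],
    labels=["18-25", "26-35", "36-50", "51+"],
)

print(df.drop(columns="signup_time"))
```

All of these are row-wise transformations with fixed rules, so they can be applied before or after the split without leaking. Bins or thresholds derived from the data itself (quantiles, for instance) should be learned on the training set only.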
The Whole Pipeline in One Place
In production, use sklearn's Pipeline to prevent leakage mechanically rather than relying on discipline:
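from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column lists reuse the feature names from the earlier examples
numeric_features = ["age", "income", "credit_score"]
categorical_features = ["job_type", "city"]

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Combining drop with handle_unknown="ignore" needs scikit-learn 1.0 or newer;
    # categories unseen in training are encoded as all zeros
    ("encoder", OneHotEncoder(drop="first", handle_unknown="ignore"))
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features)
])
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression())
])

# X_train / X_test come from the train_test_split in Step 5
full_pipeline.fit(X_train, y_train)         # transformers are fit on training data only
print(full_pipeline.score(X_test, y_test))  # test data is transformed, never fit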
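The pipeline fits every transformer on the training data and applies the fitted versions to the test data automatically. As long as you split before calling fit, preprocessing leakage is ruled out by construction.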
Frequently Asked Questions
How do I know which columns need scaling?
Columns that will be used in distance-based or gradient-based models (kNN, SVM, linear regression, neural networks, PCA) need scaling. Tree-based models (decision trees, random forests, gradient boosting) do not — they split on thresholds, so the absolute scale is irrelevant. If you're unsure which model you'll end up with, scale everything. It doesn't hurt tree-based models and is required for others.
Is it okay to drop all rows with missing values?
Depends on how much data you'd lose and whether the missingness is random. Dropping 0.5% of rows is fine. Dropping 30% is dangerous — you're substantially changing your training distribution. Also consider: if missingness correlates with an important feature (e.g., income tends to be missing for high earners who don't disclose), dropping those rows introduces bias.
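Before dropping anything, it helps to measure both numbers: how many rows you would lose, and whether the rows with missing values look different from the rest. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns
df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, np.nan, 48_000, 75_000],
    "job_type": ["nurse", "teacher", None, "sales", "nurse", "teacher"],
    "age": [34, 29, 45, 51, 38, 42],
})

# Share of rows that dropna() would remove
rows_with_missing = df.isna().any(axis=1)
fraction_lost = rows_with_missing.mean()

# Share of missing values per column
missing_per_column = df.isna().mean()

# Rough check for systematic missingness: compare another feature
# between rows where income is missing and rows where it is present
age_by_income_missing = df.groupby(df["income"].isna())["age"].mean()

print(f"rows lost by dropna: {fraction_lost:.0%}")
print(missing_per_column)
print(age_by_income_missing)
```

If the two groups differ noticeably on other features, the missingness is probably not random, and dropping those rows will shift your training distribution.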
When should I use cross-validation instead of a single train-test split?
Almost always when your dataset is small (roughly under 10,000 rows). A single 80/20 split gives you one estimate of performance that's sensitive to which 20% happened to end up in test. K-fold cross-validation gives you k estimates; averaged together, they're more reliable. Use stratified k-fold for imbalanced classification problems. Reserve a final held-out test set for your final model evaluation. Don't tune on cross-validation folds and then report those numbers as your test performance.
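A minimal sketch of stratified k-fold with a pipeline, so that scaling is refit inside every fold. The synthetic dataset stands in for your training portion, and ROC AUC is an assumed metric choice for the imbalanced labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (about 10% positives)
X_train, y_train = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Each fold keeps roughly the same class ratio as the full training set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler is fit on 4 folds and applied to the 5th, five times over
scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="roc_auc")

print("per-fold ROC AUC:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

The spread across folds is as informative as the mean: a large standard deviation tells you a single split could have been badly misleading.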
Try the Interactive Version
This preprocessing pipeline is demonstrated interactively on mathisimple.com, where you can apply each step to a live dataset and see how the distributions and model predictions change.
👉 Open the interactive preprocessing tutorial → mathisimple.com
You can:
- Introduce missing values and see different imputation strategies compared side by side
- Swap between one-hot and label encoding and watch how a kNN model's decision boundary shifts
- Apply scalers before/after splitting and observe the leakage effect on test performance
