Machine Learning · Overfitting, Underfitting, and Regularization

Cross-validation — k-fold, stratified, nested, and practical pitfalls

Introduction

Cross-validation (CV) is the standard procedure for estimating generalization error and selecting hyperparameters. Instead of relying on a single train/test split, we partition the data into k folds and train k times, each time fitting on k−1 folds and testing on the one fold held out.

This lesson covers: k-fold CV (the formula and the practical choice of k); stratified k-fold for imbalanced classification; leave-one-out CV (LOOCV) and its high-variance pitfall; repeated k-fold; time-series CV (TimeSeriesSplit), which prevents leakage of future information in temporal data; group k-fold for repeated measurements (e.g., multiple samples per patient); nested CV, which is required for an unbiased error estimate when hyperparameter tuning and error estimation share the same data; and Breiman's one-standard-error (1-SE) rule. It also covers the most common mistakes: preprocessing outside the fold (data leakage), using the test set for model selection, and using shuffle=True for time series.

We build on: Stone 1974 (Cross-validatory choice), Geisser 1975, Kohavi 1995 ("A study of cross-validation and bootstrap"), Hastie, Tibshirani & Friedman (ESL ch. 7), and Varma & Simon 2006 (bias from improper CV).
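To make the k-fold procedure concrete, here is a minimal standard-library sketch. It is an illustration under stated assumptions, not the lesson's reference implementation: the helper names k_fold_indices and cv_mse, the toy "predict the training mean" model, and the data y are all hypothetical. The CV estimate it returns is the plain average of the per-fold errors, CV_k = (1/k) · Σ err_i. Note that the only quantity learned from data (the mean) is computed inside the loop from the training fold alone, which is the pattern that avoids the preprocessing leakage mentioned above.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold CV (hypothetical helper)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffling is NOT appropriate for time series
    # Strided slicing gives k disjoint folds whose sizes differ by at most one.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cv_mse(y, k=5, seed=0):
    """k-fold CV estimate of the MSE of a toy 'predict the training mean' model."""
    fold_errors = []
    for train, test in k_fold_indices(len(y), k, seed):
        # Everything fitted from data uses the training fold only.
        mu = mean(y[j] for j in train)
        # Error is measured on the held-out fold.
        fold_errors.append(mean((y[j] - mu) ** 2 for j in test))
    # CV estimate: average of the k per-fold errors.
    return mean(fold_errors), fold_errors


y = [float(v) for v in range(20)]  # toy data, assumed for illustration
score, per_fold = cv_mse(y, k=5)
print(f"CV MSE over {len(per_fold)} folds: {score:.3f}")
```

In practice one would normally use a library's splitters rather than hand-rolled index code; the sketch is only meant to show that each observation is tested exactly once and never appears in the training set of its own fold.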