Scikit-learn pipeline and model selection

Ensembles and Model Selection

Introduction

A scikit-learn Pipeline (Pedregosa et al. 2011) chains preprocessing steps and a final estimator into a single estimator object with the usual fit/predict interface. It solves two problems:

(1) Data leakage. When StandardScaler.fit() or SimpleImputer.fit() computes its statistics on the entire dataset, including the validation fold, information from the validation fold leaks into training. When a Pipeline is passed to a cross-validation routine, every step is refit on the training portion of each split only, and the validation portion is merely transformed.
(2) Reproducible deployment. One serialized Pipeline contains the full preprocessing, so production applies exactly the transformations that were fitted during training.

ColumnTransformer applies different preprocessing to different columns (numerical → StandardScaler, categorical → OneHotEncoder).

Model selection uses GridSearchCV (exhaustive search over a hyperparameter grid) or RandomizedSearchCV (random samples drawn from distributions). Bergstra & Bengio (2012) showed that in high-dimensional hyperparameter spaces random search is more efficient than grid search, because typically only a few of the hyperparameters matter much.

Nested cross-validation (Cawley & Talbot 2010) separates two roles: the inner CV selects hyperparameters and the outer CV estimates generalization. This avoids the optimistic bias that arises when the same data are used for both tuning and evaluation.
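The pieces above can be combined in one short sketch. Everything specific here is an assumption made for illustration: the synthetic data, the column layout (two numerical columns plus one categorical column stored as integer codes), the choice of LogisticRegression, the search range for C, and the fold counts. It shows a ColumnTransformer inside a Pipeline, a RandomizedSearchCV as the inner loop, and cross_val_score as the outer loop of a nested cross-validation.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (illustrative only): two numerical columns and one
# categorical column stored as integer codes 0/1/2.
rng = np.random.default_rng(0)
n = 200
num = rng.normal(size=(n, 2))
cat = rng.integers(0, 3, size=n)
y = (num[:, 0] + 0.5 * (cat == 2) + rng.normal(scale=0.5, size=n) > 0).astype(int)
num[rng.random(n) < 0.05, 1] = np.nan  # a few missing values to impute
X = np.column_stack([num, cat])

# Per-column preprocessing: numerical -> impute + scale, categorical -> one-hot.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])

# Preprocessing and classifier form one estimator, so every fit() call
# recomputes the imputer/scaler statistics on the data it receives.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner CV: random search over the regularization strength C.
search = RandomizedSearchCV(
    model,
    param_distributions={"clf__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=3,
    random_state=0,
)

# Outer CV: each outer training fold runs its own inner search, and the
# held-out outer fold is used only for scoring.
outer_scores = cross_val_score(search, X, y, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer scores estimate how the whole procedure (preprocessing, tuning, fitting) generalizes; they do not belong to any single hyperparameter setting. For deployment, one plausible route is to call search.fit(X, y) on all available data and serialize search.best_estimator_, which is itself a fitted Pipeline, for example with joblib.dump.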