Python — From Basics to Advanced · scikit-learn — Classical ML
Train/test split and cross-validation
Introduction
The most important ML skill is not training a model but evaluating whether it generalises beyond the training data. This lesson covers how to split data (train_test_split), how to cross-validate (KFold, StratifiedKFold, cross_val_score), and when you need TimeSeriesSplit (time-ordered data) or GroupKFold (several samples from the same patient or user). The main risk is data leakage: a scaler fit on the full X, target encoding computed on the full y, or a random split of time-series data. Each of these inflates scores during experiments and then leads to failure in production.
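As a rough orientation, the tools named above might fit together as in the sketch below. The synthetic dataset, the logistic regression model, and the `groups` array of patient ids are assumptions made up for illustration; only the scikit-learn splitters and helpers come from the lesson. The key point is that the scaler lives inside the pipeline, so it is re-fit on the training part of every fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GroupKFold,
    StratifiedKFold,
    TimeSeriesSplit,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not from the lesson).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold-out split; stratify=y keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler sits inside the pipeline, so cross_val_score re-fits it on the
# training part of each fold and never sees the validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final check on the test set, which was not used during cross-validation.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")

# Time-ordered data: each validation fold lies strictly after its training fold.
tscv = TimeSeriesSplit(n_splits=4)
time_ordered = all(
    train_idx.max() < val_idx.min() for train_idx, val_idx in tscv.split(X)
)

# Grouped data: hypothetical patient ids, 4 samples per patient.
groups = np.repeat(np.arange(50), 4)
gkf = GroupKFold(n_splits=5)
groups_disjoint = all(
    set(groups[train_idx]).isdisjoint(groups[val_idx])
    for train_idx, val_idx in gkf.split(X, y, groups=groups)
)
print(time_ordered, groups_disjoint)
```

The leaky variant would call `StandardScaler().fit(X)` on the full dataset before splitting; the code would still run, but the validation folds would already have influenced the preprocessing.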