Eval sets, ground truth, metrics
Prompt Evaluation
Introduction
Why eyeballing outputs isn't enough, how to build an eval set, what a golden set is, which metrics to pick (exact match, F1, pass@k, BERTScore, rubric scoring), and how to avoid common traps (cherry-picking, leakage, imbalanced classes).
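
To make the metric names above concrete, here is a minimal sketch of three of them in plain Python. The function names (exact_match, token_f1, pass_at_k) are hypothetical and the normalization choices (lowercasing, whitespace tokenization) are assumptions for illustration; real evaluation harnesses often normalize punctuation and articles as well. The pass@k function uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct.

```python
from collections import Counter
from math import comb


def exact_match(prediction: str, reference: str) -> bool:
    """True when the strings agree after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()


def token_f1(prediction: str, reference: str) -> float:
    """Simplified token-overlap F1 on lowercased whitespace tokens (an assumption)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct, assuming 1 <= k <= n."""
    # Fewer than k incorrect samples: every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For instance, exact_match(" Paris ", "paris") counts as a match under this normalization, while token_f1("the cat sat", "the cat") gives partial credit for the overlapping tokens. Exact match suits short, closed-form answers; token-level F1 tolerates extra or missing words; pass@k applies when several samples are drawn per task and any one passing is enough.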