The dataset contains questions from official exams, grouped by type: multiple-choice (MC), free-text, and math problems. Models are evaluated both zero-shot and few-shot. Results are compared against average human performance for each exam.
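As a rough illustration of the protocol described above, the following Python sketch builds zero-shot and few-shot prompts for a multiple-choice question and compares model accuracy with a human average. It is a minimal sketch, not the authors' evaluation code: the `MCQuestion` record, the prompt layout, and the helper names `build_prompt`, `accuracy`, and `gap_to_human` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MCQuestion:
    # Hypothetical record layout; the real dataset schema may differ.
    question: str
    options: Sequence[str]
    answer: str  # gold option letter, e.g. "B"


def format_question(q: MCQuestion, with_answer: bool = False) -> str:
    """Render one question; include the gold answer only for demonstrations."""
    letters = "ABCDEFGH"
    lines = [f"Question: {q.question}"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(q.options)]
    lines.append(f"Answer: {q.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(target: MCQuestion, demos: Sequence[MCQuestion] = ()) -> str:
    """Zero-shot when `demos` is empty, few-shot otherwise."""
    blocks = [format_question(d, with_answer=True) for d in demos]
    blocks.append(format_question(target))
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predicted option letters that match the gold letters."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, golds)
    )
    return correct / len(golds)


def gap_to_human(model_acc: float, human_avg: float) -> float:
    """Positive when the model exceeds the average human score."""
    return model_acc - human_avg
```

Calling `build_prompt(q)` yields a zero-shot prompt, while `build_prompt(q, demos)` prepends solved demonstrations for the few-shot setting. The same `accuracy` function can score both settings, so the two can be compared with the human average on equal terms.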
Benchmarks built on artificial datasets may not reflect the difficulty of tasks that AI models encounter in real-world applications. AGIEval instead grounds evaluation in human cognition and decision-making by using exams designed to assess human competencies.
A subset of tasks is in Chinese, which may skew results for models weaker in that language.
Zhong et al. from Microsoft Research introduce AGIEval, a benchmark built from qualification exams. GPT-4 surpasses average human performance on the SAT and LSAT.
The benchmark is text-based and independent of hardware.