HELM defines a taxonomy of scenarios (domain x task x metric) and selects a representative subset. Each of 30 models is evaluated on the same prompts under standardized conditions. Results for 7 metrics are reported per scenario and aggregated into a model profile. The platform is hosted by Stanford CRFM with public access to raw data.
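The aggregation step described above can be illustrated with a minimal sketch. It assumes a plain per-metric mean over scenarios, which is a simplification chosen for illustration and not necessarily the aggregation HELM itself uses. The model names, scenario names, and scores are hypothetical placeholders, not HELM data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (model, scenario, metric, score). Illustrative values only.
results = [
    ("model_a", "qa", "accuracy", 0.75),
    ("model_a", "qa", "toxicity", 0.125),
    ("model_a", "summarization", "accuracy", 0.5),
    ("model_a", "summarization", "toxicity", 0.375),
    ("model_b", "qa", "accuracy", 0.5),
    ("model_b", "qa", "toxicity", 0.0),
    ("model_b", "summarization", "accuracy", 0.25),
    ("model_b", "summarization", "toxicity", 0.25),
]


def build_profiles(rows):
    """Average each metric over all scenarios, yielding one profile per model.

    Assumption: an unweighted mean per metric; other aggregation rules
    (for example rank-based ones) would slot in at the same place.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for model, _scenario, metric, score in rows:
        scores[model][metric].append(score)
    return {
        model: {metric: mean(values) for metric, values in metrics.items()}
        for model, metrics in scores.items()
    }


profiles = build_profiles(results)
for model, profile in sorted(profiles.items()):
    print(model, profile)
```

Keeping every metric as a separate entry in the profile, rather than collapsing to a single score, is what lets trade-offs such as higher accuracy alongside higher toxicity remain visible.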
Fragmentation and selectivity in LLM evaluation: models were compared on different datasets with different metrics, making fair comparisons impossible and hiding important trade-offs (e.g. high accuracy with high toxicity).
Evaluating 30 models across 42 scenarios is computationally and financially expensive, limiting access to full evaluation.
Percy Liang and 49 co-authors introduce the framework; 30 models evaluated across 42 scenarios.
Version v2 extends the benchmark with 2023 models and new scenarios.
Evaluation framework independent of hardware architecture: evaluation runs via API or local model inference.
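One way to picture this independence is an evaluation loop that only sees a common completion interface, with the API and local backends hidden behind it. The sketch below is an illustration under that assumption; `ModelClient`, `ApiClient`, `LocalClient`, and `run_scenario` are hypothetical names and do not reflect HELM's actual code, and the two backends are stubbed with simple functions instead of real models.

```python
from typing import Callable, Protocol


class ModelClient(Protocol):
    """Hypothetical interface: anything that maps a prompt to a completion."""

    def complete(self, prompt: str) -> str: ...


class ApiClient:
    """Backend that forwards prompts to a remote service.

    `send` is a stand-in for a real network call (an assumption here).
    """

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send

    def complete(self, prompt: str) -> str:
        return self._send(prompt)


class LocalClient:
    """Backend that runs inference in-process.

    `generate` is a stand-in for a locally loaded model (an assumption here).
    """

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate

    def complete(self, prompt: str) -> str:
        return self._generate(prompt)


def run_scenario(client: ModelClient, prompts: list[str]) -> list[str]:
    """Send the same prompts to whichever backend is supplied."""
    return [client.complete(p) for p in prompts]


prompts = ["abc", "xyz"]
api_outputs = run_scenario(ApiClient(lambda p: p.upper()), prompts)
local_outputs = run_scenario(LocalClient(lambda p: p[::-1]), prompts)
print(api_outputs, local_outputs)
```

Because `run_scenario` depends only on the interface, the same prompts and scoring code apply regardless of where the model actually runs.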