Inference

ZSL

2008 · Active · Published: 28 May 2026 · Updated: 28 May 2026

Key innovation

Making predictions for classes or tasks the model has never seen during training, via a shared semantic space (attributes, text embeddings) or natural-language instructions — without a single labeled example of the target class.

How it works

Classic vision ZSL: (1) Each class (seen and unseen) is assigned a semantic vector a_c ∈ ℝ^d: binary attributes, class-name embeddings (Word2Vec, GloVe) or embeddings of descriptive sentences. (2) A mapping function f: X → ℝ^d from images into the semantic space is trained using only seen classes S, typically by minimizing a compatibility loss (cosine, ranking) between f(x) and a_y for ground-truth y, or by training an attribute classifier directly (Lampert's DAP/IAP). (3) At inference, for a test image x, ĉ = argmax_{c ∈ U} sim(f(x), a_c); in the generalized setting the argmax runs over S ∪ U.

Modern CLIP-style zero-shot: a contrastive learner trains a joint image-text embedding space on a huge corpus of pairs (Conceptual Captions, LAION, WIT). Zero-shot classification compares the image embedding with the text-prompt embedding of each class and picks the closest one, with no fine-tuning needed.

Zero-shot prompting in LLMs: the model is given a task instruction in natural language ("Translate this sentence into French:") and performs it because pretraining on a massive corpus has covered similar patterns. The absence of demonstrations distinguishes zero-shot from few-shot / in-context learning.
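
The classic vision recipe (steps 1 to 3 above) can be sketched on toy data. This is a minimal illustration, not a reference implementation: the ridge-regression mapping, the synthetic "image features", the attribute vectors, and the names fit_semantic_map and predict are all assumptions made for the example.

```python
import numpy as np

def fit_semantic_map(X_seen, y_seen, A_seen, reg=1.0):
    """Step 2: ridge regression W so that x @ W approximates a_y, using seen classes only."""
    targets = A_seen[y_seen]                                  # (n, d) semantic vector per label
    gram = X_seen.T @ X_seen + reg * np.eye(X_seen.shape[1])  # (p, p)
    return np.linalg.solve(gram, X_seen.T @ targets)          # (p, d)

def predict(X, W, A_candidates):
    """Step 3: argmax over candidate classes of the cosine similarity between f(x) and a_c."""
    Z = X @ W
    Z = Z / (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    A = A_candidates / (np.linalg.norm(A_candidates, axis=1, keepdims=True) + 1e-12)
    return np.argmax(Z @ A.T, axis=1)

# Step 1: toy attribute vectors (d = 3) for seen classes S and unseen classes U.
A_seen = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
A_unseen = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])

# Synthetic stand-in for image features: a fixed linear function of the attributes plus noise.
rng = np.random.default_rng(0)
M = rng.normal(size=(3, 8))

def sample(A, labels):
    return A[labels] @ M + 0.05 * rng.normal(size=(len(labels), 8))

y_seen = rng.integers(0, 3, size=300)
y_unseen = rng.integers(0, 2, size=100)

W = fit_semantic_map(sample(A_seen, y_seen), y_seen, A_seen)  # trained without any U example
pred = predict(sample(A_unseen, y_unseen), W, A_unseen)       # evaluated on U only
accuracy = float(np.mean(pred == y_unseen))
print(f"unseen-class accuracy: {accuracy:.2f}")
```

The unseen classes are recognized only because their attribute vectors are combinations of attributes that the mapping learned from seen classes; with real images a deep encoder would replace the linear toy features.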

Problem solved

How to recognize classes or perform tasks for which labeled data cannot be collected — because examples are scarce (rare species, rare diseases), continuously emerging (new products, new user intents) or because annotation is prohibitively expensive. ZSL transfers knowledge from seen to unseen classes/tasks via a shared semantic representation.

Components

Semantic space · Representation of unseen classes

Shared vector space in which each class can be described independently of labeled images — attributes, word embeddings, embeddings of textual descriptions, or text-encoder outputs.

Class semantic vectors a_c · Side information for unseen classes U

Vector describing each class c — in classical ZSL handcrafted attributes (Animals with Attributes, CUB); in modern ZSL the embedding of a prompt such as "a photo of a {class}".

Compatibility / scoring function · Prediction and loss

Function s(x, c) measuring the fit of input x to class c via similarity in the semantic space (cosine, dot product, ranking) — used during both training and inference.
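
As a rough illustration of how one scoring function can serve both prediction and loss, the sketch below uses cosine similarity as s(x, c) and a hinge ranking loss over it, in the style of ranking losses used for visual-semantic embeddings. The margin value and the function names are assumptions for the example.

```python
import numpy as np

def cosine_scores(z, A):
    """s(x, c) for every class c: cosine between the embedded input z = f(x) and each a_c."""
    z = z / np.linalg.norm(z)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    return A @ z

def hinge_rank_loss(z, A, y, margin=0.1):
    """Penalize every wrong class whose score comes within `margin` of the true class score."""
    s = cosine_scores(z, A)
    violations = np.maximum(0.0, margin - s[y] + s)
    violations[y] = 0.0                      # the true class never counts against itself
    return float(violations.sum())

A = np.eye(3)                                # three toy class vectors
z = np.array([1.0, 0.0, 0.0])                # an input embedded exactly on class 0

loss_correct = hinge_rank_loss(z, A, y=0)    # true class already wins by more than the margin
loss_wrong = hinge_rank_loss(z, A, y=1)      # true class scores below class 0
predicted = int(np.argmax(cosine_scores(z, A)))
```

At inference the same cosine_scores call is reused with an argmax, which is why the component is listed under both prediction and loss.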

Visual / input encoder · Mapping inputs into the semantic space

Network encoding the input (image, audio, text) into a vector comparable with class semantic vectors — ResNet/ViT in CLIP, an LLM encoder for zero-shot NLP.

Implementation

Reference implementations

OpenAI CLIP

Python (PyTorch) · OpenAI

Official

OpenCLIP

Python (PyTorch) · LAION / ML Foundations

Hugging Face — Zero-shot classification pipeline

Python · Hugging Face

Official

Animals with Attributes 2 (AwA2) — ZSL benchmark

Dataset · IST Austria

Official

Implementation pitfalls

Data leakage: "unseen" classes present in pretraining · Critical

A very common ZSL pitfall: classes in U appear in pretraining data (e.g. an ImageNet-pretrained backbone where U ⊂ ImageNet). Reported numbers are then inflated.

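Fix: Use the proposed splits of the GBU benchmark (Xian et al. 2017), which ensure that no class in U overlaps with the ImageNet-1K classes used to pretrain the backbone. In GZSL, report seen accuracy, unseen accuracy, and their harmonic mean.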
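
When the pretraining label set is known (as with an ImageNet-pretrained backbone), a simple name-overlap check can flag obvious leakage before evaluation. The sketch below is a hypothetical helper, not part of any benchmark toolkit; it only catches matching names, so synonyms, subclasses, and web-scale corpora without a label list need a more careful audit.

```python
def leaked_classes(unseen_classes, pretraining_classes):
    """Return the unseen class names that also appear in the pretraining label set."""
    def normalize(name):
        return name.strip().lower().replace("_", " ")

    pretrain = {normalize(c) for c in pretraining_classes}
    return sorted(c for c in unseen_classes if normalize(c) in pretrain)

# Toy label lists, made up for illustration.
unseen = ["Zebra", "blue_whale", "okapi"]
pretraining = ["zebra", "blue whale", "tabby cat"]

overlap = leaked_classes(unseen, pretraining)
if overlap:
    print("Unseen classes present in pretraining labels:", overlap)
```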

Seen vs unseen bias in generalized ZSL · High

A model trained only on seen classes assigns them systematically higher scores; in GZSL, where the label space is S ∪ U, almost every test input is classified into S and almost none into U.

Fix: Calibrated stacking (subtract a constant from seen-class scores), generative ZSL (synthesize pseudo-examples of unseen classes), temperature calibration.
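
Calibrated stacking is small enough to show in full. The sketch below subtracts a constant gamma from seen-class scores before the argmax and also computes the harmonic mean of seen and unseen accuracy commonly reported for GZSL. The score values are made up, and gamma would normally be tuned on held-out validation classes.

```python
import numpy as np

def calibrated_predict(scores, seen_mask, gamma):
    """Subtract gamma from the columns of seen classes, then take the argmax."""
    adjusted = scores - gamma * seen_mask
    return np.argmax(adjusted, axis=1)

def harmonic_mean(acc_seen, acc_unseen):
    """GZSL summary metric: high only when both seen and unseen accuracy are high."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Columns 0 and 1 are seen classes, column 2 is an unseen class.
seen_mask = np.array([1.0, 1.0, 0.0])
scores = np.array([
    [0.90, 0.80, 0.85],   # an unseen-class input narrowly losing to a seen class
    [0.70, 0.20, 0.10],   # a clear seen-class input
])

raw = calibrated_predict(scores, seen_mask, gamma=0.0)
calibrated = calibrated_predict(scores, seen_mask, gamma=0.1)
h = harmonic_mean(0.8, 0.2)
```

With gamma = 0.1 the first input flips to the unseen class while the confident seen-class prediction is unchanged; larger gamma trades seen accuracy for unseen accuracy.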

Hubness in the embedding space · Medium

In high-dimensional spaces a few classes become "hubs" — the nearest neighbor of a disproportionate fraction of queries — degrading nearest-neighbor classification.

Fix: Embedding normalization, mutual k-NN, mean-centering, post-hoc score rescaling.
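
One way to see hubness, and one possible form of post-hoc score rescaling, is sketched below: count how often each class is the top match, then standardize each class's scores across a batch of queries. The score matrix is synthetic, with class 0 built as a hub, and per-class z-scoring is an assumed choice among several rescaling schemes; it requires access to a batch of test queries.

```python
import numpy as np

def nearest_class_counts(scores):
    """How many queries pick each class as their top match; a hub has a very large count."""
    return np.bincount(np.argmax(scores, axis=1), minlength=scores.shape[1])

def standardize_per_class(scores, eps=1e-12):
    """Z-score each class column over the queries so no class wins just by scoring high everywhere."""
    mu = scores.mean(axis=0, keepdims=True)
    sigma = scores.std(axis=0, keepdims=True)
    return (scores - mu) / (sigma + eps)

# Six queries, two per class; the true class gets +0.5.
base = 0.5 * np.repeat(np.eye(3), 2, axis=0)
# Class 0 receives a constant bonus for every query, which makes it a hub.
scores = base + np.array([0.6, 0.0, 0.0])

raw_counts = nearest_class_counts(scores)
fixed_counts = nearest_class_counts(standardize_per_class(scores))
print("before:", raw_counts, "after:", fixed_counts)
```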

Prompt sensitivity · Medium

In CLIP-style ZSL, small changes to the prompt template ("a photo of a {class}" vs "{class}") shift accuracy by several percentage points.

Fix: Prompt ensembling (averaging embeddings of many templates), prompt learning (CoOp, CoCoOp).
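
Prompt ensembling can be sketched independently of any particular model. In the example below, toy_text_encoder is a hypothetical stand-in that returns a deterministic pseudo-random vector per string; in practice it would be replaced by the text encoder of the model in use. The templates are examples, not a recommended set.

```python
import zlib
import numpy as np

def toy_text_encoder(text, dim=16):
    """Hypothetical stand-in for a real text encoder: one deterministic vector per string."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.normal(size=dim)

def ensemble_class_embeddings(class_names, templates, encode=toy_text_encoder):
    """Per class: embed every filled template, L2-normalize, average, and renormalize."""
    rows = []
    for name in class_names:
        E = np.stack([encode(t.replace("{class}", name)) for t in templates])
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        mean = E.mean(axis=0)
        rows.append(mean / np.linalg.norm(mean))
    return np.stack(rows)                    # (num_classes, dim), one classifier row per class

templates = ["a photo of a {class}", "a close-up photo of a {class}", "{class}"]
class_weights = ensemble_class_embeddings(["okapi", "zebra"], templates)
```

The resulting rows are used exactly like single-prompt class embeddings: normalize the image embedding and take the argmax of its dot product with each row.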

Weak semantic space for rare classes · Medium

Class-name embeddings for rare classes (e.g. obscure species) are poorly learned because the names seldom occur in the text corpus, so ZSL cannot represent these classes well.

Fix: Use textual descriptions instead of bare names (Wikipedia, definitions), hand-crafted attributes, or LLM-generated attributes.

Evolution

Original paper · 2008 · AAAI 2008 · Hugo Larochelle

Zero-Data Learning of New Tasks

Hugo Larochelle, Dumitru Erhan, Yoshua Bengio

2008

Larochelle et al. — "Zero-Data Learning of New Tasks"

Inflection point

First explicit formulation of ZSL — learning new classification tasks without any examples of the target class, via task descriptors.

Zero-Data Learning of New Tasks (paper)

2009

Palatucci et al. — semantic output codes; Lampert et al. — DAP/IAP on Animals with Attributes

Inflection point

Two parallel papers grounding ZSL in attribute-based image classification; AwA becomes the standard benchmark.

Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer (paper)

2013

DeViSE — word embeddings as semantic space

Inflection point

Frome et al. (Google) replace hand-crafted attributes with Word2Vec word embeddings, opening ZSL to ImageNet-scale settings.

DeViSE: A Deep Visual-Semantic Embedding Model (paper)

2017

Xian et al. — Generalized ZSL benchmark (GBU)

Standardization of ZSL evaluation and introduction of the generalized zero-shot protocol, revealing strong bias toward seen classes.

Zero-Shot Learning — A Comprehensive Evaluation of the Good, the Bad and the Ugly (paper)

2020

GPT-3 — zero-shot prompting as a general NLP mechanism

Inflection point

Brown et al. show that large language models perform tasks from prompt instructions alone, without fine-tuning; ZSL moves from vision into mainstream NLP.

Language Models are Few-Shot Learners (paper)

2021

CLIP — contrastive image-text pretraining for zero-shot vision

Inflection point

Radford et al. (OpenAI) establish contrastive pretraining on 400M image-text pairs as the standard for zero-shot classification; zero-shot CLIP matches the accuracy of the original supervised ResNet-50 on ImageNet without using any ImageNet training labels.

Learning Transferable Visual Models From Natural Language Supervision (paper)

2023

Open-vocabulary detection / segmentation / robotics (OWL-ViT, SAM, RT-2)

ZSL extends from classification to detection, segmentation, generation, and robot control; "open-vocabulary" becomes the practical synonym for zero-shot.