Classic vision ZSL:
(1) Each class (seen and unseen) is assigned a semantic vector a_c ∈ ℝ^d: binary attributes, class-name embeddings (Word2Vec, GloVe), or embeddings of descriptive sentences.
(2) A mapping function f: X → ℝ^d from images into the semantic space is trained using only the seen classes S, typically by minimizing a compatibility loss (cosine, ranking) between f(x) and a_y for the ground-truth label y, or by training attribute classifiers directly (Lampert's DAP/IAP).
(3) At inference, a test image x is assigned to the most similar unseen class: ĉ = argmax_{c ∈ U} sim(f(x), a_c).
Modern CLIP-style zero-shot: a contrastive objective trains a joint image-text embedding space on a huge corpus of image-text pairs (Conceptual Captions, LAION, WIT). Zero-shot classification compares the image embedding with the text-prompt embedding of each class; no fine-tuning is needed.
Zero-shot prompting in LLMs: the model is given a task instruction in natural language ("Translate this sentence into French:") and performs it because pretraining on a massive corpus has covered similar patterns. The absence of demonstrations distinguishes zero-shot from few-shot / in-context learning.
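The three classic steps described above (semantic vectors, a mapping trained on seen classes, argmax over unseen classes) can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the attribute matrices are invented, the features are generated by a hidden linear map `P`, and ridge regression stands in for whatever compatibility loss a real method would use.

```python
import numpy as np

def fit_ridge_map(X, A_targets, lam=1.0):
    """Closed-form ridge regression: W such that X @ W approximates the class vectors."""
    d_in = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d_in), X.T @ A_targets)

def l2_normalize(M):
    return M / np.linalg.norm(M, axis=-1, keepdims=True)

def predict_unseen(X, W, A_unseen):
    """For each row of X, return the index of the most cosine-similar unseen class."""
    sims = l2_normalize(X @ W) @ l2_normalize(A_unseen).T
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
d_feat, d_sem = 16, 4

# Step 1: semantic vectors (hypothetical binary attributes) for seen and unseen classes.
A_seen = np.array([[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0]], dtype=float)
A_unseen = np.array([[1, 0, 1, 0], [0, 1, 1, 1]], dtype=float)

# Toy assumption: features are a noisy linear function of the class attributes.
P = rng.normal(size=(d_sem, d_feat))

def sample(A, labels):
    return A[labels] @ P + 0.05 * rng.normal(size=(len(labels), d_feat))

# Step 2: train the mapping f(x) = x @ W using seen classes only.
y_seen = rng.integers(0, len(A_seen), size=200)
X_seen = sample(A_seen, y_seen)
W = fit_ridge_map(X_seen, A_seen[y_seen], lam=1e-2)

# Step 3: classify test samples drawn from classes never seen during training.
y_test = rng.integers(0, len(A_unseen), size=50)
X_test = sample(A_unseen, y_test)
pred = predict_unseen(X_test, W, A_unseen)
accuracy = (pred == y_test).mean()
```

The toy works only because the seen attribute vectors span the whole semantic space, so the learned map generalizes to new attribute combinations; real features rarely satisfy such a clean linear assumption.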
How to recognize classes or perform tasks for which labeled data cannot be collected, because examples are scarce (rare species, rare diseases), because new classes keep emerging (new products, new user intents), or because annotation is prohibitively expensive. ZSL transfers knowledge from seen to unseen classes/tasks via a shared semantic representation.
Shared vector space in which each class can be described independently of labeled images — attributes, word embeddings, embeddings of textual descriptions, or text-encoder outputs.
Vector describing each class c — in classical ZSL handcrafted attributes (Animals with Attributes, CUB); in modern ZSL the embedding of a prompt such as "a photo of a {class}".
Function s(x, c) measuring the fit of input x to class c via similarity in the semantic space (cosine, dot product, ranking) — used during both training and inference.
Network encoding the input (image, audio, text) into a vector comparable with class semantic vectors — ResNet/ViT in CLIP, an LLM encoder for zero-shot NLP.
A very common ZSL pitfall: classes in U appear in the pretraining data (e.g. an ImageNet-pretrained backbone where U ⊂ ImageNet). The backbone has then already seen labeled examples of the supposedly unseen classes, so reported numbers are inflated.
A model trained only on seen classes produces scores that are heavily biased toward them; in GZSL, where the label space is S ∪ U, most test samples from unseen classes are misclassified into S.
In high-dimensional spaces a few classes become "hubs" — the nearest neighbor of a disproportionate fraction of queries — degrading nearest-neighbor classification.
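One common way to check for hubness is to count how often each class appears among the k nearest classes of the queries (the k-occurrence count). The sketch below is a minimal illustration using cosine similarity; the function name `k_occurrence` and the random data are assumptions, not part of any particular library.

```python
import numpy as np

def k_occurrence(queries, class_bank, k=1):
    """N_k[c]: number of queries that have class c among their k nearest classes (cosine)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = class_bank / np.linalg.norm(class_bank, axis=1, keepdims=True)
    sims = q @ c.T                                # (num_queries, num_classes)
    topk = np.argsort(-sims, axis=1)[:, :k]       # indices of the k most similar classes
    return np.bincount(topk.ravel(), minlength=len(class_bank))

rng = np.random.default_rng(0)
queries = rng.normal(size=(1000, 64))     # stand-in for embedded test inputs
class_bank = rng.normal(size=(20, 64))    # stand-in for class semantic vectors
counts = k_occurrence(queries, class_bank, k=1)
hub_share = counts.max() / counts.sum()   # fraction of queries captured by the largest hub
```

A strongly skewed `counts` distribution, with a few classes collecting most of the queries, is the symptom described above.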
In CLIP-style ZSL, small changes to the prompt template ("a photo of a {class}" vs "{class}") shift accuracy by several percentage points.
Class-name embeddings for rare classes (e.g. obscure species) are poorly estimated because the names occur rarely in the text corpus, so ZSL cannot represent these classes well.
First explicit formulation of ZSL — learning new classification tasks without any examples of the target class, via task descriptors.
Two parallel papers grounding ZSL in attribute-based image classification; AwA becomes the standard benchmark.
Frome et al. (Google) replace hand-crafted attributes with Word2Vec word embeddings, opening ZSL to ImageNet-scale settings.
Standardization of ZSL evaluation and introduction of the generalized zero-shot protocol, revealing strong bias toward seen classes.
Brown et al. show that large language models perform tasks without fine-tuning, from prompt instructions alone; ZSL moves from vision into mainstream NLP.
Radford et al. (OpenAI) establish contrastive pretraining on 400M image-text pairs as the standard for zero-shot classification; zero-shot CLIP matches the accuracy of a supervised ResNet-50 on ImageNet without using a single ImageNet training label.
ZSL extends from classification to detection, segmentation, generation, and robot control — "open-vocabulary" becomes the practical synonym of zero-shot.
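Time complexity: O(|U| · d) per prediction (after embedding computation), or O(|S ∪ U| · d) in the generalized setting. Space complexity: O(|C| · d) for the class-embedding bank, where C is the set of classes whose semantic vectors are stored.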
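To make these bounds concrete, the sketch below counts the multiply-adds and the storage for one class bank. The figures (1000 classes, 512 dimensions, 4-byte floats) are illustrative assumptions, not values taken from the text.

```python
def class_bank_cost(num_classes, dim, bytes_per_value=4):
    """Multiply-adds per prediction and bytes needed to store the class bank."""
    macs = num_classes * dim                        # one dot product per class
    memory_bytes = num_classes * dim * bytes_per_value
    return macs, memory_bytes

macs, memory_bytes = class_bank_cost(num_classes=1000, dim=512)
# macs == 512_000 and memory_bytes == 2_048_000 (about 2 MB in float32)
```

At this scale the similarity step is negligible next to the encoder forward pass, which dominates the cost.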
ZSL does not introduce its own execution paradigm — it inherits one from the host architecture (dense Transformer, conditional MoE, etc.). Similarity-based classification is a dense matmul.
Zero-shot classification itself is a matrix operation (matmul of the image embeddings with the class-embedding bank) that parallelizes well on GPU. Encoder training (e.g. CLIP) is massively parallel over image-text batches.
Choice of class representation: hand-crafted attributes, word embeddings (Word2Vec/GloVe), LLM text embeddings, or Wikipedia descriptions. It fundamentally affects transfer quality.
CLIP-style prompts such as "a photo of a {class}" vs "a satellite image of a {class}". Prompt ensembling raises zero-shot accuracy by several percentage points.
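Prompt ensembling is usually done in embedding space: each class name is inserted into several templates, the resulting text embeddings are averaged, and the mean is renormalized. The sketch below shows that averaging step only. `toy_text_encoder` is a hypothetical stand-in that produces deterministic pseudo-random vectors; a real setup would call an actual text encoder instead, and the class names are invented.

```python
import zlib
import numpy as np

DIM = 32

def toy_text_encoder(text):
    """Hypothetical stand-in for a real text encoder: sums pseudo-random word vectors."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        word_rng = np.random.default_rng(zlib.crc32(word.encode("utf-8")))
        vec += word_rng.normal(size=DIM)
    return vec / np.linalg.norm(vec)

def build_class_bank(class_names, templates, encode=toy_text_encoder):
    """Average the normalized prompt embeddings per class, then renormalize the mean."""
    bank = []
    for name in class_names:
        embs = np.stack([encode(t.replace("{class}", name)) for t in templates])
        mean = embs.mean(axis=0)
        bank.append(mean / np.linalg.norm(mean))
    return np.stack(bank)

templates = ["a photo of a {class}", "a satellite image of a {class}", "{class}"]
class_names = ["forest", "river", "highway"]
bank = build_class_bank(class_names, templates)   # shape (3, DIM), unit-norm rows
```

Because the ensemble is folded into a single vector per class, inference cost stays the same as with one template.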
Bilinear (DeViSE, ALE), ranking, cosine + temperature (CLIP). Affects scaling with the number of classes.
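The three families named above differ mainly in how the score is computed and trained. The following sketch gives one plausible form of each; the function names, the margin of 0.1, and the temperature of 0.07 are illustrative assumptions rather than prescribed values.

```python
import numpy as np

def bilinear_scores(theta, W, class_vectors):
    """s(x, c) = theta(x)^T W a_c, computed for all inputs and classes at once."""
    return theta @ W @ class_vectors.T

def cosine_temperature_logits(image_emb, text_emb, temperature=0.07):
    """Cosine similarity divided by a temperature, as used in contrastive image-text models."""
    i = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return (i @ t.T) / temperature

def ranking_hinge_loss(scores, true_idx, margin=0.1):
    """Sum of hinge penalties for wrong classes scoring within `margin` of the true class."""
    violations = np.maximum(0.0, margin - scores[true_idx] + scores)
    violations[true_idx] = 0.0     # the true class does not compete with itself
    return violations.sum()

scores = np.array([0.9, 0.85, 0.2])            # scores of one input against three classes
loss = ranking_hinge_loss(scores, true_idx=0)  # only class 1 violates the margin
```

All three reduce to a matrix product at inference, so scoring cost grows linearly with the number of classes; the ranking loss additionally sums over every wrong class during training.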
In generalized ZSL it is critical to mitigate the bias toward seen classes (calibrated stacking, softmax calibration).
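Calibrated stacking, for example, subtracts a constant γ from the scores of seen classes before taking the argmax over S ∪ U. The sketch below assumes the scores are already computed; the score values, the mask, and the choice γ = 0.1 are made up for illustration.

```python
import numpy as np

def calibrated_stacking(scores, seen_mask, gamma):
    """Subtract gamma from seen-class scores, then take the argmax over all classes."""
    adjusted = scores - gamma * seen_mask.astype(scores.dtype)
    return adjusted.argmax(axis=1)

# Two inputs scored against three classes; the last class is unseen.
scores = np.array([[0.80, 0.60, 0.75],
                   [0.90, 0.20, 0.30]])
seen_mask = np.array([True, True, False])

uncalibrated = calibrated_stacking(scores, seen_mask, gamma=0.0)  # -> [0, 0]
calibrated = calibrated_stacking(scores, seen_mask, gamma=0.1)    # -> [2, 0]
```

In practice γ is a hyperparameter chosen on held-out data: larger values trade seen-class accuracy for unseen-class accuracy.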
Zero-shot classification is encoder forward + matmul of the embedding with the class bank — operations ideal for tensor cores. CLIP/SigLIP pretraining requires GPU clusters.
Google trains its multimodal foundation models (ALIGN, PaLI, Gemini) on TPU — contrastive pretraining maps well to systolic arrays.
Zero-shot inference with a quantized CLIP model (e.g. via ONNX Runtime or GGML) is feasible on CPUs with AVX2/AVX-512 for smaller models (ViT-B).
The ZSL algorithm itself (compare embedding with class bank) is hardware-agnostic — the encoder choice determines the cost.