The model receives visual inputs (camera images) and language instructions, and produces action tokens (e.g., robot joint positions). The architecture is based on a multimodal transformer trained on observation-instruction-action triplets.
Robots require the integration of visual perception, natural language understanding, and motor action planning in a single system. VLA unifies these three modalities in one model.
Processes raw RGB images from robot cameras into sequences of visual tokens. Typically based on a Vision Transformer (ViT) or a convolutional network. More recent VLA architectures apply feature fusion from multiple visual backbones (e.g., DINOv2 + SigLIP in OpenVLA) to improve both spatial and semantic understanding.
A large language model or vision-language model forming the core of the VLA architecture. It processes a token sequence composed of visual tokens from the encoder, text instruction tokens, and action history tokens, and generates an output sequence consisting of action tokens.
The component responsible for converting the backbone's output representation into concrete robot control signals. In the tokenized approach, action tokens are mapped to discrete action bin values (e.g., 256 bins per dimension). In the continuous approach, a diffusion head or flow-matching head generates continuous action vectors.
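The tokenized approach can be illustrated with a minimal sketch. It assumes each action dimension is clipped to a known range and split into 256 uniform bins; the function names, the [-1, 1] range, and the 7-dimensional example action are illustrative assumptions, not taken from any particular VLA codebase.

```python
import numpy as np

N_BINS = 256  # bins per action dimension, as in the tokenized approach

def discretize(action, low, high, n_bins=N_BINS):
    """Map continuous actions to integer bin indices in [0, n_bins - 1]."""
    action = np.clip(action, low, high)
    # Scale to [0, 1], then to a bin index; the upper edge falls into the last bin.
    scaled = (action - low) / (high - low)
    return np.minimum((scaled * n_bins).astype(np.int64), n_bins - 1)

def undiscretize(bins, low, high, n_bins=N_BINS):
    """Map bin indices back to the centre of each bin."""
    return low + (bins + 0.5) * (high - low) / n_bins

# Hypothetical 7-dimensional action normalised to [-1, 1] (assumed for illustration).
action = np.array([-1.0, -0.5, 0.0, 0.25, 0.5, 0.9, 1.0])
tokens = discretize(action, -1.0, 1.0)
recovered = undiscretize(tokens, -1.0, 1.0)
```

Decoding to bin centres keeps the round-trip error within half a bin width, which is the quantization error that the tokenized approach accepts in exchange for reusing the language model's token vocabulary.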
A linear layer or MLP that maps the output dimension of the visual encoder to the token space dimension of the language backbone (d_model). This enables visual tokens to be integrated with text tokens into a single sequence processed by the LLM.
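A minimal sketch of the projection step, using NumPy with random weights. All sizes (16 patches, a 32-dimensional encoder output, d_model of 64, 5 text tokens) are assumptions chosen to keep the example small; a real projector is trained and may be an MLP rather than a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from a specific model.
n_patches, d_vision = 16, 32   # visual encoder output: one feature per image patch
n_text, d_model = 5, 64        # text tokens already embedded in the LLM's space

visual_features = rng.normal(size=(n_patches, d_vision))
text_embeddings = rng.normal(size=(n_text, d_model))

# Projector as a single linear layer: d_vision -> d_model.
W = rng.normal(scale=d_vision ** -0.5, size=(d_vision, d_model))
b = np.zeros(d_model)
visual_tokens = visual_features @ W + b

# Projected visual tokens and text tokens now share a width and can be
# concatenated into one input sequence for the language backbone.
sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)
```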
VLA models built on large LLMs generate actions at 1–6 Hz, which is insufficient for tasks requiring smooth manipulation — such as folding, screwing, or assembly — that typically demand >50 Hz. This low frequency leads to oscillations, latency, and motion instability.
VLA models trained on demonstrations collected under specific conditions (lighting, background, camera, robot configuration) generalize poorly to new environments. Changing the camera, viewing angle, background, or robot platform can drastically reduce performance.
Discretizing the action space into 256 bins (as in RT-2 and OpenVLA) introduces quantization error, which is especially noticeable in tasks requiring sub-millimeter precision. Converting continuous trajectories into tokens can lose important motor details.
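The size of this error follows from simple arithmetic. Assuming uniform bins and decoding to the bin centre, the worst-case error is half a bin width, that is, the value range divided by twice the number of bins. The ranges below are hypothetical and chosen only to show the scale.

```python
def max_quantization_error(value_range, n_bins=256):
    """Worst-case error when a value is replaced by the centre of its uniform bin."""
    return value_range / n_bins / 2

# Hypothetical per-dimension position ranges in metres (assumed for illustration).
for value_range in (0.02, 0.2, 1.0):
    err_mm = max_quantization_error(value_range) * 1000
    print(f"range {value_range:.2f} m -> max error {err_mm:.3f} mm")
```

Under these assumptions a 0.2 m range already gives a worst-case error of about 0.39 mm, so sub-millimeter tasks leave little margin unless the range is narrowed or the action is predicted continuously.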
When a VLM is fine-tuned on robotics data alone, it may lose the general language and visual capabilities acquired during pretraining (catastrophic forgetting). RT-2 addresses this by co-fine-tuning on robotics and internet data simultaneously; omitting the internet data from the mixture degrades these general capabilities.
Models in the 7B–55B range require A100-class GPUs (40–80 GB VRAM) or an external GPU server. Direct deployment on resource-constrained robot hardware (Jetson Orin, CPU) is not feasible without quantization or distillation.
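A rough estimate of weight memory shows why quantization matters here. The sketch counts weights only (activations, KV cache, and framework overhead are ignored), and the bytes-per-parameter table is a simplifying assumption that neglects quantization metadata such as scales.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for n_params in (7e9, 55e9):
    for dtype in ("bf16", "int8", "int4"):
        print(f"{n_params / 1e9:.0f}B params, {dtype}: "
              f"{weight_memory_gb(n_params, dtype):.1f} GB")
```

By this estimate a 7B model needs about 14 GB for bf16 weights and about 3.5 GB at 4 bits, while a 55B model exceeds a single 80 GB GPU in bf16.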
Brohan et al. (Google) publish RT-1, a Transformer conditioned on text instructions and trained on 130k robot demonstrations. It is the first large-scale model combining vision, language, and robot control, but it is not pretrained on internet-scale data.
Zitkovich, Brohan et al. (Google DeepMind) formalized the VLA paradigm by co-fine-tuning PaLI-X and PaLM-E on robotic and internet-scale tasks. Actions are encoded as text tokens. The paper coined the term "vision-language-action model" and demonstrated emergent reasoning on novel tasks without additional training data.
The collaboration of 21 institutions produced Open X-Embodiment — a dataset of ~1M trajectories from 22 robot types. It enables VLA training across diverse embodiments and tasks, and serves as a foundational resource for RT-X and OpenVLA.
Kim et al. (Stanford) publish OpenVLA — an open-source 7B VLA built on LLaMA 2 + DINOv2 + SigLIP, trained on 970k trajectories from Open X-Embodiment. It outperforms the closed RT-2-X (55B) while using 7× fewer parameters. It is the first open platform for VLA research with PEFT and quantization support.
Black et al. (Physical Intelligence) publish π0, a VLA with a PaliGemma backbone (built on the Gemma-2B language model) and a flow-matching action head in place of discrete tokens, achieving higher motor precision on dexterity-demanding tasks such as folding clothes and washing dishes.
Dual-model architecture: a slower VLM acts as a high-level planner, paired with a fast action-generation module for high-frequency control. Figure AI (Helix) and NVIDIA (GR00T N1) demonstrate dual-system VLAs for humanoids operating in real time.
Typical VLA models (7B–55B parameters) generate action tokens at 1–6 Hz on A100/RTX 4090-class GPUs, which is insufficient for tasks requiring high-frequency control (e.g., bimanual manipulation at >50 Hz). Running a large model on the robot's limited onboard compute slows inference further, while routing inference through a network link to a GPU server adds round-trip latency.
Selection of a pretrained VLM as the backbone of a VLA. This choice determines reasoning capabilities, model size, and hardware requirements.
The method by which the model encodes robot actions: discrete tokens (bins) or continuous output (diffusion, flow matching).
The ratio of robotic data (demonstration trajectories) to internet data (vision-language tasks). This affects the balance between language understanding and motor generalization.
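One simple way to realise such a ratio is to pick the source of each training example at random with a fixed probability. The sketch below assumes this per-example sampling scheme; the function name, the placeholder datasets, and the 0.5 ratio are hypothetical and do not reflect the mixture used by any specific model.

```python
import random

def sample_mixed_batch(robot_data, web_data, robot_fraction, batch_size, rng):
    """Draw a batch in which roughly `robot_fraction` of examples are robot trajectories."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if rng.random() < robot_fraction else web_data
        batch.append(rng.choice(source))
    return batch

# Placeholder datasets; real entries would be trajectories and image-text pairs.
robot_data = [("robot", i) for i in range(100)]
web_data = [("web", i) for i in range(100)]

rng = random.Random(0)
batch = sample_mixed_batch(robot_data, web_data, robot_fraction=0.5,
                           batch_size=1000, rng=rng)
robot_share = sum(1 for kind, _ in batch if kind == "robot") / len(batch)
```

Raising `robot_fraction` shifts training toward motor data, and lowering it shifts training toward vision-language data.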
The frequency at which a VLA generates and executes actions. It is constrained by the model's inference speed and system architecture (single-model vs. dual-system).
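The link between inference speed and control frequency can be sketched with a simple latency model: one prefill pass over the observation and instruction, followed by one decoding step per action token. The timings below are assumptions picked for illustration, not measurements of any model.

```python
def control_frequency_hz(prefill_s, per_token_s, tokens_per_action):
    """Actions per second when each action needs one prefill plus sequential decoding."""
    latency_s = prefill_s + per_token_s * tokens_per_action
    return 1.0 / latency_s

# Assumed timings, chosen only for illustration: 7 action tokens (one per
# action dimension), 60 ms prefill, 15 ms per decoded token.
hz = control_frequency_hz(prefill_s=0.060, per_token_s=0.015, tokens_per_action=7)
period_budget_50hz_s = 1.0 / 50  # time available per action at 50 Hz
```

With these assumed numbers the loop runs at roughly 6 Hz, whereas a 50 Hz controller leaves only 20 ms per action.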
Whether a VLA operates as a single end-to-end model or a dual-system architecture with separate planning and execution components.
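The dual-system idea can be sketched as a fast control loop that refreshes its high-level plan only every few ticks. This is a simplified, synchronous illustration: `run_dual_system`, `plan_fn`, and `control_fn` are hypothetical names, and real systems typically run the two components asynchronously on separate hardware or threads.

```python
def run_dual_system(n_ticks, planner_every, plan_fn, control_fn):
    """Run a fast control loop that refreshes its plan every `planner_every` ticks."""
    plan, log = None, []
    for tick in range(n_ticks):
        if tick % planner_every == 0:
            plan = plan_fn(tick)              # slow path: VLM-style planner
        log.append(control_fn(plan, tick))    # fast path: runs on every tick
    return log

# Hypothetical stand-ins: the planner emits a target, the controller tracks it.
plan_fn = lambda tick: {"target": tick // 10}
control_fn = lambda plan, tick: (tick, plan["target"])

log = run_dual_system(n_ticks=30, planner_every=10,
                      plan_fn=plan_fn, control_fn=control_fn)
```

In a single end-to-end model, by contrast, every control tick would wait for the full model, so the control rate could not exceed the model's inference rate.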
Standard VLA models (RT-2, OpenVLA) use a dense Transformer backbone that processes all tokens — visual, language, and action — through every layer. There is no routing or sparse activation, in contrast to MoE-VLA variants proposed in later work.
Training is parallel across tokens: with teacher forcing, a full sequence of visual, language, and action tokens is processed in a single pass. At inference, action tokens are generated sequentially, one at a time, but the visual and language inputs are processed in parallel during prefill.
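The difference between the two regimes can be shown with a toy example. `toy_next_token` is a hypothetical stand-in for a model forward pass, used only to make the data dependencies visible; it is not a real model.

```python
def toy_next_token(prefix):
    """Stand-in for a model forward pass (hypothetical): derives a token id from the prefix."""
    return int(sum(prefix) % 256)

def training_pairs(sequence):
    """Teacher forcing: every position's input prefix and target is known up front."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

def generate_actions(prompt_tokens, n_action_tokens):
    """Inference: each new action token depends on the ones generated before it."""
    tokens = list(prompt_tokens)
    for _ in range(n_action_tokens):
        tokens.append(toy_next_token(tokens))
    return tokens[len(prompt_tokens):]

pairs = training_pairs([1, 2, 3, 4])
actions = generate_actions([1, 2, 3], n_action_tokens=2)
```

The training pairs do not depend on each other, which is why a Transformer with a causal mask can score all of them in one forward pass. The generation loop cannot be parallelised in the same way, because each step consumes the previous step's output.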
VLA models based on large LLMs (7B–55B parameters) require GPUs with tensor cores for efficient inference. Training demands A100/H100-class GPU clusters (OpenVLA: 64×A100 for 14 days). Real-time inference on a robot requires at minimum an RTX 4090 (6 Hz for a 7B model).
Google DeepMind trained RT-2 (PaLM-E and PaLI-X backbones) on TPUs. TPU v4/v5 accelerators efficiently handle the LLM matrix operations in VLA models.