Robotics

VLA (Vision-Language-Action model)

Key innovation

Extends pretrained vision-language models (VLMs) with the ability to directly generate robot action tokens via joint fine-tuning on internet data and robot trajectories, enabling knowledge transfer from the web to physical control without separate planning or control modules.

How it works

The model receives visual inputs (camera images) and language instructions, and produces action tokens (e.g., robot joint positions). The architecture is based on a multimodal transformer trained on observation-instruction-action triplets.

Problem solved

Robots require the integration of visual perception, natural language understanding, and motor action planning in a single system. VLA unifies these three modalities in one model.

Components

Visual encoder: visual input tokenization, converting observation images into vector representations compatible with the language backbone.

Processes raw RGB images from robot cameras into sequences of visual tokens. Typically based on a Vision Transformer (ViT) or a convolutional network. More recent VLA architectures apply feature fusion from multiple visual backbones (e.g., DINOv2 + SigLIP in OpenVLA) to improve both spatial and semantic understanding.

ViT with CLIP

DINOv2 + SigLIP (fusion)

EfficientNet + FiLM
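The feature fusion mentioned for the visual encoder can be pictured as a per-patch concatenation of two backbones' outputs. The sketch below is a toy illustration only: it assumes both encoders produce features on the same patch grid, and the function name, feature sizes, and values are hypothetical rather than taken from any specific model.

```python
from typing import List

Patch = List[float]

def fuse_patch_features(feats_a: List[Patch], feats_b: List[Patch]) -> List[Patch]:
    """Concatenate two encoders' features patch by patch (channel-wise)."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both encoders must produce the same number of patches")
    return [a + b for a, b in zip(feats_a, feats_b)]

# Toy stand-ins: 4 patches, 3-dim "spatial" and 2-dim "semantic" features.
spatial = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(4)]
semantic = [[1.0 * i, -1.0 * i] for i in range(4)]

fused = fuse_patch_features(spatial, semantic)
print(len(fused), len(fused[0]))  # 4 patches, 3 + 2 = 5 channels each
```

The fused patches would then pass through the projector before reaching the language backbone.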


Language backbone (LLM/VLM): instruction understanding, contextual reasoning, and action token generation from visual and language inputs.

A large language model or vision-language model forming the core of the VLA architecture. It processes a token sequence composed of visual tokens from the encoder, text instruction tokens, and action history tokens, and generates an output sequence consisting of action tokens.

PaLM-E (12B)

PaLI-X (5B/55B)

LLaMA 2 (7B)

Gemma-2B


Action decoder / action output head: converts language model outputs into executable robot control signals (velocities, positions, torques).

The component responsible for converting the backbone's output representation into concrete robot control signals. In the tokenized approach, action tokens are mapped to discrete action bin values (e.g., 256 bins per dimension). In the continuous approach, a diffusion head or flow-matching head generates continuous action vectors.

Discrete action tokens

Diffusion / flow-matching head

MLP Head
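For the tokenized approach described above, a minimal sketch of mapping a continuous action value to one of 256 bins and back may help. The range [-1, 1], the uniform bin edges, and the function names are assumptions for illustration; real systems choose the range per action dimension from their own data.

```python
def discretize(value: float, low: float, high: float, n_bins: int = 256) -> int:
    """Map a continuous action value to a bin index in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # value == high falls into the last bin

def undiscretize(idx: int, low: float, high: float, n_bins: int = 256) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width

action = [0.02, -0.5, 1.0]  # hypothetical 3-D action, each dimension in [-1, 1]
tokens = [discretize(a, -1.0, 1.0) for a in action]
decoded = [undiscretize(t, -1.0, 1.0) for t in tokens]
print(tokens)   # bin indices that the backbone would emit as action tokens
print(decoded)  # bin centers that would be sent to the robot controller
```

Decoding to the bin center keeps the round-trip error within half a bin width.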


Vision-language projector: aligns the visual and language feature spaces so that the LLM can process visual embeddings alongside text tokens.

A linear layer or MLP that maps the visual encoder's output features to the token embedding dimension of the language backbone (d_model). This allows visual tokens to be concatenated with text tokens into a single sequence processed by the LLM.
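As a rough illustration of the projector, the sketch below applies a single linear layer to each visual patch feature and then concatenates the result with text embeddings. It uses plain Python lists instead of a tensor library; the sizes, random weights, and names are all hypothetical.

```python
import random
from typing import List

Vector = List[float]

def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """y = W x + b, with weight stored as d_model rows of d_vision columns."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

d_vision, d_model = 6, 4  # toy sizes; real models use hundreds to thousands
rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_vision)] for _ in range(d_model)]
bias = [0.0] * d_model

visual_feats = [[rng.random() for _ in range(d_vision)] for _ in range(3)]  # 3 patches
text_embeds = [[rng.random() for _ in range(d_model)] for _ in range(5)]    # 5 text tokens

# Project every patch to d_model, then build one sequence for the backbone.
visual_tokens = [linear(p, weight, bias) for p in visual_feats]
sequence = visual_tokens + text_embeds
print(len(sequence), len(sequence[0]))  # 8 tokens, each of size d_model
```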


Implementation

Reference implementations

OpenVLA (Stanford)

Python · Stanford / Moo Jin Kim et al. · Official implementation

LeRobot (Hugging Face)

Python · Hugging Face

SmolVLA (Hugging Face)

Python · Hugging Face

Implementation pitfalls

Control frequency too low for precision tasks · High

VLA models built on large LLMs generate actions at 1–6 Hz, which is insufficient for tasks requiring smooth manipulation — such as folding, screwing, or assembly — that typically demand >50 Hz. This low frequency leads to oscillations, latency, and motion instability.

Fix: Use a dual-system architecture with a fast action module (flow-matching, diffusion). Implement action chunking: the model generates N steps ahead and executes them sequentially without additional LLM queries.
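Action chunking can be sketched as a control loop in which one slow model call returns several actions that are then executed without further queries. The policy below is a dummy stand-in, and the horizon of 8 and the function names are assumptions for illustration. The loop also assumes the policy never returns an empty chunk.

```python
from typing import Callable, List

Action = List[float]

def run_with_chunking(policy: Callable[[int], List[Action]],
                      execute: Callable[[Action], None],
                      total_steps: int) -> int:
    """Run total_steps control steps; return how many times the policy was queried."""
    queries = 0
    step = 0
    while step < total_steps:
        chunk = policy(step)   # one slow model call yields several actions
        queries += 1
        for action in chunk:
            if step >= total_steps:
                break
            execute(action)    # fast inner loop, no model call
            step += 1
    return queries

def dummy_policy(step: int, horizon: int = 8) -> List[Action]:
    """Hypothetical policy: returns `horizon` one-dimensional actions."""
    return [[0.01 * (step + k)] for k in range(horizon)]

executed: List[Action] = []
n_queries = run_with_chunking(dummy_policy, executed.append, total_steps=50)
print(n_queries, len(executed))  # 7 50
```

With a horizon of 8, 50 control steps need only 7 model queries, which is the reason chunking raises the effective control rate.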

Training-Deployment Distribution Shift · High

VLA models trained on demonstrations collected under specific conditions (lighting, background, camera, robot configuration) generalize poorly to new environments. Changing the camera, viewing angle, background, or robot platform can drastically reduce performance.

Fix: Collect training data with visual augmentation (lighting, background, and viewpoint variation). Apply PEFT (LoRA) for rapid fine-tuning to new environments with minimal demonstrations. Use multi-embodiment datasets (Open X-Embodiment) to improve generalization.
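The LoRA idea referenced in the fix can be sketched as a frozen weight matrix W plus a trainable low-rank update B·A, scaled by alpha / rank. The sketch below uses plain Python lists with toy matrices; the layer size 4096 × 4096 and rank 32 in the parameter count are hypothetical, not values from a specific model.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]

def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for a row-major list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix,
                 alpha: float, rank: int) -> Vector:
    """y = W x + (alpha / rank) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of parameters trained by LoRA relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_out * d_in)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (toy 2x2 identity)
A = [[0.5, 0.5]]               # rank-1 down-projection, shape (1, 2)
B = [[0.0], [0.0]]             # zero-initialised up-projection, shape (2, 1)

y = lora_forward([2.0, 3.0], W, A, B, alpha=8.0, rank=1)
print(y)  # B is zero at initialisation, so the output equals W x

frac = trainable_fraction(4096, 4096, 32)
print(f"{frac:.4%} of the layer's parameters are trainable")
```

Because B starts at zero, fine-tuning begins from the pretrained model's exact behavior and only a small fraction of parameters is updated.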

Action Discretization Limits: Loss of Precision · Medium

Discretizing each action dimension into 256 bins (as in RT-2 and OpenVLA) introduces quantization error, which is especially noticeable in tasks requiring sub-millimeter precision. Converting continuous trajectories into tokens can lose important motor details.

Fix: Use continuous action decoding via diffusion or flow-matching instead of discrete tokens for precision-demanding tasks. Alternatively, increase the number of bins or apply adaptive discretization.
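The effect of the bin count on precision can be estimated with simple arithmetic: with uniform bins decoded to their centers, the worst-case error is half a bin width. The 100 mm range below is a hypothetical example, not a figure from RT-2 or OpenVLA.

```python
def max_quantization_error(low: float, high: float, n_bins: int) -> float:
    """Worst-case error when a value is decoded to the center of a uniform bin."""
    return (high - low) / n_bins / 2

# Hypothetical position dimension spanning 100 mm.
errors_mm = {n: max_quantization_error(-50.0, 50.0, n) for n in (256, 1024, 4096)}
for n_bins, err in errors_mm.items():
    print(f"{n_bins:5d} bins -> worst-case error {err:.4f} mm")
```

Under these assumptions 256 bins give a worst-case error of roughly 0.2 mm, and each fourfold increase in bins cuts the error by a factor of four.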

Trade-off Between Catastrophic Forgetting and Knowledge Transfer · High

When fine-tuning a VLM on robotics data, the model may lose the general language and visual capabilities of the pretrained VLM (catastrophic forgetting). RT-2 addresses this through co-fine-tuning on robotics and internet data simultaneously — omitting this mixture degrades the model.

Fix: Apply co-fine-tuning with robotic and internet data mixed in appropriate proportions. With PEFT (LoRA), the pretrained backbone weights stay frozen, which helps preserve VLM knowledge while the adapters learn action generation.
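Co-fine-tuning amounts to drawing each training batch from both data sources at a chosen ratio. The sampler below is a minimal sketch: the 50/50 ratio, the placeholder datasets, and the function name are assumptions for illustration, not the proportions used by RT-2.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]

def sample_mixed_batch(robot_data: Sequence[Example],
                       web_data: Sequence[Example],
                       batch_size: int,
                       robot_fraction: float,
                       rng: random.Random) -> List[Example]:
    """Draw a batch where each example comes from robot data with prob. robot_fraction."""
    batch = []
    for _ in range(batch_size):
        source = robot_data if rng.random() < robot_fraction else web_data
        batch.append(rng.choice(source))
    return batch

robot_data = [("robot", i) for i in range(100)]  # placeholder trajectories
web_data = [("web", i) for i in range(100)]      # placeholder vision-language examples

rng = random.Random(0)
batch = sample_mixed_batch(robot_data, web_data, batch_size=1000,
                           robot_fraction=0.5, rng=rng)
share = sum(1 for kind, _ in batch if kind == "robot") / len(batch)
print(f"robot share in batch: {share:.2f}")
```

Sampling per example rather than per batch keeps both data types present in every gradient step.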

Hardware requirements preventing on-robot deployment · High

Models in the 7B–55B range require high-end GPUs: a 7B model can run on a single workstation GPU, while the larger variants need A100-class hardware (40–80 GB VRAM) or an external GPU server. Direct deployment on resource-constrained robot hardware (Jetson Orin, CPU) is not feasible without quantization or distillation.

Fix: Use INT4/INT8 quantization (OpenVLA reports that 4-bit quantization matched full-precision task success in its evaluation). Train smaller models (SmolVLA 450M). Apply a dual-system architecture with a lightweight action module deployed on-robot and a heavy VLM on a remote server.
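A back-of-the-envelope estimate shows why quantization and smaller models matter. The sketch below counts weight storage only; activations, the KV cache, and framework overhead are ignored, so real memory use will be higher. The function name and the dtype table are illustrative.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

models = {"450M": 0.45e9, "7B": 7e9, "55B": 55e9}
for name, n_params in models.items():
    row = ", ".join(f"{dtype}: {weight_memory_gb(n_params, dtype):6.2f} GB"
                    for dtype in ("bf16", "int8", "int4"))
    print(f"{name:>5} -> {row}")
```

Under this estimate a 7B model drops from about 14 GB in bf16 to about 3.5 GB in INT4, while a 55B model needs about 110 GB in bf16 for weights alone.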

Evolution

Original paper · 2023 · CoRL 2023 (Conference on Robot Learning, PMLR 229) · Anthony Brohan

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Danny Driess, Chelsea Finn, Karol Hausman, Brian Ichter, Sergey Levine, Igor Mordatch, Karl Pertsch, Pierre Sermanet, Ted Xiao, Tianhe Yu, Brianna Zitkovich

2022

RT-1 — Robotics Transformer for real-time control

Inflection point

Brohan et al. (Google) publish RT-1, a Transformer trained on 130k robot demonstrations and conditioned on text instructions. It is the first large-scale model combining vision, language, and robot control, but without pretraining on internet-scale data.

RT-1: Robotics Transformer for Real-World Control at Scale (paper)

2023

RT-2 — first VLA model transferring web knowledge to robot control

Inflection point

Zitkovich, Brohan et al. (Google DeepMind) formalized the VLA paradigm by co-fine-tuning PaLI-X and PaLM-E on robotic and internet-scale tasks. Actions are encoded as text tokens. The paper coined the term "vision-language-action model" and demonstrated emergent reasoning on novel tasks without additional training data.

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (paper)

2023

Open X-Embodiment — multi-platform robotics dataset

Inflection point

The collaboration of 21 institutions produced Open X-Embodiment — a dataset of ~1M trajectories from 22 robot types. It enables VLA training across diverse embodiments and tasks, and serves as a foundational resource for RT-X and OpenVLA.

Open X-Embodiment: Robotic Learning Datasets and RT-X Models (paper)

2024

OpenVLA — open-source 7B-parameter VLA

Inflection point

Kim et al. (Stanford) publish OpenVLA — an open-source 7B VLA built on LLaMA 2 + DINOv2 + SigLIP, trained on 970k trajectories from Open X-Embodiment. It outperforms the closed RT-2-X (55B) while using 7× fewer parameters. It is the first open platform for VLA research with PEFT and quantization support.

OpenVLA: An Open-Source Vision-Language-Action Model (paper)

2024

π0 (Physical Intelligence): VLA with continuous flow-matching output

Inflection point

Black et al. (Physical Intelligence) publish π0 — a VLA with a Gemma-2B backbone and a flow-matching action head in place of discrete tokens, achieving higher motor precision on dexterity-demanding tasks such as folding clothes and washing dishes.

π0: A Vision-Language-Action Flow Model for General Robot Control (paper)

2025

Dual-system VLA: Helix (Figure AI) and GR00T N1 (NVIDIA)

Dual-model architecture: a slower VLM acts as a high-level planner, paired with a fast action-generation module for high-frequency control. Figure AI (Helix) and NVIDIA (GR00T N1) demonstrate dual-system VLAs for humanoids operating in real time.

Hyperparameters (configurable axes)

Vision-Language Backbone · Critical

Selection of a pretrained VLM as the backbone of a VLA. This choice determines reasoning capabilities, model size, and hardware requirements.

PaLM-E 12B: RT-2.

LLaMA 2 7B: OpenVLA.

Gemma-2B: π0 (Physical Intelligence).

Output Action Representation · Critical

The method by which the model encodes robot actions: discrete tokens (bins) or continuous output (diffusion, flow matching).

discrete_tokens_256bins: RT-2, OpenVLA. Simple and compatible with the LLM tokenizer.

continuous_diffusion: π0 (which uses flow matching). Higher motor precision, higher latency.

Training data mixture · High

The ratio of robotic data (demonstration trajectories) to internet data (vision-language tasks). This affects the balance between language understanding and motor generalization.

co-fine-tuning (robot + VQA + OKVQA + caption): RT-2, joint fine-tuning on web and robotics tasks.

970k robot trajectories (Open X-Embodiment): OpenVLA, robotics-only data following VLM pretraining.

Control Frequency · High

The frequency at which a VLA generates and executes actions. It is constrained by the model's inference speed and system architecture (single-model vs. dual-system).

1–6 Hz: typical range for a single-model 7B VLA (OpenVLA on RTX 4090).

50+ Hz: required for precise manipulation; achievable via dual-system VLA (Helix, GR00T N1).

System Architecture Type · High

Whether a VLA operates as a single end-to-end model or a dual-system architecture with separate planning and execution components.

single-model: RT-2, OpenVLA, π0. A single model handles both perception and action, which keeps the system simple.

dual-system (slow VLM + fast action module): Helix (Figure AI), GR00T N1 (NVIDIA). Improved precision and higher control frequency.
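The dual-system option can be pictured as two loops running at different rates: a slow model refreshes a latent goal occasionally, and a fast module emits an action on every control tick using the latest latent. The sketch below is a synchronous simplification with dummy stand-ins; real systems typically run the two parts asynchronously, and the rates, function names, and latent representation here are hypothetical.

```python
from typing import List, Tuple

def slow_vlm(tick: int) -> float:
    """Stand-in for the heavy VLM: returns a 'latent goal' (here just a number)."""
    return float(tick)

def fast_policy(latent: float, tick: int) -> float:
    """Stand-in for the lightweight action module, conditioned on the latest latent."""
    return 0.1 * (tick - latent)

def run_dual_system(slow_hz: int, fast_hz: int, duration_s: float) -> Tuple[int, List[float]]:
    """Simulate both loops on one clock; return slow-call count and all actions."""
    if fast_hz % slow_hz != 0:
        raise ValueError("this sketch assumes fast_hz is a multiple of slow_hz")
    period = fast_hz // slow_hz  # fast ticks between two slow updates
    latent = 0.0
    slow_calls = 0
    actions: List[float] = []
    for tick in range(int(duration_s * fast_hz)):
        if tick % period == 0:
            latent = slow_vlm(tick)  # infrequent, expensive update
            slow_calls += 1
        actions.append(fast_policy(latent, tick))  # cheap, every tick
    return slow_calls, actions

slow_calls, actions = run_dual_system(slow_hz=5, fast_hz=50, duration_s=2.0)
print(slow_calls, len(actions))  # 10 100
```

In this toy run the slow model is called 10 times while 100 actions are produced, which is how a dual-system design decouples control frequency from VLM inference speed.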
