GR00T N1.5

N1.5 (3B) · Family: GR00T
NVIDIA open foundation model for humanoid robots and the successor to GR00T N1. Flow matching transformer architecture with pre-trained SigLip2 (vision) and T5 (language) encoders.
✓ Active · ⏳ Limited access · ⚖ Open weights · Robotics foundation model · Vision-Language-Action model · 📁 GR00T
Parameters: 3B
Access: Download
Deployment: 💻 Local, 📱 On-device

Overview

NVIDIA Isaac GR00T N1.5 is an open foundation model for generalized humanoid robot reasoning and skills. The cross-embodiment model takes multimodal input (vision and language) and outputs continuous control actions. It is the 3B-parameter successor to GR00T N1 (2B).

Architecture

GR00T N1.5 uses a pre-trained Vision Transformer (SigLip2) to encode robot camera frames and a pre-trained Transformer (T5) to encode text instructions. Proprioception is encoded by an MLP indexed by embodiment ID, with padding to a configured max length to handle variable-dimension proprioceptive vectors.
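
The padding and per-embodiment lookup described above can be illustrated with a minimal pure-Python sketch. It is not NVIDIA's implementation: the dimensions, the embodiment names, the ReLU activation, and the names `pad_state` and `EmbodimentMLP` are assumptions chosen for illustration, and the weights are random.

```python
import random

MAX_STATE_DIM = 8   # assumed padding length (hypothetical value)
HIDDEN_DIM = 16     # assumed hidden width
EMBED_DIM = 4       # assumed output embedding size


def pad_state(state, max_dim=MAX_STATE_DIM):
    """Right-pad a proprioceptive vector with zeros to a fixed length."""
    if len(state) > max_dim:
        raise ValueError(f"state has {len(state)} dims, max is {max_dim}")
    return list(state) + [0.0] * (max_dim - len(state))


def make_linear(rng, n_in, n_out):
    """Random (n_out x n_in) weight matrix and zero bias, for illustration only."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply y = W x + b."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class EmbodimentMLP:
    """One two-layer MLP per embodiment ID, selected by dictionary lookup."""

    def __init__(self, embodiment_ids, seed=0):
        rng = random.Random(seed)
        self.params = {
            eid: (make_linear(rng, MAX_STATE_DIM, HIDDEN_DIM),
                  make_linear(rng, HIDDEN_DIM, EMBED_DIM))
            for eid in embodiment_ids
        }

    def encode(self, embodiment_id, state):
        first, second = self.params[embodiment_id]
        hidden = [max(0.0, h) for h in linear(pad_state(state), first)]  # ReLU
        return linear(hidden, second)


# Hypothetical embodiments with different state sizes.
encoder = EmbodimentMLP(["single_arm", "humanoid"])
arm_embedding = encoder.encode("single_arm", [0.1, -0.2, 0.3])
humanoid_embedding = encoder.encode("humanoid", [0.0] * 6 + [0.5, -0.5])
print(len(arm_embedding), len(humanoid_embedding))
```

Both robots produce an embedding of the same size even though their state vectors differ in length, which is what lets a single downstream transformer consume them.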

The action sequence is modeled by a flow matching transformer (implemented as a Diffusion Transformer / DiT with adaptive layernorm conditioning). The transformer interleaves self-attention over proprioception and actions with cross-attention to vision and language embeddings. Compared to N1, the MLP connector between vision-language features and the DiT was modified, and the model was trained jointly with flow matching and world-modeling objectives.
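
As a rough illustration of adaptive layernorm conditioning, the sketch below normalizes a token and then applies a scale and shift predicted from a conditioning vector. The page does not say what the conditioning signal is in GR00T N1.5, so the timestep embedding here is an assumption, as are the names `adaptive_layer_norm` and `timestep_embedding` and the bias-free projections.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization without learned gain or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def project(weights, cond):
    """Bias-free linear projection of the conditioning vector."""
    return [sum(w * c for w, c in zip(row, cond)) for row in weights]


def adaptive_layer_norm(x, cond, w_scale, w_shift):
    """Normalize x, then modulate it with a scale and shift predicted from cond."""
    scale = project(w_scale, cond)
    shift = project(w_shift, cond)
    return [n * (1.0 + s) + b
            for n, s, b in zip(layer_norm(x), scale, shift)]


def timestep_embedding(t):
    """Hypothetical 2-d conditioning vector for a flow time t in [0, 1]."""
    return [math.sin(math.pi * t), math.cos(math.pi * t)]


tokens = [1.0, 2.0, 4.0]
cond = timestep_embedding(0.5)
zero = [[0.0, 0.0] for _ in tokens]        # projection that predicts nothing
shift_only = [[0.5, 0.0] for _ in tokens]  # projection that predicts a shift

plain = adaptive_layer_norm(tokens, cond, zero, zero)
shifted = adaptive_layer_norm(tokens, cond, zero, shift_only)
print(plain)
print(shifted)
```

With zero projection weights the layer reduces to ordinary layer normalization, so the conditioning only changes the output through the predicted scale and shift.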

Inference

At inference time the policy samples a Gaussian noise vector and iteratively reconstructs a continuous-value action using velocity prediction.
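
The sampling loop can be sketched as Euler integration of a velocity field from noise toward an action. This is a toy under stated assumptions: `toy_velocity` stands in for the learned network (which would also be conditioned on vision, language, and proprioception), the step count of 4 and the time convention (t=0 is noise, t=1 is the action) are illustrative choices, and `TARGET_ACTION` is made-up data.

```python
import random


def sample_action(velocity_fn, action_dim, num_steps=4, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]  # Gaussian noise start
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET_ACTION = [0.25, -0.5, 1.0]  # made-up joint targets for the toy example


def toy_velocity(x, t):
    """Stand-in for the learned model: exact velocity of the straight path to TARGET_ACTION."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET_ACTION, x)]


action = sample_action(toy_velocity, action_dim=len(TARGET_ACTION))
print(action)
```

Because the toy velocity field is exact, the integration lands on the target up to floating-point error regardless of the starting noise; a learned velocity network would only approximate this.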

Availability and license

GR00T-N1.5-3B weights are publicly available on Hugging Face. License: NVIDIA One-Way Noncommercial License (the model is ready for non-commercial use). Runs on Linux with the PyTorch runtime; supported NVIDIA microarchitectures: Ampere, Blackwell, Hopper, and Lovelace, plus the Jetson platform.

Classification: Robotics foundation model, Vision-Language-Action model
Family: GR00T
Access: Download
Deployment: Local, On-device
Weights: Open weights
Key parameters
🧩 Parameters: 3B
✓ Fine-tuning
📥 Input: text, image, robot sensors, robot state data
Robotics: Bimanual manipulation, Dexterous manipulation, Robot manipulation, Embodied task planning, Robot control, Scene understanding, Visual grounding

Technical specification

Parameters: 3B
License: NVIDIA One-Way Noncommercial License
Hardware requirements
Supported NVIDIA microarchitectures: Ampere, Blackwell, Hopper, Lovelace, Jetson. Runtime: PyTorch. OS: Linux.
Features: Fine-tuning
Modalities
⬇ Input: text, image, robot_sensors, robot_state_data
⬆ Output: robot_actions, motion_trajectories, manipulator_control

Capabilities and applications

Native model capabilities
Multimodal understanding (Category: multimodal)
Image understanding (Category: vision)
Reasoning (Category: reasoning)
Planning (Category: planning)
Multi-step reasoning (Category: reasoning)
