Robots Atlas>ROBOTS ATLAS
Cosmos Reason

Cosmos Reason

1ย ยทย Family: Cosmos
NVIDIA's open 7B reasoning vision-language model for Physical AI and robotics. Understands space, time and physics, and serves as a planning model for embodied agents.
๐Ÿ“ฆ Archivedโœ“ Public accessโš– Open weightsReasoning modelMultimodal๐Ÿ“ Cosmos
Parameters
7B (โ‰ˆ8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)
parameters
Release date
17 May 2025
Access:DownloadAPIHostedDeployment:๐Ÿ’ป Localโ˜ Cloud

Overview

Cosmos Reason is an open, customizable reasoning vision-language model (VLM) developed by NVIDIA as part of the Cosmos platform for Physical AI and robotics. It enables robots and vision AI agents to reason using prior knowledge, physics understanding and common sense in order to understand and act in the real world. The model understands space, time and fundamental physics, and can serve as a planning model that reasons about what steps an embodied agent might take next.

Architecture

A multi-modal LLM consisting of a Vision Transformer (ViT) as the vision encoder and a dense Transformer as the language model. Network architecture: Qwen2.5-VL-7B-Instruct โ€” Cosmos Reason1-7B is post-trained on top of Qwen2.5-VL-7B-Instruct. Parameter count: ViT 675.76M + LLM 7.07B + projection layer 545M (about 8B total). Video/image is converted into tokens by the vision encoder and a projector, combined with the text prompt and fed into the core model, which uses chain-of-thought to answer step by step.

Training

The model is post-trained on physical common-sense and embodied-reasoning data using supervised fine-tuning (SFT) and reinforcement learning (RL). It uses chain-of-thought reasoning to understand world dynamics without human annotations. Training data includes RoboVQA, BridgeDataV2, AgiBot, HoloAssist and autonomous-vehicle (AV) driving data collected and annotated by NVIDIA.

Applications

Data curation and annotation (automating high-quality annotation of massive datasets), robot planning and reasoning (the brain for vision-language-action models โ€” a robot breaks complex commands into tasks and executes them), and video analytics AI agents (extracting insights and root-cause analysis over large volumes of video). The model is ready for commercial use.

Availability

Cosmos Reason1-7B weights are publicly available on Hugging Face under the NVIDIA Open Model License (commercial use permitted). Post-training code is in the nvidia-cosmos/cosmos-reason1 repository. Runtime: vLLM. Tested hardware: H100, A100, GB200 (NVIDIA Hopper / Blackwell), BF16 precision. The next generation โ€” Cosmos Reason 2 โ€” was released in October 2025. The Cosmos family was subsequently merged into the Cosmos 3 omni-model (COMPUTEX 2026).

Classification
Reasoning modelMultimodal
Family: Cosmos
Access & deployment
DownloadAPIHosted
LocalCloud
Weights: Open weights
Key parameters
๐Ÿงฉ Parameters: 7B (โ‰ˆ8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)
โœ“ Fine-tuning
๐Ÿ“ฅ Input: text, image, video
Robotics
Embodied task planningScene understandingSpatial reasoningObject affordance understandingMotion planningSpatial prediction
Platforms

Technical specification

Parameters
7B (โ‰ˆ8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)
parameters
License
NVIDIA Open Model License (commercial use)
Hardware requirements
Inference via vLLM on NVIDIA Hopper / Blackwell GPUs (tested: H100, A100, GB200), BF16 precision. The 7B model fits on a single server-grade GPU (e.g. H100 80GB). Operating system: Linux.
Features:โœ“ Fine-tuning
Modalities
โฌ‡ Input
textimagevideo
โฌ† Output
text

Capabilities and applications

Native model capabilities
Reasoning
Category: reasoning
Multi-step reasoning
Category: reasoning
Video understanding
The model's ability to analyse and interpret video content โ€” recognising actions, motion, events and relationships between objects over time.
Category: video
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
Planning
The model's ability to determine a sequence of actions leading to a goal โ€” predicting the consequences of actions and selecting an optimal path in a given environment.
Category: planning
Robotics
Embodied task planningScene understandingSpatial reasoningObject affordance understandingMotion planningSpatial prediction

Benchmark results

7 benchmarks
RoboVQA
accuracy ยท embodied reasoning benchmark
87.3%
๐Ÿ“„ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
AV (Autonomous Vehicle)
accuracy ยท embodied reasoning benchmark
70.8%
๐Ÿ“„ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
BridgeDataV2
accuracy ยท embodied reasoning benchmark
63.7%
๐Ÿ“„ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
AgiBot
accuracy ยท embodied reasoning benchmark
48.9%
๐Ÿ“„ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
HoloAssist
accuracy ยท embodied reasoning benchmark
62.7%
๐Ÿ“„ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
RoboFail
accuracy ยท held-out generalization benchmark
57.2%
๐Ÿ“„ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
Embodied Reasoning (Average)
accuracy ยท average across embodied reasoning benchmarks
65.1%
๐Ÿ“„ Cosmos-Reason1 model card / paper (arXiv:2503.15558)

Technical architecture

Core Architecture
Training Techniques

Deployment and security

โ˜ Available on platforms