NVIDIA's open 7B reasoning vision-language model for Physical AI and robotics. Understands space, time and physics, and serves as a planning model for embodied agents.
Parameters
7B (โ8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)
parameters
Release date
17 May 2025
Access:DownloadAPIHostedDeployment:๐ป Localโ Cloud
Overview
Access & deployment
DownloadAPIHosted
LocalCloud
Weights: Open weights
Key parameters
๐งฉ Parameters: 7B (โ8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)
โ Fine-tuning
๐ฅ Input: text, image, video
Robotics
Embodied task planningScene understandingSpatial reasoningObject affordance understandingMotion planningSpatial prediction
Platforms
Technical specification
Parameters
7B (โ8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)
parameters
License
NVIDIA Open Model License (commercial use)
Hardware requirements
Inference via vLLM on NVIDIA Hopper / Blackwell GPUs (tested: H100, A100, GB200), BF16 precision. The 7B model fits on a single server-grade GPU (e.g. H100 80GB). Operating system: Linux.
Features:โ Fine-tuning
Modalities
โฌ Input
textimagevideo
โฌ Output
text
Capabilities and applications
Native model capabilities
Reasoning
Category: reasoning
Multi-step reasoning
Category: reasoning
Video understanding
The model's ability to analyse and interpret video content โ recognising actions, motion, events and relationships between objects over time.
Category: video
Vision encoder
The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.
Category: vision
Planning
The model's ability to determine a sequence of actions leading to a goal โ predicting the consequences of actions and selecting an optimal path in a given environment.
Category: planning
Robotics
Embodied task planningScene understandingSpatial reasoningObject affordance understandingMotion planningSpatial prediction
Application domains
Benchmark results
7 benchmarks
RoboVQA
accuracy ยท embodied reasoning benchmark
87.3%
๐ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
AV (Autonomous Vehicle)
accuracy ยท embodied reasoning benchmark
70.8%
๐ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
BridgeDataV2
accuracy ยท embodied reasoning benchmark
63.7%
๐ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
AgiBot
accuracy ยท embodied reasoning benchmark
48.9%
๐ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
HoloAssist
accuracy ยท embodied reasoning benchmark
62.7%
๐ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
RoboFail
accuracy ยท held-out generalization benchmark
57.2%
๐ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
Embodied Reasoning (Average)
accuracy ยท average across embodied reasoning benchmarks
65.1%
๐ Cosmos-Reason1 model card / paper (arXiv:2503.15558)
Technical architecture
Core Architecture
Model Form
Deployment and security
โ Available on platforms
