Cosmos Reason

1 · Family: Cosmos

NVIDIA's open 7B reasoning vision-language model for Physical AI and robotics. Understands space, time and physics, and serves as a planning model for embodied agents.

📦 Archived✓ Public access⚖ Open weightsReasoning modelMultimodal📁 Cosmos

Parameters

7B (≈8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)

parameters

Release date

17 May 2025

🏢NVIDIAProducer

Access:DownloadAPIHostedDeployment:💻 Local☁ Cloud

Overview

Cosmos Reason is an open, customizable reasoning vision-language model (VLM) developed by NVIDIA as part of the Cosmos platform for Physical AI and robotics. It enables robots and vision AI agents to reason using prior knowledge, physics understanding and common sense in order to understand and act in the real world. The model understands space, time and fundamental physics, and can serve as a planning model that reasons about what steps an embodied agent might take next.

Architecture

A multi-modal LLM consisting of a Vision Transformer (ViT) as the vision encoder and a dense Transformer as the language model. Network architecture: Qwen2.5-VL-7B-Instruct — Cosmos Reason1-7B is post-trained on top of Qwen2.5-VL-7B-Instruct. Parameter count: ViT 675.76M + LLM 7.07B + projection layer 545M (about 8B total). Video/image is converted into tokens by the vision encoder and a projector, combined with the text prompt and fed into the core model, which uses chain-of-thought to answer step by step.

Training

The model is post-trained on physical common-sense and embodied-reasoning data using supervised fine-tuning (SFT) and reinforcement learning (RL). It uses chain-of-thought reasoning to understand world dynamics without human annotations. Training data includes RoboVQA, BridgeDataV2, AgiBot, HoloAssist and autonomous-vehicle (AV) driving data collected and annotated by NVIDIA.

Applications

Data curation and annotation (automating high-quality annotation of massive datasets), robot planning and reasoning (the brain for vision-language-action models — a robot breaks complex commands into tasks and executes them), and video analytics AI agents (extracting insights and root-cause analysis over large volumes of video). The model is ready for commercial use.

Availability

Cosmos Reason1-7B weights are publicly available on Hugging Face under the NVIDIA Open Model License (commercial use permitted). Post-training code is in the nvidia-cosmos/cosmos-reason1 repository. Runtime: vLLM. Tested hardware: H100, A100, GB200 (NVIDIA Hopper / Blackwell), BF16 precision. The next generation — Cosmos Reason 2 — was released in October 2025. The Cosmos family was subsequently merged into the Cosmos 3 omni-model (COMPUTEX 2026).

Classification

Reasoning modelMultimodal

Family: Cosmos

Applications

Robot policy training Video analytics Data curation & annotation

Access & deployment

DownloadAPIHosted

LocalCloud

Weights: Open weights

Key parameters

🧩 Parameters: 7B (≈8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)

✓ Fine-tuning

📥 Input: text, image, video

Robotics

Embodied task planningScene understandingSpatial reasoningObject affordance understandingMotion planningSpatial prediction

Platforms

NVIDIA Cosmos

Technical specification

Parameters

7B (≈8B total: ViT 0.68B + LLM 7.07B + projection 0.55B)

parameters

License

NVIDIA Open Model License (commercial use)

Hardware requirements

Inference via vLLM on NVIDIA Hopper / Blackwell GPUs (tested: H100, A100, GB200), BF16 precision. The 7B model fits on a single server-grade GPU (e.g. H100 80GB). Operating system: Linux.

Features:✓ Fine-tuning

Modalities

⬇ Input

textimagevideo

⬆ Output

text

Capabilities and applications

Native model capabilities

Reasoning

The model's ability to reason logically and solve complex problems.

Category: reasoning

Multi-step reasoning

Carrying out multi-step chains of reasoning across long, complex tasks.

Category: reasoning

Video understanding

The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.

Category: video

Vision encoder

The model's ability to encode images and video frames into dense representations (embeddings), used for downstream tasks or as a backbone for vision-language models.

Category: vision

Planning

Forming and executing action plans for complex tasks.

Category: planning

Robotics

Embodied task planningScene understandingSpatial reasoningObject affordance understandingMotion planningSpatial prediction

Application domains

Robot policy training Video analytics Data curation & annotation

Benchmark results

7 benchmarks

RoboVQA

accuracy · embodied reasoning benchmark

87.3%