
Cosmos Predict

Family: Cosmos
NVIDIA's open World Foundation Model line for generating future world states from text, image or video. It is the main generative model in the Cosmos platform for training robots and autonomous vehicles.
Archived · Public access · Open weights · World Model · Video generation · Cosmos
Parameters
4B – 14B parameters (Cosmos Predict 1, multiple variants)
Release date
6 January 2025
Access: Download, API, Hosted
Deployment: Local, Cloud

Overview

Cosmos Predict is a series of open World Foundation Models (WFM) developed by NVIDIA as part of the Cosmos platform for Physical AI. The models generate future world states from a text prompt, an input image or a video clip, and serve, among other uses, as a source of synthetic training data for humanoid robots, autonomous vehicles and vision systems.

Variants

Cosmos Predict 1 (January 2025) is the original generation released as part of the Cosmos platform. The models are available in two architectures — diffusion-based and autoregressive — in 4B, 5B, 7B, 12B, 13B and 14B parameter variants. Modes: Text2World (7B, 14B), Video2World (5B, 7B, 13B, 14B), WorldInterpolator (7B) and multiview variants for autonomous-vehicle (AV) scenarios.
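
The mode-to-size mapping listed above can be captured as a small lookup table. The following is a minimal Python sketch: the names `PREDICT1_MODES` and `modes_for_size` are hypothetical, and the table only encodes the sizes stated in this entry; it is not an NVIDIA API. The 4B and 12B sizes are not tied to a named mode in the text, so they are left out.

```python
# Sizes (in billions of parameters) per Cosmos Predict 1 mode, as listed in this entry.
# PREDICT1_MODES and modes_for_size are illustrative names, not an NVIDIA API.
PREDICT1_MODES = {
    "Text2World": [7, 14],
    "Video2World": [5, 7, 13, 14],
    "WorldInterpolator": [7],
}


def modes_for_size(size_b: int) -> list[str]:
    """Return the modes this entry lists for a given parameter count (in billions)."""
    return sorted(mode for mode, sizes in PREDICT1_MODES.items() if size_b in sizes)


if __name__ == "__main__":
    for size in (5, 7, 14):
        print(f"{size}B:", ", ".join(modes_for_size(size)))
```

A lookup like this is mainly useful for checking that a requested mode and size combination exists before trying to fetch a checkpoint.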

Cosmos Predict 2 (June 2025) and Cosmos Predict 2.5 (August 2025) introduced further improvements in quality and controllability. In October 2025 NVIDIA published Cosmos Reason 2 and Cosmos Transfer 2.5 as the other pillars of the platform. All these families were eventually superseded by Cosmos 3 (omni-model, COMPUTEX 2026), which merges perception, reasoning and generation in a single Mixture-of-Transformers architecture.

Architecture

Diffusion-based WFMs use a dedicated video tokenizer (Cosmos-Tokenize1) and a denoising process in latent space. The autoregressive variant generates future frames sequentially, frame by frame, as an extension of the large-language-model pattern to video. Cosmos Predict supports conditioning on camera signal, text, an initial image and actions.
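
The sequential, frame-by-frame generation described for the autoregressive variant can be illustrated with a toy rollout loop. This is a sketch under strong assumptions, not NVIDIA's implementation: `rollout` and `constant_velocity` are hypothetical names, a frame is reduced to a short list of floats, and the predictor is a trivial extrapolation standing in for a learned model.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for one encoded video frame


def rollout(
    context: Sequence[Frame],
    predict_next: Callable[[Sequence[Frame]], Frame],
    num_new_frames: int,
) -> List[Frame]:
    """Append frames one at a time, feeding each prediction back as context."""
    frames = [list(f) for f in context]
    for _ in range(num_new_frames):
        frames.append(predict_next(frames))
    return frames


def constant_velocity(frames: Sequence[Frame]) -> Frame:
    """Toy predictor: repeat the last frame-to-frame change (placeholder for a real model)."""
    if len(frames) < 2:
        return list(frames[-1])
    prev, last = frames[-2], frames[-1]
    return [2 * b - a for a, b in zip(prev, last)]


if __name__ == "__main__":
    print(rollout([[0.0], [1.0]], constant_velocity, num_new_frames=3))
```

The point of the loop is that every new frame depends on all frames generated so far, so errors can accumulate over long rollouts; a diffusion-based model instead denoises a whole latent clip jointly.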

Applications

Cosmos Predict is used to generate synthetic training data for post-training of robotic models (e.g. NVIDIA Isaac, GR00T) and autonomous vehicles. Other uses include closed-loop simulation, multi-view AV generation and world interpolation (filling in missing frames between two observations). Customers cited by NVIDIA include 1X Technologies, Agility Robotics, Figure AI, Neura Robotics, Toyota Research Institute, General Motors, Uber, Li Auto and others.

Availability

The Cosmos Predict 1, 2 and 2.5 model weights are publicly available on Hugging Face in NVIDIA collections. Training and post-training code is available on GitHub (NVIDIA/Cosmos). The models can also be run through NVIDIA NIM and from the build.nvidia.com catalogue. The earliest releases were under the NVIDIA Open Model License; the successor Cosmos 3 is released under the OpenMDW 1.1 license from the Linux Foundation.

Classification
World Model, Video generation
Family: Cosmos
Access & deployment
Access: Download, API, Hosted
Deployment: Local, Cloud
Weights: Open weights
Key parameters
🧩 Parameters: 4B – 14B (Cosmos Predict 1, multiple variants)
✓ Fine-tuning
📥 Input: text, image, video, robot state data
Robotics
Environment modeling, Spatial prediction, Scene understanding, Spatial reasoning

Technical specification

Parameters
4B – 14B parameters (Cosmos Predict 1, multiple variants)
License
NVIDIA Open Model License (Cosmos Predict 1 / 2 / 2.5)
Hardware requirements
Training and inference on NVIDIA GPU clusters (recommended: H100 / B100 / GB200). Inference for the smaller variants (4B–7B) is feasible on a single server-grade GPU; the 12B–14B variants and multiview scenarios require multiple GPUs. Reference implementation in PyTorch.
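
As a rough guide to why the larger variants need more than one GPU, the memory taken by the weights alone can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic, not a published NVIDIA requirement: `weight_memory_gb` is a hypothetical helper, the precision choices are assumptions, and activations, the tokenizer and any text encoder add further memory on top.

```python
# Bytes needed to store one parameter at common precisions (assumed, not from the source).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_memory_gb(params_billion: float, dtype: str = "bf16") -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives billions of bytes,
    so the result is already in GB. Weights only; runtime overhead is excluded.
    """
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size in (4, 7, 14):
        print(f"{size}B weights in bf16: about {weight_memory_gb(size):.0f} GB")
```

Comparing these figures with the memory of the target GPU gives a first indication of whether a variant fits on one device before activations are taken into account.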
Features: Fine-tuning
Modalities
⬇ Input
text, image, video, robot_state_data
⬆ Output
video

Capabilities and applications

Native model capabilities
Video generation
The model's ability to generate video clips from a text prompt, image or another video, with control over length, resolution and visual characteristics.
Category: video
Image-to-video
The model's ability to animate a static input image — extending it in time into a consistent video clip according to a description of motion or action.
Category: video
Video understanding
The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.
Category: video
Planning
The model's ability to determine a sequence of actions leading to a goal — predicting the consequences of actions and selecting an optimal path in a given environment.
Category: planning