Cosmos Predict

NVIDIA's open World Foundation Model line for generating future world states from text, image or video. The main generative model in the Cosmos platform for training robots and autonomous vehicles.

📦 Archived | ✓ Public access | ⚖ Open weights | World Model | Video generation | 📁 Cosmos

Parameters

4B – 14B parameters (Cosmos Predict 1, many variants)

Release date

6 January 2025

🏢 NVIDIA (Producer)

Access: Download | API | Hosted
Deployment: 💻 Local | ☁ Cloud

Overview

Cosmos Predict is a series of open World Foundation Models (WFM) developed by NVIDIA as part of the Cosmos platform for Physical AI. The models generate future world states from a text prompt, an input image or a video clip, and serve, among other uses, as a source of synthetic training data for humanoid robots, autonomous vehicles and vision systems.

Variants

Cosmos Predict 1 (January 2025) is the original generation released as part of the Cosmos platform. The models are available in two architectures — diffusion-based and autoregressive — in 4B, 5B, 7B, 12B, 13B and 14B parameter variants. Modes: Text2World (7B, 14B), Video2World (5B, 7B, 13B, 14B), WorldInterpolator (7B) and multiview variants for autonomous-vehicle (AV) scenarios.
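
The mode and size combinations listed above can be written down as a small lookup table. This is only a sketch transcribed from the paragraph: the dictionary layout and the helper name `modes_for_size` are illustrative, not part of any NVIDIA API. The multiview variants are left out because no sizes are given for them here.

```python
# Mode-to-size table transcribed from the paragraph above (Cosmos Predict 1).
# Sizes are in billions of parameters. The layout and names are illustrative only.
PREDICT1_MODES = {
    "Text2World": [7, 14],
    "Video2World": [5, 7, 13, 14],
    "WorldInterpolator": [7],
}


def modes_for_size(size_b: int) -> list[str]:
    """Return the modes listed for a given parameter count (in billions)."""
    return [mode for mode, sizes in PREDICT1_MODES.items() if size_b in sizes]


if __name__ == "__main__":
    for size in (5, 7, 14):
        print(f"{size}B: {modes_for_size(size)}")
```

Sizes that the paragraph mentions without assigning to a mode (4B and 12B) return an empty list from this helper.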

Cosmos Predict 2 (June 2025) and Cosmos Predict 2.5 (August 2025) introduced further improvements in quality and controllability. In October 2025 NVIDIA published Cosmos Reason 2 and Cosmos Transfer 2.5 as the other pillars of the platform. All these families were eventually superseded by Cosmos 3 (omni-model, COMPUTEX 2026), which merges perception, reasoning and generation in a single Mixture-of-Transformers architecture.

Architecture

Diffusion-based WFMs use a dedicated video tokenizer (Cosmos Tokenizer1) and a denoising process in latent space. The autoregressive variant generates future frames sequentially, frame by frame, as an extension of the large-language-model pattern to video. Cosmos Predict supports conditioning on camera signal, text, an initial image and actions.
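
The difference between the two generation styles can be illustrated with a toy example. The sketch below is not the Cosmos implementation: there is no neural network and no tokenizer, and the function names, the step size and the decay factor are all invented for illustration. It only contrasts the control flow: a diffusion-style loop refines a whole latent clip over several denoising steps, while an autoregressive loop appends one frame at a time.

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_denoise(latent_shape, steps=10):
    """Diffusion-style loop: refine the whole latent clip jointly over several steps."""
    target = np.zeros(latent_shape)          # stand-in for a trained denoiser's prediction
    x = rng.standard_normal(latent_shape)    # start from pure noise in latent space
    for _ in range(steps):
        predicted_clean = target             # a real model would run a network here
        x = x + 0.5 * (predicted_clean - x)  # move part of the way toward the prediction
    return x


def toy_autoregressive(first_frame, num_frames=5):
    """Autoregressive loop: produce one frame at a time from the frames generated so far."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        next_frame = 0.9 * frames[-1]        # a real model would predict from the full history
        frames.append(next_frame)
    return np.stack(frames)


if __name__ == "__main__":
    clip = toy_denoise((4, 8, 8))            # (frames, height, width) in a made-up latent space
    video = toy_autoregressive(np.ones((8, 8)), num_frames=5)
    print(clip.shape, video.shape)
```

In the diffusion loop every frame of the clip is updated at each step, whereas in the autoregressive loop earlier frames are fixed once emitted.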

Applications

The main application is generating synthetic training data for post-training robotic models (e.g. NVIDIA Isaac, GR00T) and autonomous-vehicle models. Other uses include closed-loop simulation, multi-view AV generation and world interpolation (filling in missing frames between two observations). Customers cited by NVIDIA include 1X Technologies, Agility Robotics, Figure AI, Neura Robotics, Toyota Research Institute, General Motors, Uber, Li Auto and others.

Availability

The Cosmos Predict 1, 2 and 2.5 model weights are publicly available on Hugging Face in NVIDIA collections. Training and post-training code is available on GitHub (NVIDIA/Cosmos). The models can also be run through NVIDIA NIM and from the build.nvidia.com catalogue. The earliest releases were under the NVIDIA Open Model License; the successor Cosmos 3 is released under the OpenMDW 1.1 license from the Linux Foundation.
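
A download from Hugging Face might look like the sketch below. It uses `snapshot_download` from the `huggingface_hub` package, which is a real function. The repository naming pattern (`nvidia/Cosmos-Predict1-<size>B-<mode>`) is an assumption and should be checked against the NVIDIA collections before use. The helper names `repo_id` and `download_weights` are hypothetical.

```python
def repo_id(size_b: int, mode: str, family: str = "Cosmos-Predict1") -> str:
    """Build a Hugging Face repo id under an ASSUMED naming pattern (verify before use)."""
    return f"nvidia/{family}-{size_b}B-{mode}"


def download_weights(repo: str, local_dir: str) -> str:
    """Download a model snapshot and return its local path.

    Requires `pip install huggingface_hub`; some repositories may also require
    logging in and accepting the license on the model page first.
    """
    # Imported lazily so that repo_id() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo, local_dir=local_dir)


if __name__ == "__main__":
    repo = repo_id(7, "Text2World")
    print(repo)
    # Uncomment once the repo id has been verified:
    # download_weights(repo, "./checkpoints")
```

The download call is left commented out because the checkpoints are large and the repo id here is only a guess at the naming scheme.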

Classification

World Model | Video generation

Family: Cosmos

Access & deployment

Download | API | Hosted

Local | Cloud

Weights: Open weights

Key parameters

🧩 Parameters: 4B – 14B (Cosmos Predict 1, many variants)

✓ Fine-tuning

📥 Input: text, image, video, robot state data

Robotics

Environment modeling | Spatial prediction | Scene understanding | Spatial reasoning

Platforms

NVIDIA Cosmos

Technical specification

Parameters

4B – 14B parameters (Cosmos Predict 1, many variants)

License

NVIDIA Open Model License (Cosmos Predict 1 / 2 / 2.5)

Hardware requirements

Training and inference on NVIDIA GPU clusters (recommended: H100 / B100 / GB200). Inference for the smaller variants (4B–7B) is feasible on a single server-grade GPU; the 12B–14B variants and multiview scenarios require multiple GPUs. Reference implementation in PyTorch.
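
A rough way to see why the larger variants need more hardware is to estimate the memory taken by the weights alone. The sketch below assumes 16-bit weights (2 bytes per parameter), which is an assumption and not a figure from this page. It ignores activations, the tokenizer, any text encoder and framework overhead, so real requirements are higher.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param=2 assumes 16-bit weights; use 4 for 32-bit.
    Billions of parameters times bytes per parameter gives gigabytes directly.
    """
    return float(params_billions * bytes_per_param)


if __name__ == "__main__":
    for size in (4, 7, 14):
        print(f"{size}B parameters: about {weight_memory_gb(size):.0f} GB of weights at 16-bit")
```

Under these assumptions a 7B model needs about 14 GB for weights and a 14B model about 28 GB, before any working memory for video generation is counted.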

Features: ✓ Fine-tuning

Modalities

⬇ Input

text, image, video, robot_state_data

⬆ Output

video

Capabilities and applications

Native model capabilities

Video generation

The model's ability to generate video clips from a text prompt, image or another video, with control over length, resolution and visual characteristics.

Category: video

Image-to-video

The model's ability to animate a static input image — extending it in time into a consistent video clip according to a description of motion or action.

Category: video

Video understanding

The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.

Category: video

Planning

Forming and executing action plans for complex tasks.

Category: planning
