Cosmos Transfer

NVIDIA's open World Foundation Model for controllable video translation: turns simulations (for example from Omniverse) into photorealistic synthetic data for robotics and autonomous vehicles.

📦 Archived · ✓ Public access · ⚖ Open weights · Video generation · World Model · 📁 Cosmos

Parameters

7B (Cosmos Transfer 1, all variants)

Release date

19 March 2025

Producer: NVIDIA

Access: Download, API, Hosted

Deployment: Local, Cloud

Overview

Cosmos Transfer is a series of open World Foundation Models (WFM) developed by NVIDIA as part of the Cosmos platform for Physical AI. The models perform controllable video-to-video translation: they take a structural signal as input (segmentation, depth map, edges, noise, motion outline, simulator video) together with a text description, and produce photorealistic video that preserves the geometry and motion of the source scene.

Variants

Cosmos Transfer 1 (March – April 2025) was released as Cosmos-Transfer1-7B (the main model), Cosmos-Transfer1-7B-Sample-AV (an autonomous-vehicle variant), Cosmos-Transfer1-7B-Sample-AV-Single2MultiView (single-camera to multi-view AV translation) and Cosmos-Transfer1-7B-4KUpscaler (4K upscaling). All variants have around 7B parameters. Cosmos Transfer 2.5 (August 2025) introduced improvements in quality and multi-modal controllability.

Architecture

Cosmos Transfer is a diffusion model with a transformer architecture, operating in latent space using the Cosmos Tokenizer1 video tokenizer. The model is conditioned on multiple information channels (multi-modal control) — including segmentation maps, depth maps, camera signal and a text prompt — which allows precise control over both the structure and the appearance of the generated video. The simulator signal (for example from NVIDIA Omniverse) provides geometry and motion, while the model paints photorealistic textures, lighting and atmosphere on top.
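The multi-modal conditioning described above can be pictured as a weighted blend of per-modality control features. The following is a minimal sketch and not NVIDIA's implementation: the function name `fuse_controls`, the flat-list feature representation, and the rule of rescaling the weights when their sum exceeds 1 are illustrative assumptions. Text-prompt conditioning is left out of the sketch entirely.

```python
from typing import Dict, List


def fuse_controls(
    features: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Blend per-modality control features into one conditioning vector.

    Hypothetical helper for illustration only. Each modality (for example
    "depth" or "segmentation") contributes a feature vector of the same
    length, scaled by its weight. If the weights sum to more than 1 they
    are rescaled to sum to 1, so the overall control strength stays bounded.
    """
    if not features:
        raise ValueError("at least one control modality is required")
    missing = set(features) - set(weights)
    if missing:
        raise ValueError(f"no weight given for: {sorted(missing)}")

    lengths = {len(vector) for vector in features.values()}
    if len(lengths) != 1:
        raise ValueError("all control features must have the same length")

    total = sum(weights[name] for name in features)
    scale = 1.0 / total if total > 1.0 else 1.0

    fused = [0.0] * lengths.pop()
    for name, vector in features.items():
        w = weights[name] * scale
        for i, value in enumerate(vector):
            fused[i] += w * value
    return fused


if __name__ == "__main__":
    controls = {
        "depth": [1.0, 0.0, 2.0],
        "segmentation": [0.0, 4.0, 2.0],
    }
    # Equal weights: each modality contributes half of its features.
    print(fuse_controls(controls, {"depth": 0.5, "segmentation": 0.5}))
```

In this toy version a higher weight makes the output follow that control signal more closely, which mirrors the trade-off between structure and appearance described in the paragraph.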

Applications

The main application is scaling synthetic training data: Cosmos Transfer multiplies a single scene simulated in Omniverse into many realistic variants (different weather, lighting, geolocation, sensor layouts). It is used in pipelines for training autonomous vehicles (Foretellix, Parallel Domain, General Motors, Toyota Research Institute, Uber, Li Auto) and robots (1X Technologies, Agility Robotics, Figure AI, Neura Robotics).
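The multiplication of one scene into many variants can be sketched as enumerating prompt combinations that are then paired with the same simulator output. This is an illustrative sketch only: the helper `build_variant_prompts`, the axis names, and the prompt wording are assumptions, not part of any Cosmos tooling.

```python
from itertools import product
from typing import Dict, List


def build_variant_prompts(base_scene: str, axes: Dict[str, List[str]]) -> List[str]:
    """Return one text prompt per combination of the values on each axis.

    Hypothetical helper: every prompt would accompany the same simulated
    scene, so geometry and motion stay fixed while appearance varies.
    """
    if not axes:
        return [base_scene]
    names = list(axes)
    prompts = []
    for combo in product(*(axes[name] for name in names)):
        details = ", ".join(f"{name}: {value}" for name, value in zip(names, combo))
        prompts.append(f"{base_scene} ({details})")
    return prompts


if __name__ == "__main__":
    variants = build_variant_prompts(
        "A car driving through a four-way intersection",
        {
            "weather": ["clear", "rain"],
            "lighting": ["noon", "dusk", "night"],
        },
    )
    for prompt in variants:
        print(prompt)
```

The number of variants is the product of the axis sizes, so two weather settings and three lighting settings yield six prompts for one simulated scene.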

Availability

The Cosmos Transfer 1 and 2.5 model weights are publicly available on Hugging Face in NVIDIA's collections. Training and post-training code is in the NVIDIA/Cosmos repository on GitHub. The models can also be run through NVIDIA NIM and from the build.nvidia.com catalogue. The earliest releases were under the NVIDIA Open Model License. The Cosmos family was eventually superseded by Cosmos 3 (omni-model, COMPUTEX 2026), in which perception, reasoning and generation are merged into a single Mixture-of-Transformers architecture.

Classification

Video generation, World Model

Family: Cosmos

Applications

Simulation / synthetic data generation, Robot policy training

Access & deployment

Download, API, Hosted

Local, Cloud

Weights: Open weights

Key parameters

🧩 Parameters: 7B (Cosmos Transfer 1, all variants)

✓ Fine-tuning

📥 Input: video, image, text, depth…

Robotics

Environment modeling, Scene understanding, Spatial reasoning

Platforms

NVIDIA Cosmos

Technical specification

Parameters

7B (Cosmos Transfer 1, all variants)

License

NVIDIA Open Model License (Cosmos Transfer 1 / 2.5)

Hardware requirements

Training on NVIDIA GPU clusters of the H100 / B100 / GB200 class. Inference of the 7B model is feasible on a single server-grade GPU (H100 80GB) or via NVIDIA NIM. Reference implementation in PyTorch.
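As a rough plausibility check on the single-GPU claim, the memory needed for the weights alone can be estimated from the parameter count. This is back-of-the-envelope arithmetic under stated assumptions: the function name is hypothetical, and the estimate ignores activations, the tokenizer, the text encoder and any control branches, so it is a lower bound rather than a measured requirement.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    params_7b = 7e9  # parameter count stated for Cosmos Transfer 1
    for label, nbytes in (("fp32", 4), ("bf16/fp16", 2)):
        print(f"{label}: {weight_memory_gb(params_7b, nbytes):.1f} GB")
```

At 2 bytes per parameter the weights come to about 14 GB, which leaves most of an 80 GB card for the components the estimate leaves out.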

Features: ✓ Fine-tuning

Modalities

⬇ Input

video, image, text, depth, structured_data

⬆ Output

video

Capabilities and applications

Native model capabilities

Video generation

The model's ability to generate video clips from a text prompt, image or another video, with control over length, resolution and visual characteristics.

Category: video

Image-to-video

The model's ability to animate a static input image — extending it in time into a consistent video clip according to a description of motion or action.

Category: video

Video understanding

The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.

Category: video

Robotics

Environment modeling, Scene understanding, Spatial reasoning

Application domains

Simulation / synthetic data generation, Robot policy training