Cosmos Transfer

Family: Cosmos
NVIDIA's open World Foundation Model for controllable video translation: turns simulations (for example from Omniverse) into photorealistic synthetic data for robotics and autonomous vehicles.
📦 Archived · ✓ Public access · ⚖ Open weights · Video generation · World Model · 📁 Cosmos
Parameters
7B parameters (Cosmos Transfer 1, all variants)
Release date
19 March 2025
Access: Download, API, Hosted
Deployment: 💻 Local, ☁ Cloud

Overview

Cosmos Transfer is a series of open World Foundation Models (WFM) developed by NVIDIA as part of the Cosmos platform for Physical AI. The models perform controllable video-to-video translation: they take a structural signal as input (segmentation, depth map, edges, noise, motion outline, simulator video) together with a text description, and produce photorealistic video that preserves the geometry and motion of the source scene.

Variants

Cosmos Transfer 1 (March – April 2025) was released as Cosmos-Transfer1-7B (the main model), Cosmos-Transfer1-7B-Sample-AV (an autonomous-vehicle variant), Cosmos-Transfer1-7B-Sample-AV-Single2MultiView (single-camera to multi-view AV translation) and Cosmos-Transfer1-7B-4KUpscaler (4K upscaling). All variants have around 7B parameters. Cosmos Transfer 2.5 (August 2025) introduced improvements in quality and multi-modal controllability.

Architecture

Cosmos Transfer is a diffusion model with a transformer architecture, operating in latent space using the Cosmos Tokenizer1 video tokenizer. The model is conditioned on multiple information channels (multi-modal control) — including segmentation maps, depth maps, camera signal and a text prompt — which allows precise control over both the structure and the appearance of the generated video. The simulator signal (for example from NVIDIA Omniverse) provides geometry and motion, while the model paints photorealistic textures, lighting and atmosphere on top.
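One way to picture multi-modal control is as a weighted blend of per-modality conditioning features that steers the latent being denoised. The NumPy sketch below illustrates that idea only. The function name `blend_controls`, the weights and the tensor shapes are hypothetical and are not taken from NVIDIA's code.

```python
import numpy as np

def blend_controls(controls, weights):
    """Blend per-modality control features into one conditioning tensor.

    controls: dict name -> array, all with the same shape (T, H, W, C)
    weights:  dict name -> non-negative float, same keys as `controls`
    """
    if set(controls) != set(weights):
        raise ValueError("controls and weights must cover the same modalities")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    if len({c.shape for c in controls.values()}) != 1:
        raise ValueError("all control features must share one shape")
    # Normalise so the blend keeps the scale of a single modality.
    return sum((weights[k] / total) * controls[k] for k in controls)

# Toy latent-sized features: 4 frames, 8x8 spatial, 16 channels (illustrative sizes).
rng = np.random.default_rng(0)
shape = (4, 8, 8, 16)
controls = {
    "segmentation": rng.normal(size=shape),
    "depth": rng.normal(size=shape),
}
# Depth is trusted three times as much as segmentation in this toy setup.
cond = blend_controls(controls, {"segmentation": 1.0, "depth": 3.0})
```

Raising the weight of one modality makes the blended signal follow that modality more closely, which is the intuition behind trading off structure sources such as depth and segmentation.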

Applications

The main application is scaling synthetic training data: Cosmos Transfer multiplies a single scene simulated in Omniverse into many realistic variants (different weather, lighting, geolocation, sensor layouts). It is used in NVIDIA pipelines for training autonomous vehicles (Foretellix, Parallel Domain, General Motors, Toyota Research Institute, Uber, Li Auto) and for robotics (1X Technologies, Agility Robotics, Figure AI, Neura Robotics).
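The multiplication of one simulated scene into many variants can be sketched as a Cartesian product over appearance attributes, with each resulting prompt paired with the same control video. The helper `prompt_variants` and the example attribute values are hypothetical and are not part of any NVIDIA tool.

```python
from itertools import product

def prompt_variants(base, weather, lighting, location):
    """Return one text prompt per (weather, lighting, location) combination."""
    return [
        f"{base}, {w}, {l}, {loc}"
        for w, l, loc in product(weather, lighting, location)
    ]

prompts = prompt_variants(
    "a delivery robot crossing a warehouse loading dock",
    weather=["clear sky", "heavy rain", "fog"],
    lighting=["midday sun", "dusk"],
    location=["a European port", "a desert logistics hub"],
)

# 3 weather x 2 lighting x 2 location values give 12 prompts for one scene.
for p in prompts[:3]:
    print(p)
```

Because the structural signal stays fixed, every variant shares the geometry and motion of the simulated scene while the text prompt changes only its appearance.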

Availability

The Cosmos Transfer 1 and 2.5 model weights are publicly available on Hugging Face in NVIDIA's collections. Training and post-training code is in the NVIDIA/Cosmos repository on GitHub. The models can also be run through NVIDIA NIM and from the build.nvidia.com catalogue. The earliest releases were under the NVIDIA Open Model License. The Cosmos family was eventually superseded by Cosmos 3 (omni-model, COMPUTEX 2026), in which perception, reasoning and generation are merged into a single Mixture-of-Transformers architecture.
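Fetching a checkpoint from Hugging Face might look like the sketch below, which uses `snapshot_download` from the `huggingface_hub` package. The `nvidia/<variant>` repository naming is an assumption based on the variant names listed above, and access may require accepting the licence and logging in with a Hugging Face token.

```python
VARIANTS = [
    "Cosmos-Transfer1-7B",
    "Cosmos-Transfer1-7B-Sample-AV",
    "Cosmos-Transfer1-7B-Sample-AV-Single2MultiView",
    "Cosmos-Transfer1-7B-4KUpscaler",
]

def repo_id(variant, org="nvidia"):
    """Build a Hugging Face repo id; the `<org>/<variant>` pattern is an assumption."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    return f"{org}/{variant}"

def download(variant, target_dir):
    """Fetch all files of one checkpoint into target_dir and return the local path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id(variant), local_dir=target_dir)
```

A call such as `download("Cosmos-Transfer1-7B", "./checkpoints/Cosmos-Transfer1-7B")` would then place the files in a local folder, provided the repository exists under that name and the account has access.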

Classification
Video generation, World Model
Family: Cosmos
Access & deployment
Download, API, Hosted
Local, Cloud
Weights: Open weights
Key parameters
🧩 Parameters: 7B (Cosmos Transfer 1, all variants)
✓ Fine-tuning
📥 Input: video, image, text, depth
Robotics
Environment modeling, Scene understanding, Spatial reasoning

Technical specification

Parameters
7B parameters (Cosmos Transfer 1, all variants)
License
NVIDIA Open Model License (Cosmos Transfer 1 / 2.5)
Hardware requirements
Training on NVIDIA GPU clusters of the H100 / B100 / GB200 class. Inference of the 7B model is feasible on a single server-grade GPU (H100 80GB) or via NVIDIA NIM. Reference implementation in PyTorch.
Features: Fine-tuning
Modalities
⬇ Input
video, image, text, depth, structured_data
⬆ Output
video
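
The hardware note in the specification above (a 7B model on a single 80 GB GPU) can be sanity-checked with a rough weight-memory estimate. The sketch below counts model weights only, under the assumption of a uniform numeric precision. It ignores activations, the video tokenizer, control inputs and framework overhead, all of which add to the real footprint.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30

N_PARAMS = 7e9  # parameter count stated in the specification

for name, nbytes in [("fp32", 4), ("bf16", 2), ("fp8", 1)]:
    print(f"{name}: {weight_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

At 2 bytes per parameter the weights come to roughly 13 GiB, which leaves most of an 80 GB card for activations and video latents. This is consistent with the single-GPU inference claim but does not prove it.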

Capabilities and applications

Native model capabilities
Video generation
The model's ability to generate video clips from a text prompt, image or another video, with control over length, resolution and visual characteristics.
Category: video
Image-to-video
The model's ability to animate a static input image — extending it in time into a consistent video clip according to a description of motion or action.
Category: video
Video understanding
The model's ability to analyse and interpret video content — recognising actions, motion, events and relationships between objects over time.
Category: video
Robotics
Environment modeling, Scene understanding, Spatial reasoning
