What is NVIDIA Cosmos 3?

NVIDIA Cosmos 3 is an open suite of omnimodal world models for Physical AI. The public repository describes it as a model family that jointly processes and generates language, images, video, audio, and action sequences for robotics, autonomous vehicles, simulation, and video analytics.

What are the two Cosmos 3 runtime surfaces?

Cosmos 3 exposes Reasoner and Generator surfaces. Reasoner takes text and vision inputs and returns text for understanding, grounding, planning, and physical reasoning; Generator takes text, vision, sound, and action inputs and can output vision, sound, and action for simulation, future prediction, synthetic data, and robot training.

Which Cosmos 3 models are listed in the release?

The GitHub README lists Cosmos3-Nano at 16B parameters, Cosmos3-Super at 64B, Cosmos3-Super-Text2Image at 64B, Cosmos3-Super-Image2Video at 64B, and Cosmos3-Nano-Policy-DROID at 16B for DROID manipulation and control.

How do developers run Cosmos 3?

NVIDIA documents multiple integration paths: Diffusers for Python-first Generator research, vLLM-Omni for OpenAI-compatible Generator serving, vLLM for Reasoner serving, NIM for a prebuilt Reasoner container, and Cosmos Framework for setup, inference, training, and evaluation workflows.

What hardware and software constraints matter?

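The README lists Linux as the supported operating system, BF16 as the tested precision, and NVIDIA Ampere, Hopper, and Blackwell GPUs as the supported hardware. CUDA 13 is recommended, and CUDA 12.8 is also supported. The README also notes that the system CUDA major version must match the CUDA major version of the installed PyTorch build.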
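The major-version rule is easy to check before installing anything heavy. The sketch below is a minimal illustration, not NVIDIA tooling: the helper names `cuda_major` and `cuda_majors_match` are hypothetical, and it only compares version strings that you supply.

```python
from typing import Optional


def cuda_major(version: Optional[str]) -> Optional[int]:
    """Return the major component of a CUDA version string such as '12.8'."""
    if not version:
        # A CPU-only PyTorch build reports no CUDA version at all.
        return None
    return int(version.strip().split(".")[0])


def cuda_majors_match(system_cuda: Optional[str], torch_cuda: Optional[str]) -> bool:
    """True only when both versions are known and share the same major version."""
    system_major = cuda_major(system_cuda)
    torch_major = cuda_major(torch_cuda)
    return system_major is not None and system_major == torch_major
```

In practice the second argument would typically come from `torch.version.cuda`, and the first from the toolkit installed on the host, for example the release reported by `nvcc --version`.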

Is Cosmos 3 ready for safety-critical robotics control?

Not by itself. NVIDIA's README explicitly lists limitations such as temporal inconsistency, unstable motion, imperfect action-state consistency, object morphing, inaccurate 3D structure, and implausible dynamics. Safety-critical deployment needs validation, guardrails, and system-level safety analysis.

NVIDIA Cosmos 3: Open Physical AI World Models | explainx.ai Blog

NVIDIA Cosmos 3 is the new open model family inside NVIDIA's Cosmos platform for Physical AI: robots, autonomous vehicles, industrial video systems, simulation pipelines, and synthetic-data workflows. The public repository positions Cosmos 3 as an omnimodal world model that can reason over text and vision while also generating images, videos, sound, and action sequences.

The important shift is that Cosmos 3 is not just another video model. It exposes two runtime surfaces: Reasoner for understanding and planning, and Generator for world simulation, future prediction, sound/video generation, and action-conditioned rollouts. As of June 4, 2026, the GitHub repository shows roughly 8.7k stars, one launch release, and model access through the NVIDIA Cosmos 3 Hugging Face collection.

This post summarizes the public README, NVIDIA Cosmos page, and linked developer materials as of June 4, 2026. For the event context around Jensen Huang's broader NVIDIA announcements, read our NVIDIA Computex 2026 recap. Check the upstream repo before pinning install commands, benchmark claims, CUDA choices, or license decisions.

TL;DR

Question	Short answer

Model type	Typical job	Cosmos 3 overlap
Vision-language model	Understand images/video and answer questions	Reasoner surface
Video generator	Generate video from text or images	Generator surface
World simulator	Predict how scenes evolve	Generator future prediction and forward dynamics
Robot policy model	Predict or condition on actions	Action modeling and policy workflows
Synthetic-data engine	Create training data at scale	Video, sound, and action-conditioned outputs

Surface	Inputs	Outputs	Best fit
Reasoner	Text and vision	Text	Captioning, temporal localization, 2D grounding, embodied reasoning, physical plausibility, planning
Generator	Text, vision, sound, action	Vision, sound, action	Text-to-image, text-to-video, image-to-video, video-to-video, forward dynamics, policy rollouts

json

[
  {
    "role": "system",
    "content": [{ "type": "text", "text": "You are a helpful assistant." }]
  },
  {
    "role": "user",
    "content": [
      { "type": "video_url", "video_url": { "url": "https://example.com/video.mp4" } },
      { "type": "text", "text": "List the notable events with approximate timestamps." }
    ]
  }
]
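For readers who want to assemble this message list in code, here is a minimal Python sketch. It assumes the Reasoner sits behind an OpenAI-compatible chat endpoint and that the video part uses the nested `{"url": ...}` form common to such servers; confirm both against the upstream README. The function names and the `max_tokens` default are hypothetical.

```python
import json
from typing import Any, Dict, List


def build_reasoner_messages(video_url: str, question: str) -> List[Dict[str, Any]]:
    """Build a system + user message pair with one video part and one text part."""
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}],
        },
        {
            "role": "user",
            "content": [
                # Assumption: nested {"url": ...} object, as in OpenAI-style content parts.
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": question},
            ],
        },
    ]


def build_chat_request(model: str, messages: List[Dict[str, Any]], max_tokens: int = 512) -> bytes:
    """Serialize a chat-completions request body; max_tokens is an illustrative default."""
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return json.dumps(body).encode("utf-8")
```

The resulting bytes would be sent as the body of a POST to the server's chat-completions route (conventionally `/v1/chat/completions`) with a `Content-Type: application/json` header. The model name to pass depends on which checkpoint the server was started with.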

Workflow	Inputs	Outputs
Text-to-image	Text	Vision
Text-to-video	Text	Vision
Text-to-video with sound	Text	Vision and sound
Image-to-video	Text and image	Vision
Video-to-video	Text and video	Vision
Forward dynamics	Text, vision, action	Future visual state
Action policy	Text and vision	Action and rollout video

Model	Size	Primary capability
Cosmos3-Nano	16B	Compact omnimodal model for multimodal understanding, simulation, future prediction, action reasoning, and Physical AI
Cosmos3-Super	64B	Larger omnimodal model for advanced understanding, simulation, future prediction, and action reasoning
Cosmos3-Super-Text2Image	64B	High-fidelity text-to-image generation
Cosmos3-Super-Image2Video	64B	Temporally coherent image-to-video generation
Cosmos3-Nano-Policy-DROID	16B	Vision-language robot policy for DROID manipulation and control
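The parameter counts translate into a rough lower bound on GPU memory. The sketch below is back-of-the-envelope arithmetic rather than a figure from the README: it assumes every parameter is stored in BF16 (2 bytes) and counts weights only, so activations, caches, and framework overhead come on top.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 16 bits

# Parameter counts in billions, taken from the model table above.
MODEL_SIZES_B = {
    "Cosmos3-Nano": 16,
    "Cosmos3-Super": 64,
}


def weight_memory_gb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Decimal gigabytes needed to hold the weights alone at the given precision."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    for name, size_b in MODEL_SIZES_B.items():
        print(f"{name}: about {weight_memory_gb(size_b):.0f} GB of BF16 weights")
```

Under these assumptions the 16B checkpoints need about 32 GB for weights and the 64B checkpoints about 128 GB, which is why the Super models generally call for multi-GPU or very large-memory setups.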

Area	Public README detail
Input types	Text, text + image, text + video, text + image + action
Input formats	Text string, JPG/PNG/JPEG/WEBP images, MP4 video, JSON action arrays
Output types	Image, video, sound, action state, text
Output formats	JPG image, MP4 video, AAC sound muxed into MP4, JSON action values, text
Resolution tiers	256p, 480p, 720p; default 480p
Aspect ratios	16:9, 4:3, 1:1, 3:4, 9:16; default 16:9
Frame rates	10, 16, 24, 30 FPS; default 24 FPS
Frame count	5 to 300 frames; default 189
Prompt guidance	Fewer than 300 words is recommended for world-generation prompts
Sound output	Stereo AAC at 48 kHz when generated with video
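These limits are simple enough to check before a request reaches the model. The following sketch encodes the values from the table in a small validator. The class and field names are hypothetical and do not mirror any Cosmos 3 API; the defaults are the ones the table lists.

```python
from dataclasses import dataclass
from typing import List

RESOLUTION_TIERS = ("256p", "480p", "720p")
ASPECT_RATIOS = ("16:9", "4:3", "1:1", "3:4", "9:16")
FRAME_RATES = (10, 16, 24, 30)
MIN_FRAMES, MAX_FRAMES = 5, 300
RECOMMENDED_MAX_PROMPT_WORDS = 300


@dataclass(frozen=True)
class GenerationSettings:
    resolution: str = "480p"
    aspect_ratio: str = "16:9"
    fps: int = 24
    num_frames: int = 189

    @property
    def duration_seconds(self) -> float:
        """Clip length implied by the frame count and frame rate."""
        return self.num_frames / self.fps

    def problems(self, prompt: str = "") -> List[str]:
        """Return human-readable issues; an empty list means the settings fit the table."""
        issues: List[str] = []
        if self.resolution not in RESOLUTION_TIERS:
            issues.append(f"resolution {self.resolution!r} is not one of {RESOLUTION_TIERS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            issues.append(f"aspect ratio {self.aspect_ratio!r} is not one of {ASPECT_RATIOS}")
        if self.fps not in FRAME_RATES:
            issues.append(f"fps {self.fps} is not one of {FRAME_RATES}")
        if not MIN_FRAMES <= self.num_frames <= MAX_FRAMES:
            issues.append(f"num_frames {self.num_frames} is outside {MIN_FRAMES}-{MAX_FRAMES}")
        if len(prompt.split()) >= RECOMMENDED_MAX_PROMPT_WORDS:
            issues.append("prompt has 300 or more words; fewer is recommended")
        return issues
```

With the defaults, 189 frames at 24 FPS works out to 7.875 seconds of video, and the maximum of 300 frames at 24 FPS gives 12.5 seconds.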

Goal	Use	Notes
Generator research	Diffusers	Python-first path for inspecting generation behavior
Generator production serving	vLLM-Omni	OpenAI-compatible API for image, video, sound, and action outputs
Reasoner research	Transformers	Listed as coming soon in the README
Reasoner production serving	vLLM	OpenAI-compatible endpoint for text outputs from text and vision inputs
Turnkey Reasoner deployment	NIM	Prebuilt optimized container
Training and evaluation	Cosmos Framework	Full workflow docs for inference, training, and evaluation

bash

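# Create and activate a Python 3.13 virtual environment with uv.
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
# Install Diffusers from source along with the runtime dependencies.
uv pip install --torch-backend=auto \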
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate \
  av \
  cosmos_guardrail \
  huggingface_hub \
  imageio \
  imageio-ffmpeg \
  torch \
  torchvision \
  transformers

bash

# Serve Cosmos3-Nano through vLLM-Omni on port 8000, reusing the local Hugging Face cache.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000 \
  --init-timeout 1800
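Large checkpoints can take a long time to load, so client scripts usually wait for the server before sending work. The sketch below polls until a probe succeeds. It is a generic pattern, not part of vLLM-Omni: the function names are hypothetical, the 1800-second default simply mirrors the `--init-timeout` value in the command above, and the `/v1/models` route in the usage note is an assumption based on OpenAI-compatible servers.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError and HTTPError are OSError subclasses, as are socket errors.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe repeatedly until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: http_probe("http://localhost:8000/v1/models"))`. The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.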

Benchmark area	Surface	What it measures
Cosmos3-Nano generator	Generator	Text-to-image, text-to-video, and image-to-video latency across PyTorch, vLLM-Omni, and Diffusers
Cosmos3-Super generator	Generator	The same generation modalities at larger checkpoint scale
Cosmos3-Nano reasoner	Reasoner	vLLM serving metrics such as time to first token, request latency, and throughput under concurrency
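To make the Reasoner serving metrics concrete, here is a small sketch of how time to first token, request latency, and throughput can be derived from per-request timestamps. The `RequestTrace` structure and its fields are hypothetical. This illustrates the definitions only and is not the benchmark harness NVIDIA uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RequestTrace:
    sent_at: float         # seconds, when the request was sent
    first_token_at: float  # seconds, when the first output token arrived
    finished_at: float     # seconds, when the last output token arrived
    output_tokens: int     # number of tokens generated


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.first_token_at - trace.sent_at


def request_latency(trace: RequestTrace) -> float:
    """End-to-end time for one request."""
    return trace.finished_at - trace.sent_at


def throughput_tokens_per_second(traces: List[RequestTrace]) -> float:
    """Total output tokens divided by the wall-clock window covering all requests."""
    if not traces:
        raise ValueError("at least one trace is required")
    window = max(t.finished_at for t in traces) - min(t.sent_at for t in traces)
    if window <= 0:
        raise ValueError("traces must span a positive time window")
    return sum(t.output_tokens for t in traces) / window
```

Under concurrency, individual latencies usually rise while aggregate throughput improves, which is why the three numbers are best read together and not in isolation.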

Approach	Example	Output style	Best for
Omnimodal Physical AI model	Cosmos 3	Text, image, video, sound, action	Robotics, AV, physical reasoning, synthetic data
Persistent 3D world generation	Tencent HY-World 2.0	3DGS, meshes, point clouds	Editable worlds and engine import
Interactive playable worlds	Google Genie-style systems	Video or playable scene rollouts	Agent training and game-like interaction
Real-time audiovisual world models	Odyssey Starchild-style systems	Streaming audio-video	Interactive media and multimodal environments
Video understanding models	VLMs and video agents	Text or structured outputs	Search, captioning, safety, monitoring

Question	Short answer
What is it?	An open omnimodal world-model family for Physical AI, published under the NVIDIA/cosmos repo
Core surfaces	Reasoner for text output from text/vision; Generator for image, video, sound, and action outputs
Architecture	Unified Mixture-of-Transformers with autoregressive reasoning and diffusion-based multimodal generation
Models listed	Cosmos3-Nano 16B, Cosmos3-Super 64B, Super Text2Image 64B, Super Image2Video 64B, Nano Policy DROID 16B
Developer paths	Diffusers, Transformers, vLLM-Omni, vLLM, NIM, and Cosmos Framework
Main caveat	Outputs can still break physically; safety-critical use needs validation beyond model inference

NVIDIA Cosmos 3: Open Physical AI World Models for Robots and Autonomous Systems

TL;DR

What Cosmos 3 is

Reasoner vs Generator

Reasoner

Generator

Model family

Architecture in plain English

Inputs, outputs, and generation settings

How to get started

Diffusers path

vLLM-Omni path

Reasoner serving

CUDA and container constraints

Benchmarks and what to read

Use cases that actually fit

Robot learning

Autonomous vehicle training

Industrial video agents

Synthetic data generation

Cosmos 3 vs other world-model approaches

Limitations

Source links

Bottom line