NVIDIA Cosmos 3 is the new open model family inside NVIDIA's Cosmos platform for Physical AI: robots, autonomous vehicles, industrial video systems, simulation pipelines, and synthetic-data workflows. The public repository positions Cosmos 3 as an omnimodal world model that can reason over text and vision while also generating images, videos, sound, and action sequences.
The important shift is that Cosmos 3 is not just another video model. It exposes two runtime surfaces: Reasoner for understanding and planning, and Generator for world simulation, future prediction, sound/video generation, and action-conditioned rollouts. As of June 4, 2026, the GitHub repository shows roughly 8.7k stars, one launch release, and model access through the NVIDIA Cosmos 3 Hugging Face collection.
This post summarizes the public README, NVIDIA Cosmos page, and linked developer materials as of June 4, 2026. For the event context around Jensen Huang's broader NVIDIA announcements, read our NVIDIA Computex 2026 recap. Check the upstream repo before pinning install commands, benchmark claims, CUDA choices, or license decisions.
TL;DR
| Question | Short answer |
|---|---|
| What is it? | An open omnimodal world-model family for Physical AI, published under the NVIDIA/cosmos repo |
| Core surfaces | Reasoner for text output from text/vision; Generator for image, video, sound, and action outputs |
| Architecture | Unified Mixture-of-Transformers with autoregressive reasoning and diffusion-based multimodal generation |
| Models listed | Cosmos3-Nano 16B, Cosmos3-Super 64B, Super Text2Image 64B, Super Image2Video 64B, Nano Policy DROID 16B |
| Developer paths | Diffusers, Transformers, vLLM-Omni, vLLM, NIM, and Cosmos Framework |
| Main caveat | Outputs can still be physically implausible; safety-critical use needs validation beyond model inference |
What Cosmos 3 is
Cosmos is NVIDIA's open platform of world models, datasets, and tools for building Physical AI. The broader platform includes Cosmos Framework, Cosmos Curator, and Cosmos Evaluator; Cosmos 3 is the newest model family inside that stack.
The NVIDIA Cosmos product page describes Cosmos 3 as an open Physical AI foundation model with native reasoning, world generation, and action generation built on Mixture-of-Transformers. The public README says the model family jointly processes and generates language, images, video, audio, and action sequences.
That makes Cosmos 3 easier to place if you compare it with adjacent model classes:
| Model type | Typical job | Cosmos 3 overlap |
|---|---|---|
| Vision-language model | Understand images/video and answer questions | Reasoner surface |
| Video generator | Generate video from text or images | Generator surface |
| World simulator | Predict how scenes evolve | Generator future prediction and forward dynamics |
| Robot policy model | Predict or condition on actions | Action modeling and policy workflows |
| Synthetic-data engine | Create training data at scale | Video, sound, and action-conditioned outputs |
NVIDIA's framing is that Physical AI teams should not need one model for captioning, another for simulation, another for action prediction, and another for video generation. Cosmos 3 attempts to make these capabilities share one architectural backbone and one developer ecosystem.
Reasoner vs Generator
The cleanest way to understand the release is to separate the two runtime surfaces.
| Surface | Inputs | Outputs | Best fit |
|---|---|---|---|
| Reasoner | Text and vision | Text | Captioning, temporal localization, 2D grounding, embodied reasoning, physical plausibility, planning |
| Generator | Text, vision, sound, action | Vision, sound, action | Text-to-image, text-to-video, image-to-video, video-to-video, forward dynamics, policy rollouts |
Reasoner
Reasoner is the understanding path. It accepts text plus images or video and returns text. In the README examples, this covers detailed captioning, timestamped event localization, common-sense physical judgment, bounding-box grounding, describe-anything prompts, action chain-of-thought, driving-scene reasoning, and likely-next-action prediction.
The message format follows Qwen3-VL-compatible conventions. A basic request shape looks like this:
[
  {
    "role": "system",
    "content": [{ "type": "text", "text": "You are a helpful assistant." }]
  },
  {
    "role": "user",
    "content": [
      { "type": "video_url", "video_url": "https://example.com/video.mp4" },
      { "type": "text", "text": "List the notable events with approximate timestamps." }
    ]
  }
]
Reasoner is the better path when you want an answer, a plan, a classification, a JSON grounding result, or an explanation of visible physical context.
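The request shape shown above can also be assembled programmatically. The following is a minimal Python sketch; the helper name `build_reasoner_messages` is hypothetical, and the exact content-part schema the server accepts should be confirmed against the upstream README.

```python
import json


def build_reasoner_messages(video_url, question,
                            system_prompt="You are a helpful assistant."):
    """Return a chat message list in the shape of the example request."""
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {
            "role": "user",
            "content": [
                # The example passes the URL as a plain string. Some
                # OpenAI-compatible servers expect {"url": ...} here instead;
                # treat the exact form as an assumption to verify upstream.
                {"type": "video_url", "video_url": video_url},
                {"type": "text", "text": question},
            ],
        },
    ]


messages = build_reasoner_messages(
    "https://example.com/video.mp4",
    "List the notable events with approximate timestamps.",
)
print(json.dumps(messages, indent=2))
```

Keeping message construction in one function makes it easier to switch the `video_url` form later if the serving stack requires a different schema.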
Generator
Generator is the world-production path. It accepts text, vision, sound, and action conditioning, then produces non-text outputs: images, videos, synchronized sound, and action states.
The README examples include:
| Workflow | Inputs | Outputs |
|---|---|---|
| Text-to-image | Text | Vision |
| Text-to-video | Text | Vision |
| Text-to-video with sound | Text | Vision and sound |
| Image-to-video | Text and image | Vision |
| Video-to-video | Text and video | Vision |
| Forward dynamics | Text, vision, action | Future visual state |
| Action policy | Text and vision | Action and rollout video |
The distinction matters operationally. If you are building a video analytics agent, Reasoner is the starting point. If you are generating synthetic robot training clips or predicting future observations from an action trace, Generator is the starting point.
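As a rough illustration of that routing decision, the sketch below maps the output kinds a workflow needs to the surface that produces them. The function name and the output labels are hypothetical; they only restate the surface table, not any Cosmos API.

```python
REASONER = "Reasoner"
GENERATOR = "Generator"

# Output kinds per surface, restated from the surface table (labels are ours).
GENERATOR_OUTPUTS = {"image", "video", "sound", "action"}
REASONER_OUTPUTS = {"text"}


def pick_surfaces(desired_outputs):
    """Return the surface(s) needed to produce the requested output kinds."""
    wanted = {kind.lower() for kind in desired_outputs}
    unknown = wanted - GENERATOR_OUTPUTS - REASONER_OUTPUTS
    if unknown:
        raise ValueError(f"unknown output kinds: {sorted(unknown)}")
    surfaces = []
    if wanted & REASONER_OUTPUTS:
        surfaces.append(REASONER)
    if wanted & GENERATOR_OUTPUTS:
        surfaces.append(GENERATOR)
    return surfaces


print(pick_surfaces({"text"}))            # video analytics agent
print(pick_surfaces({"video", "sound"}))  # synthetic training clips
```

A workflow that needs both an explanation and a rollout would call both surfaces, which is why the helper returns a list rather than a single name.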
Model family
The release README lists five primary model entries:
| Model | Size | Primary capability |
|---|---|---|
| Cosmos3-Nano | 16B | Compact omnimodal model for multimodal understanding, simulation, future prediction, action reasoning, and Physical AI |
| Cosmos3-Super | 64B | Larger omnimodal model for advanced understanding, simulation, future prediction, and action reasoning |
| Cosmos3-Super-Text2Image | 64B | High-fidelity text-to-image generation |
| Cosmos3-Super-Image2Video | 64B | Temporally coherent image-to-video generation |
| Cosmos3-Nano-Policy-DROID | 16B | Vision-language robot policy for DROID manipulation and control |
This model list is worth checking against older summaries. Some earlier coverage, including our NVIDIA Computex event recap, cited smaller Nano/Super parameter counts based on the keynote messaging. The public GitHub README now lists 16B and 64B entries for the launch artifacts, so treat the repository as the canonical reference.
Architecture in plain English
Cosmos 3 uses a unified Mixture-of-Transformers architecture with two jobs inside one model family:
- Autoregressive reasoning for language and visual understanding.
- Diffusion-based generation for images, video, audio, and action tokens.
In Reasoner mode, the model processes language and visual tokens through causal self-attention, similar to how a multimodal language model predicts the next text token. In Generator mode, noisy multimodal tokens are denoised through full attention, which is closer to the diffusion path used in modern image and video generators.
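The difference between the two attention patterns is easiest to see as masks. This is a toy sketch in plain Python, not Cosmos code: it only shows which positions may attend to which under causal versus full attention.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Every token may attend to every other token."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


print("causal (autoregressive reasoning):")
print(render(causal_mask(4)))
print("full (diffusion-style denoising):")
print(render(full_mask(4)))
```

The causal mask is lower-triangular, so each token sees only its past. The full mask lets every noisy token see the whole sequence at each denoising step.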
Both modes share the same high-level transformer architecture, multimodal attention layers, and a 3D multidimensional rotary position embedding representation. The 3D positional design matters because world models need to represent not only what appears in a frame, but also where it is and how it changes over time.
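To make the "where and when" idea concrete, the sketch below assigns each video token a (time, row, column) index. It illustrates only the indexing a 3D positional scheme starts from. The grid sizes are hypothetical, and the rotary embedding math and Cosmos's actual token layout are not shown.

```python
from itertools import product


def position_ids_3d(frames, height, width):
    """Return one (t, y, x) index per video token, in frame-major order."""
    return list(product(range(frames), range(height), range(width)))


# Hypothetical tiny grid: 2 frames of 2 x 3 patches each.
ids = position_ids_3d(frames=2, height=2, width=3)
print(len(ids))         # 12 tokens in total
print(ids[0], ids[-1])  # (0, 0, 0) (1, 1, 2)
```

A 1D index would give the first patch of the second frame position 6, with no way to tell that it sits at the same spatial location as token 0. The 3D index `(1, 0, 0)` makes that relationship explicit.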
For a robotics team, that means Cosmos 3 is trying to keep perception, temporal prediction, and action-conditioned generation in the same representational space instead of stitching together separate systems after the fact.
Inputs, outputs, and generation settings
Cosmos 3 supports a broad I/O surface, but the defaults are still concrete enough to plan around.
| Area | Public README detail |
|---|---|
| Input types | Text, text + image, text + video, text + image + action |
| Input formats | Text string, JPG/PNG/JPEG/WEBP images, MP4 video, JSON action arrays |
| Output types | Image, video, sound, action state, text |
| Output formats | JPG image, MP4 video, AAC sound muxed into MP4, JSON action values, text |
| Resolution tiers | 256p, 480p, 720p; default 480p |
| Aspect ratios | 16:9, 4:3, 1:1, 3:4, 9:16; default 16:9 |
| Frame rates | 10, 16, 24, 30 FPS; default 24 FPS |
| Frame count | 5 to 300 frames; default 189 |
| Prompt guidance | Fewer than 300 words is recommended for world-generation prompts |
| Sound output | Stereo AAC at 48 kHz when generated with video |
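The table above translates naturally into a pre-flight check. The sketch below is one possible way to validate settings before submitting a job. The class and field names are hypothetical and are not the real pipeline arguments; only the allowed values and defaults come from the table.

```python
from dataclasses import dataclass

# Allowed values copied from the table above.
RESOLUTIONS = ("256p", "480p", "720p")
ASPECT_RATIOS = ("16:9", "4:3", "1:1", "3:4", "9:16")
FRAME_RATES = (10, 16, 24, 30)
MIN_FRAMES, MAX_FRAMES = 5, 300
RECOMMENDED_MAX_PROMPT_WORDS = 300


@dataclass
class GenerationSettings:
    # Field names are illustrative, not the pipeline's real argument names.
    prompt: str
    resolution: str = "480p"
    aspect_ratio: str = "16:9"
    fps: int = 24
    num_frames: int = 189

    def problems(self):
        """Return human-readable issues; an empty list means the settings fit."""
        issues = []
        if self.resolution not in RESOLUTIONS:
            issues.append(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            issues.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if self.fps not in FRAME_RATES:
            issues.append(f"unsupported frame rate: {self.fps}")
        if not MIN_FRAMES <= self.num_frames <= MAX_FRAMES:
            issues.append(f"frame count {self.num_frames} is outside "
                          f"{MIN_FRAMES}-{MAX_FRAMES}")
        if len(self.prompt.split()) >= RECOMMENDED_MAX_PROMPT_WORDS:
            issues.append("prompt is at or above the recommended 300 words")
        return issues

    def duration_seconds(self):
        """Clip length implied by frame count and frame rate."""
        return self.num_frames / self.fps


defaults = GenerationSettings(prompt="A robot arm stacks two blocks on a table.")
print(defaults.problems())          # []
print(defaults.duration_seconds())  # 7.875
```

At the default 189 frames and 24 FPS, a clip runs just under eight seconds, which is a useful number to keep in mind when reading generation latency figures.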
Action conditioning is where Cosmos 3 becomes more specialized than a general video model. The README lists support for action dimensions across camera motion, autonomous vehicles, egocentric motion, single-arm robots, dual-arm robot settings, and humanoid robots. That is the part Physical AI teams should inspect most closely, because action dimensionality and embodiment assumptions determine whether a demo maps to a real control pipeline.
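One cheap way to catch an embodiment mismatch early is to check the shape of an action trace before it reaches the model. The sketch below assumes a JSON action array is a list of steps, each a flat list of numbers. That layout is an assumption, and the expected dimension must come from the README's action table for your embodiment; the value 3 used here is purely illustrative.

```python
import json


def check_action_array(raw_json, expected_dim):
    """Parse a JSON action array and confirm every step has expected_dim numbers."""
    actions = json.loads(raw_json)
    if not isinstance(actions, list) or not actions:
        raise ValueError("expected a non-empty JSON array of action steps")
    for index, step in enumerate(actions):
        if not isinstance(step, list) or len(step) != expected_dim:
            raise ValueError(f"step {index} does not have {expected_dim} values")
        for value in step:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"step {index} contains a non-numeric value")
    return actions


# Hypothetical 3-dimensional action trace with two steps.
trace = check_action_array("[[0.0, 0.1, -0.2], [0.0, 0.1, -0.1]]", expected_dim=3)
print(len(trace), "steps validated")
```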
How to get started
Before running examples, the README asks developers to create a Hugging Face token and authenticate locally:
uvx hf@latest auth login
From there, choose the integration based on the job.
| Goal | Use | Notes |
|---|---|---|
| Generator research | Diffusers | Python-first path for inspecting generation behavior |
| Generator production serving | vLLM-Omni | OpenAI-compatible API for image, video, sound, and action outputs |
| Reasoner research | Transformers | Listed as coming soon in the README |
| Reasoner production serving | vLLM | OpenAI-compatible endpoint for text outputs from text and vision inputs |
| Turnkey Reasoner deployment | NIM | Prebuilt optimized container |
| Training and evaluation | Cosmos Framework | Full workflow docs for inference, training, and evaluation |
Diffusers path
The Diffusers path is aimed at Generator research and model development. The README installs the latest Diffusers from GitHub alongside acceleration and media dependencies:
uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=auto \
"diffusers @ git+https://github.com/huggingface/diffusers.git" \
accelerate \
av \
cosmos_guardrail \
huggingface_hub \
imageio \
imageio-ffmpeg \
torch \
torchvision \
transformers
The important operational note: --torch-backend=auto is there to match your installed NVIDIA driver with a compatible CUDA wheel. If you force a newer CUDA wheel than your driver supports, torch.cuda.is_available() can return False even though the machine has a GPU.
vLLM-Omni path
For Generator serving, the README points to vLLM-Omni. The official Docker image is the practical path while full upstream support is still landing:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v "$(pwd):/workspace" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-omni:cosmos3 \
vllm serve nvidia/Cosmos3-Nano \
--omni \
--model-class-name Cosmos3OmniDiffusersPipeline \
--allowed-local-media-path / \
--port 8000 \
--init-timeout 1800
The long init timeout is not cosmetic. Large checkpoints can exceed default server startup limits, so the README recommends --init-timeout 1800.
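Client scripts need the same patience as the server. The sketch below polls until the server answers or a deadline passes, mirroring the 1800-second budget. It assumes the container exposes a `/health` route on port 8000, as vLLM's OpenAI-compatible server does; verify that for the vLLM-Omni image before relying on it.

```python
import time
import urllib.error
import urllib.request


def http_health_probe(url):
    """Return a zero-argument probe that is True once `url` answers HTTP 200."""
    def probe():
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeouts, and HTTP errors all mean "not ready".
            return False
    return probe


def wait_until_ready(probe, timeout_s=1800, interval_s=10,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example (not executed here; the /health path is an assumption):
# ready = wait_until_ready(http_health_probe("http://localhost:8000/health"))
```

Passing the probe, sleep, and clock as parameters keeps the loop testable without a running server.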
Reasoner serving
For Reasoner production inference, use vLLM behind an OpenAI-compatible chat-completions API. For teams that do not want to manage vLLM and CUDA setup directly, the README also documents a Reasoner path through NVIDIA NIM.
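Because the endpoint is OpenAI-compatible, a client needs nothing beyond an HTTP POST. The following sketch uses only the standard library. The base URL and model identifier are placeholders to replace with the values from your deployment, and the response parsing assumes the standard chat-completions response shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=512):
    """Build (but do not send) a POST to an OpenAI-compatible chat route."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout_s=120):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder values; substitute your server address and served model name.
request = build_chat_request(
    "http://localhost:8000",
    "<reasoner-model-id>",
    [{"role": "user", "content": "Describe the scene in one sentence."}],
)
print(request.full_url)
# answer = send(request)  # requires a running server
```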
CUDA and container constraints
Cosmos 3 is not a laptop toy unless that laptop is effectively a serious NVIDIA workstation. The README lists:
- Operating system: Linux
- Precision: BF16 tested
- GPU architectures: NVIDIA Ampere, Hopper, and Blackwell
- CUDA: CUDA 13 recommended, CUDA 12.8 supported
- Base containers: NGC PyTorch 25.09-py3 for CUDA 13 or 25.06-py3 for CUDA 12
The most common setup trap is a mismatch between system CUDA, driver support, PyTorch's CUDA build, and the uv torch backend. If torch.cuda.is_available() is false, do not assume Cosmos is broken. Check the driver, check nvidia-smi, check torch.version.cuda, and install a matching torch backend.
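Those checks can be gathered into one diagnostic script. This is a minimal sketch, assuming `nvidia-smi` is on `PATH` when a driver is installed; it degrades gracefully when `torch` or the driver tools are missing.

```python
import importlib.util
import shutil
import subprocess


def collect_cuda_diagnostics():
    """Gather the facts needed to explain a False torch.cuda.is_available()."""
    report = {"nvidia_smi_found": shutil.which("nvidia-smi") is not None}
    if report["nvidia_smi_found"]:
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version",
                 "--format=csv,noheader"],
                capture_output=True, text=True, check=False, timeout=30,
            )
            report["gpus"] = result.stdout.strip().splitlines()
        except (OSError, subprocess.TimeoutExpired) as error:
            report["gpus_error"] = str(error)
    report["torch_installed"] = importlib.util.find_spec("torch") is not None
    if report["torch_installed"]:
        import torch
        report["torch_version"] = torch.__version__
        # None here means a CPU-only wheel was installed.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


for key, value in collect_cuda_diagnostics().items():
    print(f"{key}: {value}")
```

If `torch_cuda_build` is newer than the CUDA version the reported driver supports, reinstalling with a matching torch backend is the likely fix.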
The README also calls out minimal container failures such as missing libxcb.so.1 or libgl1. On headless servers, install the system graphics packages before blaming model code:
apt-get update && apt-get install -y libxcb1 libgl1 libglib2.0-0
Benchmarks and what to read
NVIDIA keeps Cosmos 3 serving and generation benchmarks in inference_benchmarks.md. The README says those tables cover:
| Benchmark area | Surface | What it measures |
|---|---|---|
| Cosmos3-Nano generator | Generator | Text-to-image, text-to-video, and image-to-video latency across PyTorch, vLLM-Omni, and Diffusers |
| Cosmos3-Super generator | Generator | The same generation modalities at larger checkpoint scale |
| Cosmos3-Nano reasoner | Reasoner | vLLM serving metrics such as time to first token, request latency, and throughput under concurrency |
Use those numbers as engineering inputs, not marketing conclusions. For deployment planning, the real questions are:
- Which exact checkpoint?
- Which resolution and frame count?
- Which GPU and tensor-parallel setup?
- Which serving stack?
- Is the benchmark measuring first-token latency, full request latency, diffusion generation time, or throughput?
World-model benchmarks are especially easy to misread because "video generation latency" and "chat-completion latency" are not comparable workloads.
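The serving metrics named above measure different things even for a single request. The sketch below computes time to first token, request latency, and aggregate throughput from raw timestamps. The timing values are made up for illustration and are not Cosmos benchmark results.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    sent_at: float         # seconds, when the request was sent
    first_token_at: float  # seconds, when the first output token arrived
    finished_at: float     # seconds, when the last output token arrived
    output_tokens: int

    @property
    def time_to_first_token(self):
        return self.first_token_at - self.sent_at

    @property
    def request_latency(self):
        return self.finished_at - self.sent_at


def throughput_tokens_per_s(timings):
    """Aggregate output tokens per second across possibly concurrent requests."""
    window = max(t.finished_at for t in timings) - min(t.sent_at for t in timings)
    return sum(t.output_tokens for t in timings) / window


# Two hypothetical concurrent requests (illustrative numbers only).
timings = [
    RequestTiming(sent_at=0.0, first_token_at=0.5, finished_at=4.0, output_tokens=200),
    RequestTiming(sent_at=0.0, first_token_at=0.75, finished_at=5.0, output_tokens=300),
]
print(timings[0].time_to_first_token)    # 0.5
print(timings[0].request_latency)        # 4.0
print(throughput_tokens_per_s(timings))  # 100.0
```

Diffusion generation has no first-token event at all, which is one reason its latency figures cannot be set beside chat-completion numbers.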
Use cases that actually fit
Cosmos 3 is most interesting where teams need models that understand or simulate physical state, not just produce attractive clips.
Robot learning
Robot teams can use Cosmos 3 for visual reasoning, task planning, next-action prediction, action-conditioned rollouts, and policy-model development. The Cosmos3-Nano-Policy-DROID entry is a direct signal that NVIDIA is targeting manipulation and control, not only video demos.
The hard part is still embodiment. A robot policy is not portable just because two tasks both involve "a robot arm." Camera layout, gripper type, action space, environment distribution, and safety constraints all matter.
Autonomous vehicle training
Cosmos 3 can generate future rollouts and synthetic data from visual and action context. That is useful for weather diversity, lighting variation, rare events, and policy stress tests.
The failure mode is over-trusting plausible video. A clip can look physically reasonable while still violating sensor geometry, road-agent behavior, or downstream planner assumptions. AV use needs evaluation against simulator constraints, real logs, and safety cases.
Industrial video agents
Reasoner can support dense captioning, situation understanding, physical plausibility analysis, and temporal localization across factory, warehouse, logistics, traffic, and inspection footage.
For this use case, Cosmos 3 sits near NVIDIA's existing video analytics work. It may become a stronger reasoning and synthetic-data component inside broader video search, alerting, and summarization systems.
Synthetic data generation
The Generator path can produce images, video, synchronized sound, and action-conditioned future states. That makes Cosmos 3 relevant when real-world data is expensive, rare, dangerous, private, or hard to label.
Synthetic data still needs measurement. Teams should track whether generated data improves target-task performance, where it introduces bias, and whether rare-event generation creates believable but wrong edge cases.
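A simple way to start measuring is to compare target-task scores from paired training runs, with and without the synthetic data. The sketch below is one such comparison; the function name is hypothetical and the scores are placeholders, not measured results.

```python
from statistics import mean


def synthetic_data_lift(baseline_scores, augmented_scores):
    """Summarize the score change from adding synthetic data, over paired runs."""
    if not baseline_scores or len(baseline_scores) != len(augmented_scores):
        raise ValueError("need the same non-zero number of runs for both settings")
    deltas = [aug - base for base, aug in zip(baseline_scores, augmented_scores)]
    return {
        "mean_lift": mean(deltas),
        "runs_improved": sum(delta > 0 for delta in deltas),
        "runs": len(deltas),
    }


# Placeholder scores for two paired runs (same seed and eval set per pair).
summary = synthetic_data_lift([0.5, 0.5], [0.75, 0.25])
print(summary)
```

In this placeholder case the mean lift is zero even though one run improved, which is the kind of result that argues for more runs before drawing a conclusion.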
Cosmos 3 vs other world-model approaches
The world-model landscape is splitting into several shapes:
| Approach | Example | Output style | Best for |
|---|---|---|---|
| Omnimodal Physical AI model | Cosmos 3 | Text, image, video, sound, action | Robotics, AV, physical reasoning, synthetic data |
| Persistent 3D world generation | Tencent HY-World 2.0 | 3DGS, meshes, point clouds | Editable worlds and engine import |
| Interactive playable worlds | Google Genie-style systems | Video or playable scene rollouts | Agent training and game-like interaction |
| Real-time audiovisual world models | Odyssey Starchild-style systems | Streaming audio-video | Interactive media and multimodal environments |
| Video understanding models | VLMs and video agents | Text or structured outputs | Search, captioning, safety, monitoring |
Cosmos 3's differentiator is breadth across reasoning, generation, and action. Persistent 3D systems may be better when you need editable assets. Pure VLMs may be cheaper and simpler when you only need answers from video. Video generators may be more accessible when the goal is creative content rather than physical prediction.
Limitations
The README is explicit that Cosmos 3 can still produce artifacts in long, high-resolution, or physically complex outputs. Listed failure modes include:
- Temporal inconsistency
- Unstable camera or object motion
- Inaccurate sound-video alignment
- Imperfect action-state consistency
- Object morphing
- Inaccurate 3D structure
- Implausible physical dynamics
Those are not minor caveats for Physical AI. They are the boundary between a useful research system and a deployable control system.
For safety-critical robotics, autonomous driving, industrial automation, or multi-agent behavior, Cosmos 3 should be treated as one component inside a validated pipeline. You still need simulation checks, real-world tests, policy constraints, monitoring, fallback behavior, and license review.
Source links
- NVIDIA/cosmos GitHub repository
- NVIDIA Cosmos product page
- NVIDIA Cosmos 3 Hugging Face collection
- Cosmos Framework
- Cosmos 3 inference benchmarks
- OpenMDW-1.1 license
- ExplainX NVIDIA Computex 2026 event recap
- ExplainX guide to world models
- ExplainX NVIDIA video search blueprint guide
Bottom line
Cosmos 3 is NVIDIA's most concrete open attempt to make Physical AI development feel like a unified model stack: reason over video, generate plausible futures, condition on actions, serve through OpenAI-compatible APIs, and train or evaluate through the Cosmos ecosystem.
The release is strongest for teams that already understand GPU infrastructure, simulation, robotics data, or video analytics. For everyone else, the right first step is not "deploy a robot." It is to pick one bounded workflow: caption a video, localize an event, generate a short action-conditioned rollout, or benchmark a text-to-video path on a known GPU.
Status note: repository stars, model listings, CUDA guidance, and vLLM-Omni compatibility were checked against public NVIDIA materials on June 4, 2026. Verify upstream links before using this for procurement, benchmark claims, or production architecture.