← Blog
explainx / blog

Cohere Command A+: the first fully Apache 2.0 enterprise AI model that runs on 2 H100s (May 2026)

Cohere released Command A+ on May 20, 2026—a 218B parameter MoE model (25B active) with native citation generation, W4A4 lossless quantization, and full Apache 2.0 licensing. Runs on a single NVIDIA Blackwell B200 or just 2 H100 GPUs. First fully Apache-licensed frontier model from Cohere, positioning sovereign AI as accessible to enterprises and nations.

13 min readYash Thakker
CohereCommand A+Open SourceApache 2.0Enterprise AISovereign AI

MDX restores the committed source plus an HTML comment attribution; plain text bundles the rendered markdown body with the explainx.ai attribution footer.

Cohere Command A+: the first fully Apache 2.0 enterprise AI model that runs on 2 H100s (May 2026)

On May 20, 2026, Cohere released Command A+—a 218 billion parameter Sparse Mixture-of-Experts (MoE) language model with 25 billion active parameters and full Apache 2.0 open-source licensing. The model features native citation generation (explicit grounding spans linking every claim to source documents), W4A4 lossless quantization (enabling deployment on just 2 NVIDIA H100 GPUs), and 48-language support with improved efficiency in non-European languages. Command A+ is Cohere's first fully Apache-licensed frontier model, positioning sovereign AI as accessible to enterprises and nations seeking to control their own AI infrastructure. The release marks a breakthrough in quantization techniques and a strategic shift toward open-weight models for critical infrastructure.

This article is a field guide: what Command A+ is, key features, benchmarks, sovereign AI context, deployment options, and when to choose Command A+ over closed models.

TL;DR

QuestionShort answer
What is it?A 218B parameter MoE model (25B active) with native citations, W4A4 quantization, and full Apache 2.0 license—first fully open frontier model from Cohere.
AnnouncedMay 20, 2026 by Cohere.
Key innovationW4A4 lossless quantization—4-bit weights + activations with no quality degradation, enabling 2-H100 deployment.
Native citationsGenerates explicit grounding spans linking every factual claim to specific source documents or database rows.
Performance2× faster output speed, 30% lower latency vs previous Command A models. Competitive with GPT-OSS on benchmarks.
Languages48 world languages with improved efficiency in non-European languages (Arabic, Hindi, Chinese, etc.).
DeploymentRuns on a single NVIDIA Blackwell B200 or 2 H100 GPUs. Available in BF16, FP8, and W4A4 formats.
LicenseFull Apache 2.0—not just weights, but all components. No restrictions on commercial or sovereign use.
Use caseSovereign AI (nations/enterprises controlling their own infrastructure), enterprise RAG, agentic workflows, critical infrastructure.
DownloadHugging Face: CohereLabs/command-a-plus-05-2026-w4a4
Related ecosystemPairs with Cohere Transcribe, agent skills, and sovereign AI initiatives.
Live Bootcamp6 weeks

Complete AI Builder Bootcamp

Claude, Python automation & full-stack — 12 live sessions with Yash Thakker.

View bootcamp

The Complete AI Builder Bootcamp is the best AI development course for learning Claude AI, prompt engineering, Python automation, and full-stack web development. This intensive 6-week live bootcamp teaches you how to build AI-powered applications using Claude Projects, Claude Artifacts, Claude Code, and the complete Claude ecosystem. You'll master prompt engineering techniques, learn to create custom Claude connectors and MCP integrations, build Python automation workflows, develop full-stack websites with AI assistance, and create AI marketing agents.

The bootcamp includes 12 live Zoom sessions with Yash Thakker, founder of AISOLO Technologies and instructor to 350,000+ students. You'll build 8+ portfolio projects including AI playbooks, full-stack note-taking applications, Python automation scripts, marketing agents, and personal portfolio websites. The curriculum covers AI fundamentals, Claude Projects and Artifacts, Claude Co-work, Claude plugins and skills, Claude Code for Python development, full-stack development, AI marketing, and capstone projects.

Students receive 1-year access to all recordings, permanent Discord community access, a certificate of completion, and personalized career guidance. All enrollments include a 7-day money-back guarantee. This is the most comprehensive Claude AI bootcamp available, taking students from zero AI knowledge to expert AI builder in 6 weeks.

Primary source: Cohere blog · Venture Beat coverage


What is Command A+?

Command A+ is a decoder-only Sparse Mixture-of-Experts (MoE) Transformer with 218 billion total parameters, with only 25 billion active during any given generation step. This design balances frontier performance with efficient inference—generating high-quality outputs while using a fraction of the compute required by dense models.

Core capabilities:

  • Native citation generation with explicit grounding spans
  • Complex reasoning and multi-step agentic workflows
  • Multimodal document processing (text + structured data)
  • 48-language support with improved non-European efficiency
  • RAG-optimized with low-latency retrieval integration

Architecture highlights:

  • MoE design: 218B total, 25B active (reduces memory bandwidth)
  • Quantization: Available in BF16 (16-bit), FP8 (8-bit), W4A4 (4-bit)
  • Context window: TBD (not specified in initial release docs)
  • Training data: Enterprise-grade corpus (not disclosed)

Key differentiator: Command A+ is the first fully Apache 2.0 licensed frontier model from Cohere—not just model weights, but all components (tokenizer, config, training recipes where applicable). This enables sovereign AI use cases where nations and enterprises need full control without vendor lock-in.


Feature 01: Native citation generation with grounding spans

Problem: Most LLMs hallucinate or provide vague "sources" that don't match the actual content. Post-hoc retrieval systems bolt citations onto generated text, but the mapping is often wrong.

Solution: Command A+ generates explicit "grounding spans" natively during inference—each factual claim is directly linked to the specific source document or database row it pulled the information from.

Example (from Cohere blog):

  • User query: "What were the revenue figures for Q4 2025?"
  • Command A+ output: "Q4 2025 revenue was $2.3B, up 15% YoY. [Source: Q4_2025_earnings.pdf, page 3, paragraph 2]"
  • Grounding span: {"source": "Q4_2025_earnings.pdf", "page": 3, "paragraph": 2, "text": "Revenue for the quarter ending December 31, 2025 totaled $2.3 billion, reflecting a 15% increase compared to Q4 2024."}

Why this matters:

  • Reduced hallucination (model can't cite sources that don't exist)
  • Verifiable outputs (users can check the exact source passage)
  • Trust in critical applications (legal, medical, financial workflows)

How it works: During training, Command A+ learns to jointly model the generation of text and the retrieval of grounding spans. At inference, the model's attention mechanism explicitly tracks which source tokens contributed to each generated token, producing citations as a byproduct of generation (not a post-hoc step).

Use cases:

  • Enterprise RAG (retrieval-augmented generation over internal docs)
  • Legal research (cite case law, statutes, precedent with exact passages)
  • Medical diagnosis assistants (ground recommendations in clinical guidelines)
  • Financial analysis (cite earnings reports, SEC filings, analyst notes)

Feature 02: W4A4 lossless quantization—breakthrough efficiency

Problem: Most quantization techniques (e.g., GPTQ, AWQ) introduce quality degradation—perplexity increases, reasoning degrades, citations become less accurate.

Solution: Cohere's W4A4 quantization compresses both weights (W) and activations (A) to 4 bits while maintaining lossless performance relative to the BF16 baseline.

Technical details:

  • W4A4: 4-bit weights + 4-bit activations
  • Compression ratio: ~75% reduction in memory footprint (218B → ~55GB)
  • Inference speed: 2× faster output, 30% lower latency vs previous Command A models
  • Quality: No measurable perplexity degradation on internal benchmarks

Why this is a breakthrough: Most 4-bit quantization schemes (GPTQ, AWQ) only compress weights, leaving activations in FP16/BF16. Command A+ compresses both, enabling deployment on 2 H100 GPUs instead of 8+.

Deployment options:

FormatPrecisionVRAM requiredSpeedQuality
BF1616-bit~400GB (8+ H100s)BaselineBaseline
FP88-bit~200GB (4 H100s)1.5× fasterMinimal loss
W4A44-bit~80GB (2 H100s)2× fasterLossless

Single-node deployment:

  • NVIDIA Blackwell B200: 192GB HBM3e → fits W4A4 in single GPU
  • 2× H100: 80GB each → 160GB total, fits W4A4 comfortably

Why this matters for sovereign AI: Nations and enterprises can deploy frontier AI on affordable hardware (2 H100s cost ~$60K vs 8 H100s at ~$240K). This lowers the barrier to sovereign AI infrastructure.


Feature 03: 48-language support with non-European efficiency

Problem: Most LLMs are optimized for European languages (English, Spanish, French, German) and underperform on non-European languages (Arabic, Hindi, Chinese, Japanese, etc.).

Solution: Command A+ is trained with native support for 48 world languages and improved tokenization efficiency in non-European scripts.

Key improvements:

  • Arabic, Hindi, Chinese, Japanese, Korean: 20-30% fewer tokens per sentence vs GPT-4
  • Code-switching: Better handling of mixed-language text (e.g., English + Hindi in same sentence)
  • Cultural context: Training data includes regional knowledge graphs, not just translated English text

Example (from Cohere press release):

  • Input (Hindi): "भारत की राजधानी क्या है?" (What is the capital of India?)
  • Command A+ output: "भारत की राजधानी नई दिल्ली है। [स्रोत: भारत सरकार की आधिकारिक वेबसाइट]"
  • Translation: "The capital of India is New Delhi. [Source: Official website of the Government of India]"
  • Native citation: Grounding span points to Hindi-language government source, not English Wikipedia.

Why this matters:

  • Sovereign AI for non-Western nations (India, Saudi Arabia, Japan, etc.) can deploy AI in their own languages without relying on English-centric models
  • Multilingual enterprises (e.g., global banks, UN agencies) can process documents in native languages
  • Cost savings (fewer tokens = lower inference cost)

Feature 04: Agentic workflow optimization

Problem: Most LLMs struggle with multi-step agentic tasks—tool calling, long chains of reasoning, error recovery.

Solution: Command A+ is optimized for agentic workflows with:

  • Tool-calling fidelity (accurate function schemas, correct argument passing)
  • Long-context reasoning (maintains coherence over complex tasks)
  • Error recovery (gracefully handles failed tool calls, retries with corrected args)

Benchmark improvements (from Cohere blog):

  • Agentic tasks: Across-the-board improvements vs previous Command A models
  • Multi-step reasoning: 30% lower latency, 2× faster output speed
  • Tool-calling accuracy: Competitive with GPT-OSS (unclear which variant)

Use cases:

  • Customer support agents (multi-turn conversations with CRM lookups)
  • Data analysis pipelines (query databases, generate charts, summarize findings)
  • DevOps automation (monitor logs, diagnose issues, propose fixes)
  • Legal document review (search case law, extract clauses, draft summaries)

Feature 05: Full Apache 2.0 licensing—sovereign AI

Problem: Most "open" models release weights only under restrictive licenses (e.g., Llama's "acceptable use policy," Mistral's tiered licensing). This blocks sovereign AI use cases where nations/enterprises need full control.

Solution: Command A+ is released under full Apache 2.0 license—not just weights, but all components (tokenizer, config, training recipes where applicable). No restrictions on commercial, government, or military use.

What this enables:

  • Sovereign AI infrastructure (nations can run frontier AI on-premises without vendor lock-in)
  • Critical infrastructure (banks, hospitals, defense agencies can deploy without external dependencies)
  • Custom fine-tuning (modify the model for domain-specific tasks)
  • On-premises deployment (no data leaves your network)

Cohere's stated mission (from press release):

"Command A+ advances Cohere's mission to make sovereign AI a technological reality—giving enterprises and nations the power to control their own AI infrastructure."

Comparison to other licenses:

ModelLicenseCommercial useModificationsRedistributionSovereign use
Command A+Apache 2.0✅ Unlimited✅ Yes✅ Yes✅ Yes
Llama 3Custom (Meta)✅ With restrictions✅ Yes❌ Restricted❌ Restricted
Mistral LargeMistral AI License✅ Tiered✅ Limited❌ No❌ No
GPT-4Closed❌ API only❌ No❌ No❌ No
GeminiClosed❌ API only❌ No❌ No❌ No

Why this matters: For the first time, a frontier-class model (218B params, competitive with GPT-4-class) is available with zero licensing friction for sovereign AI use.


Benchmarks and performance

From Cohere blog:

Speed and efficiency

  • 2× faster output speed vs previous Command A models
  • 30% lower latency (time to first token + generation)
  • Runs on 2 H100s (W4A4 quantization)

Reasoning and agentic tasks

  • Across-the-board improvements for agentic, reasoning, and multi-step tasks
  • Competitive with GPT-OSS (unclear which variant—GPT-4? GPT-5?)
  • Tool-calling accuracy: Comparable to GPT-4-class models

Multimodal document processing

  • Native handling of structured data (tables, CSVs, JSON)
  • Grounding spans link claims to specific cells/rows
  • Better than previous Command A on document QA benchmarks

Missing public benchmarks: As of May 22, 2026, Cohere has not released:

  • MMLU, HellaSwag, TruthfulQA scores
  • Head-to-head comparison vs Llama 3, Mistral Large, GPT-4
  • Citation accuracy metrics (precision/recall on grounding spans)

Community reactions (from X/Twitter):

  • Positive: "Running on 2 H100s is huge for practical deployment."
  • Skeptical: "Where are the benches vs SOTA open models (Qwen series)?"
  • Excited: "If this is better than Gemini 3.1 Flash Lite, that's game-changing for fast agent products."

Use cases: sovereign AI, enterprise RAG, critical infrastructure

01. Sovereign AI for nations

Problem: Countries like India, Saudi Arabia, Japan, and EU nations want to deploy frontier AI without dependency on US cloud providers (AWS, Azure, GCP) or closed APIs (OpenAI, Anthropic, Google).

Solution: Command A+ enables on-premises deployment with:

  • Full Apache 2.0 license (no vendor lock-in)
  • 2-H100 efficiency (affordable hardware)
  • 48-language support (native Hindi, Arabic, Japanese, etc.)
  • Native citations (verifiable outputs for government/legal use)

Example: India's National AI Infrastructure can deploy Command A+ on local data centers, process Hindi/Tamil/Telugu documents, and generate outputs with citations to Indian legal precedent—all without data leaving the country.


02. Enterprise RAG over internal documents

Problem: Enterprises have terabytes of internal docs (contracts, emails, Confluence, Slack) but can't send them to OpenAI/Anthropic due to confidentiality.

Solution: Command A+ runs on-premises with native citations, enabling:

  • Secure RAG over sensitive documents
  • Grounding spans linking answers to exact source passages
  • Low-latency inference (2× faster output vs previous Command A)

Example: A law firm deploys Command A+ on-premises, indexes 50K case files, and lets associates query "What are the precedents for X?" with answers citing specific case numbers and paragraphs.


03. Critical infrastructure (defense, healthcare, finance)

Problem: Defense agencies, hospitals, and banks can't use cloud APIs for classified/sensitive workflows.

Solution: Command A+ runs air-gapped on local hardware:

  • No internet connection required (model weights stored locally)
  • Full control over data (no external API calls)
  • Native citations for audit trails

Example: A hospital deploys Command A+ for clinical decision support—physicians query "What are the treatment guidelines for X?" and get answers citing exact NIH/CDC guidelines with page numbers.


Deployment: 2 H100s, vLLM, transformers

Hardware requirements:

FormatGPUsVRAMThroughput
BF168× H100~400GBBaseline
FP84× H100~200GB1.5× faster
W4A42× H100~80GB2× faster
Blackwell B2001× GPU192GB2× faster

Software stack:

  • vLLM: Recommended for production serving (high throughput, low latency)
  • transformers: Standard HF API works (slower, good for prototyping)
  • LiteLLM: Unified API for multiple providers

Installation:

# W4A4 quantized model
pip install transformers torch

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "CohereLabs/command-a-plus-05-2026-w4a4",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/command-a-plus-05-2026-w4a4")

inputs = tokenizer("What were Q4 2025 revenues?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

vLLM serving:

pip install vllm

vllm serve CohereLabs/command-a-plus-05-2026-w4a4 \
    --tensor-parallel-size 2 \
    --quantization w4a4 \
    --max-model-len 8192

Command A+ vs other frontier models

FactorCommand A+Llama 3 (405B)Mistral Large 2GPT-4Gemini 1.5 Pro
Parameters218B (25B active)405B (dense)~100B (MoE)UnknownUnknown
LicenseApache 2.0 (full)Meta (restricted)Mistral (tiered)ClosedClosed
Deployment2 H100s (W4A4)8+ H100s4 H100sAPI onlyAPI only
Native citationsYesNoNoNoNo
Languages48~20~12~50+~50+
Sovereign AIYesRestrictedNoNoNo
Speed (vs baseline)2× fasterBaseline1.5× fasterUnknownUnknown

When to choose Command A+:

  • You need on-premises deployment (sensitive data, air-gapped environments)
  • You need native citations with explicit grounding spans
  • You want full Apache 2.0 license (no vendor lock-in)
  • You value efficiency (2 H100s vs 8+)
  • You work in non-European languages (Arabic, Hindi, Chinese, etc.)

When to choose alternatives:

  • You want maximum performance regardless of cost (GPT-4, Gemini 1.5 Pro)
  • You need API convenience over self-hosting (OpenAI, Anthropic)
  • You need vision/multimodal (Gemini, GPT-4o)—Command A+ is text-only

Limitations and future work

Text-only (no vision): Command A+ processes text and structured data, but not images/video. For multimodal, use GPT-4o or Gemini.

Context window unspecified: Cohere has not disclosed the context length (likely 8K-32K based on similar models).

Public benchmarks pending: As of May 22, 2026, detailed MMLU/HellaSwag/TruthfulQA scores are not public.

W4A4 quantization details proprietary: Cohere describes W4A4 as "lossless" but has not released the quantization method openly.

Future work:

  • Longer context (128K+ windows)
  • Vision/multimodal support
  • Public benchmark suite
  • Open-source quantization recipes

Related on ExplainX


Sources


Model capabilities, benchmark scores, and deployment options may change with future releases. Treat this as May 22, 2026 context—verify performance claims and license terms at cohere.com before production deployment. Command A+ is fully Apache 2.0 licensed; commercial, government, and military use is permitted without restriction.

Related posts