OpenAI and Broadcom (NASDAQ: AVGO) on June 24, 2026 unveiled Jalapeño, OpenAI's first custom AI accelerator: an Intelligence Processor designed from scratch for LLM inference, not adapted from a general-purpose GPU lineage. Broadcom President and CEO Hock Tan and Charlie Kawwas, President of Broadcom's Semiconductor Solutions Group, delivered the chip to OpenAI CEO Sam Altman and President Greg Brockman. Engineering samples are already running GPT-5.3-Codex-Spark at production-target frequency and power in the lab.
Primary sources:
- OpenAI and Broadcom unveil LLM-optimized inference chip — official announcement, chip architecture rationale, partnership scope.
- OpenAI X post (@OpenAI, Jun 24 2026) — public announcement thread.
Why OpenAI Built Its Own Silicon
The core argument Greg Brockman made in the announcement is simple: inference is where intelligence reaches people. Every ChatGPT response, every Codex task, every API call runs through inference hardware. If you own that hardware, you control cost, latency, and capacity in ways you cannot if you depend entirely on third-party silicon.
OpenAI has until now run on Nvidia GPUs—powerful, battle-tested, and expensive. GPUs were designed first for graphics, then adapted for matrix math, then further adapted for transformer models. Each layer of adaptation leaves efficiency on the table. Jalapeño starts from the premise: what if we designed a chip purely around the kernels, memory-movement patterns, and networking behaviors of modern LLMs?
Richard Ho, who leads OpenAI's hardware program, described it this way: "We optimized the architecture around the kernels, memory movement, networking, and serving patterns that matter most for frontier AI models."
The flywheel OpenAI is betting on:
- Custom silicon → better inference efficiency
- Better efficiency → cheaper and faster serving
- Cheaper serving → more capable products and lower API prices
- More usage → more revenue → reinvest in next-generation infrastructure
- Better infrastructure → next-generation models
- Better models → help design next chip (OpenAI models assisted Jalapeño's own design)
That last loop is the one Brockman highlighted explicitly: the same models served to users helped design the chip that will run future models.
Nine-Month Tape-Out: A Record in Advanced Semiconductors
Traditional ASIC development at advanced nodes (3nm, 5nm) typically takes 18–36 months from initial architecture to first silicon. Enterprise-class AI accelerators—TPU, Trainium, Gaudi—have generally required 2–3 years of dedicated engineering before reaching production.
OpenAI and Broadcom claim nine months from initial design to manufacturing tape-out. Three factors made this possible:
1. Deep Software-Hardware Co-Development
OpenAI writes and operates its own inference stack end-to-end: kernels, serving frameworks, batching logic, routing, and product APIs. That means the hardware team had direct access to the exact compute patterns to optimize for—no approximations from external benchmarks. The chip architecture was shaped by real production request traces from ChatGPT and the API.
2. Broadcom's Silicon Implementation Expertise
Broadcom has decades of experience implementing complex ASICs. Their Tomahawk networking silicon family powers some of the largest hyperscale data centers. Bringing that implementation discipline to OpenAI's architecture spec allowed parallel workstreams—architecture definition, physical design, verification, packaging—to proceed faster than a single-vendor effort could.
3. OpenAI Models Accelerating Chip Design
Perhaps the most forward-looking detail: OpenAI used its own models to accelerate "parts of the design and optimization process." This likely includes:
- RTL generation: LLMs producing hardware description language (VHDL/SystemVerilog) for repetitive logic blocks
- Verification test generation: Models writing exhaustive test cases for timing and functional correctness
- Design-space exploration: Using AI to evaluate thousands of micro-architecture variants (cache sizes, pipeline depths, memory bandwidths) faster than human engineers could manually
- Documentation and constraint generation: Auto-generating synthesis constraints and signoff checklists
This is significant because it suggests the chip-design-with-AI loop is already real and productive, not a roadmap slide.
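To make "design-space exploration" concrete, here is a minimal sketch that enumerates a small grid of hypothetical chip configurations and ranks them with a toy throughput-per-watt score. Every knob, value range, and the cost model itself are assumptions invented for illustration. None of it reflects OpenAI's or Broadcom's actual tooling or Jalapeño's specifications.

```python
from itertools import product

# Hypothetical design knobs; values are illustrative, not Jalapeño specs.
SRAM_MB = [64, 128, 256]
HBM_TBPS = [2.0, 4.0, 6.0]
PEAK_TFLOPS = [500, 1000, 2000]

def score(sram_mb, hbm_tbps, peak_tflops):
    """Toy tokens-per-second-per-watt estimate for a decode workload."""
    # Assumed workload: 70 GB of FP8 weights streamed once per step, batch of 32.
    weight_gb, params, batch = 70.0, 70e9, 32
    # Assumption: on-chip SRAM absorbs some memory traffic, capped at 30%.
    traffic_saved = min(0.3, sram_mb / 1024)
    mem_time = weight_gb * (1 - traffic_saved) / (hbm_tbps * 1000)  # seconds
    # Roughly 2 FLOPs per parameter per generated token.
    compute_time = 2 * params * batch / (peak_tflops * 1e12)  # seconds
    step_time = max(mem_time, compute_time)
    # Assumed power model (watts): fixed base plus a cost per resource.
    watts = 100 + 0.2 * sram_mb + 40 * hbm_tbps + 0.1 * peak_tflops
    return batch / step_time / watts

def explore():
    """Return every configuration, sorted from best to worst score."""
    configs = product(SRAM_MB, HBM_TBPS, PEAK_TFLOPS)
    return sorted(configs, key=lambda c: score(*c), reverse=True)

best = explore()[0]
print("best (SRAM MB, HBM TB/s, peak TFLOPS):", best)
```

A real flow would replace `score` with simulation or synthesis results and search far larger spaces. The sketch only shows the shape of the loop: enumerate, evaluate, rank.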
Architecture: What "Blank Slate for LLM Inference" Means
OpenAI has not released a detailed datasheet yet (a full technical report is promised "in the coming months"), but the announcement describes the design philosophy in enough detail to understand the architectural bets:
Reducing Data Movement
Data movement is the primary bottleneck—and energy consumer—in LLM inference. Every token generation requires:
- Loading model weights from memory into compute units
- Computing attention over the KV cache (past context)
- Accumulating activations through MLP layers
- Sampling from the output distribution
In a general-purpose GPU, these loads traverse a generic memory hierarchy (HBM → L2 → L1 → registers) that is not tuned for transformer access patterns. Jalapeño, by contrast, was designed with knowledge of which data structures (weight matrices, KV tensors, activation buffers) are accessed in which order.
The announcement says the architecture "reduces data movement"—meaning the chip likely has:
- Larger on-chip SRAM sized to hold key working sets
- Custom memory controllers optimized for transformer weight shapes
- Fused kernel support that keeps intermediate activations in registers rather than spilling to DRAM
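The benefit of kernel fusion can be estimated by counting off-chip traffic. The sketch below is a simplified model, assuming a chain of elementwise operations over one tensor, and compares the bytes moved with and without fusion. The function name and the example sizes are hypothetical.

```python
def offchip_bytes(n_elements, bytes_per_element, n_ops, fused):
    """Off-chip memory traffic for a chain of n_ops elementwise operations."""
    tensor_bytes = n_elements * bytes_per_element
    if fused:
        # Read the input once and write the final output once;
        # intermediates stay on-chip.
        return 2 * tensor_bytes
    # Each op reads its input from memory and writes its output back.
    return 2 * tensor_bytes * n_ops

# Example: three chained ops over one million FP16 values.
unfused = offchip_bytes(1_000_000, 2, 3, fused=False)
fused = offchip_bytes(1_000_000, 2, 3, fused=True)
print(unfused, fused, unfused / fused)
```

In this model, fusing three operations cuts off-chip traffic by a factor of three. Real kernels also involve weights and reductions, so the actual saving varies.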
Balancing Compute, Memory, and Networking
A common failure mode in AI accelerator design is roofline imbalance: more raw FLOPS than memory bandwidth can feed, leaving compute units idle waiting for data. Jalapeño was apparently designed to avoid this by co-optimizing all three dimensions against OpenAI's actual model shapes and serving patterns.
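The roofline model behind this argument fits in a few lines. The following is a minimal sketch; the chip figures in the example (1,000 TFLOPS peak, 4 TB/s of memory bandwidth) are made-up placeholders, not Jalapeño numbers.

```python
def attainable_tflops(peak_tflops, mem_bw_tbps, intensity):
    """Roofline model: throughput is capped by compute or by memory traffic.

    intensity is arithmetic intensity in FLOPs per byte moved from memory.
    """
    # TB/s * FLOPs/byte = TFLOP/s
    return min(peak_tflops, mem_bw_tbps * intensity)

def ridge_point(peak_tflops, mem_bw_tbps):
    """Arithmetic intensity above which the chip becomes compute-bound."""
    return peak_tflops / mem_bw_tbps

# Hypothetical chip: 1,000 TFLOPS peak, 4 TB/s memory bandwidth.
peak, bw = 1000.0, 4.0
print(ridge_point(peak, bw))             # 250.0 FLOPs per byte
print(attainable_tflops(peak, bw, 2.0))  # 8.0 TFLOPS, under 1% of peak
```

A workload with an intensity of 2 FLOPs per byte, which is about what batch-size-1 decoding of FP8 weights gives, leaves this hypothetical chip almost entirely idle. That is the imbalance a balanced inference design tries to avoid.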
For an inference chip, the relevant ratios look different than for a training chip:
| Dimension | Training priority | Inference priority |
|---|---|---|
| Raw FLOPS | Maximize | Match to memory BW |
| Memory BW | High | Critical (weight streaming) |
| On-chip SRAM | Moderate | Large (KV cache) |
| Network latency | Tolerable | Low (request tail latency) |
| Power per token | Secondary | Primary (cost structure) |
Because OpenAI designs its models, kernels, and serving system, the chip team could target the exact model widths and depths, attention head counts, and batch sizes that appear in production. That should bring realized utilization much closer to theoretical peak.
Networking via Broadcom Tomahawk
Broadcom's Tomahawk networking silicon is a hyperscale Ethernet switch series. At gigawatt-scale deployment, Jalapeño nodes will need to form large inference clusters for:
- Tensor parallelism: Splitting a single large model across many chips
- Pipeline parallelism: Distributing model layers across a deep chip pipeline
- Disaggregated serving: Prefill nodes (processing input prompts) and decode nodes (generating tokens) on separate hardware
Tomahawk's high-radix, low-latency fabric is a natural fit—it's already deployed at the scale of thousands of servers in hyperscaler data centers. Integrating it into the Jalapeño platform from day one signals OpenAI is targeting not just single-chip performance but cluster efficiency at gigawatt scale.
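As a rough illustration of tensor parallelism, the sketch below splits a weight matrix column-wise across simulated chips, lets each compute a partial result, and concatenates the outputs (the step a real cluster performs with an all-gather over the network). It is plain Python for clarity. The helper names are hypothetical and nothing here describes Jalapeño's actual partitioning scheme.

```python
def matmul(x, w):
    """Multiply row vector x (length k) by a k-by-n matrix w (list of rows)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]

def split_columns(w, n_shards):
    """Split matrix columns into n_shards contiguous, near-equal slices."""
    n = len(w[0])
    bounds = [n * s // n_shards for s in range(n_shards + 1)]
    return [[row[bounds[s]:bounds[s + 1]] for row in w]
            for s in range(n_shards)]

def tensor_parallel_matmul(x, w, n_shards):
    """Each simulated chip holds one column shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    return [value for part in partials for value in part]

x = [1.0, 2.0, 3.0]
w = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1]]
# The sharded result matches the single-chip result.
assert tensor_parallel_matmul(x, w, 2) == matmul(x, w)
print(tensor_parallel_matmul(x, w, 2))  # [7.0, 5.0, 4.0, 10.0]
```

Each shard needs the full input vector but only its slice of the weights, which is why per-chip memory drops while network traffic rises.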
What GPT-5.3-Codex-Spark Running in Lab Means
The fact that engineering samples are running GPT-5.3-Codex-Spark (a production model) at production-target frequency and power is a stronger signal than it might appear.
Most AI chip announcements show toy workloads or synthetic benchmarks in early silicon. Running a real production model means:
- Firmware and kernel stack is functional end-to-end
- Memory controllers and interconnect are delivering sufficient bandwidth for the model
- Numerical correctness (FP8/FP16 outputs matching reference) is validated
- Power and thermals are within target envelope at production frequency
"Production target frequency and power" means the chip isn't running slowly to avoid errors—it's at the operating point it will ship at. That's a meaningful milestone for a nine-month project.
OpenAI notes that it is "still measuring final performance": the final numbers have not been locked, but engineering samples are functioning correctly at spec.
The Full-Stack Strategy: Products → Models → Infrastructure → Silicon
Jalapeño completes a vertical integration stack that previously stopped at software:
ChatGPT / Codex / API (Products)
↓
GPT-5.x / o-series (Models)
↓
Kernels / Serving / Routing (Software)
↓
Jalapeño + Tomahawk (Silicon) ← new
↓
Data center (Microsoft, partners)
The analogy often cited is Apple's A-series chips. Designing both iOS and the A-series silicon let Apple make optimizations that are impossible when hardware and software come from separate companies. The iPhone's multi-year lead over Android in battery life and performance per watt is often attributed largely to that vertical integration.
For AI inference, the analogous opportunity is per-token cost and latency. If Jalapeño delivers the claimed performance-per-watt advantage over current state-of-the-art (a detailed number is forthcoming), OpenAI could:
- Lower API prices without cutting margins
- Increase context windows affordably (longer KV caches need more memory capacity and bandwidth, which custom silicon may deliver more cheaply)
- Secure capacity for its own products instead of competing with other Nvidia customers for GPU supply
- Reduce exposure to Nvidia supply constraints and pricing power
Partner Roles: Broadcom and Celestica
Broadcom
Broadcom's role goes beyond silicon implementation:
- Chip implementation: Physical design, timing closure, DFT, packaging at advanced node
- Networking silicon: Tomahawk switches for cluster-scale interconnect
- Connectivity technologies: SerDes, optical interfaces, and system-level integration know-how
Broadcom has deep relationships with TSMC for advanced node manufacturing and experience managing the complexity of 5nm/3nm tape-outs. Their ecosystem of chiplet packaging and CoWoS-adjacent technologies may also be relevant for Jalapeño's memory subsystem.
Celestica
Celestica handles the mechanical and system layer:
- Board design: PCB layout, signal integrity for high-speed memory interfaces
- Rack integration: Power delivery, cooling, cabling at rack level
- System production: Manufacturing at scale, quality control, supply chain management
Custom AI accelerators need custom boards—the connector layout, power delivery network, and thermal solution all differ from standard GPU server designs. Celestica's manufacturing scale enables the rapid ramp to gigawatt-scale deployment that Broadcom CEO Hock Tan referenced.
Gigawatt Scale: What That Actually Means
Hock Tan's quote—"deployment of gigawatt scale data centers with Microsoft and other partners beginning in 2026"—translates to enormous numbers.
Power math:
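- Assume each accelerator draws ~200–400W (flagship training GPUs such as the H100 SXM are rated up to 700W)
- 1 MW of chip power then supports ~2,500–5,000 chips
- 1 GW supports ~2.5–5 million chips, before cooling, networking, and host overhead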
That's not one data center. It's a network of hyperscale facilities, consistent with Microsoft's widely reported $80B+ AI infrastructure investment commitment for 2025–2026, much of it in capacity that serves OpenAI workloads.
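The power arithmetic above can be checked with a short helper. This is a minimal sketch; the `overhead` multiplier is an assumption added here to show how cooling and other facility losses would reduce the chip count, and 1.3 is only an illustrative value.

```python
def chips_for_power(site_watts, chip_watts, overhead=1.0):
    """Number of chips a power budget supports.

    overhead is facility watts consumed per watt of chip power
    (a PUE-like multiplier); 1.0 means no overhead at all.
    """
    return int(site_watts / (chip_watts * overhead))

GIGAWATT = 1e9
print(chips_for_power(GIGAWATT, 400))  # 2500000
print(chips_for_power(GIGAWATT, 200))  # 5000000
# With an assumed 30% facility overhead the count drops below 2.5 million.
print(chips_for_power(GIGAWATT, 400, overhead=1.3))
```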
Why inference, not training?
Training is a one-time (or periodic) cost. Inference is continuous—every ChatGPT user generates inference traffic every second they use the product. At OpenAI's scale (reported 400M+ weekly users), inference hardware needs to be massive to keep latency acceptable. Training can be done in bursts on rented GPU clusters; inference needs to be reliable, low-latency, and always available.
Implications for the AI Infrastructure Landscape
For OpenAI Users and Developers
If Jalapeño delivers on its performance-per-watt promise:
- Faster responses: Lower latency for ChatGPT and Codex
- Cheaper API: More efficient infrastructure can translate to lower per-token prices
- Longer context: Memory-efficient inference enables larger context windows affordably
- Better reliability: Owning the hardware stack reduces dependency on third-party supply chains
For Nvidia
Nvidia's data center GPU business has been the primary beneficiary of the AI boom, and OpenAI is one of its most prominent customers, both directly and through cloud partners. If Jalapeño can serve OpenAI's inference workloads more efficiently, it directly reduces future Nvidia GPU demand from OpenAI, at least for inference.
Training, however, remains a different story. Nvidia's H100/B200 architecture has a network effect around CUDA tooling, existing training frameworks, and researcher familiarity. OpenAI is unlikely to replace training hardware immediately even if Jalapeño succeeds at inference.
For Other AI Labs
The signal Jalapeño sends is clear: at sufficient scale, building custom inference silicon is worth it. Google has been doing this with TPUs since 2016. Amazon has Trainium (training) and Inferentia (inference). Meta has MTIA. OpenAI joining this group means the large labs are increasingly competing on infrastructure, not just on model quality.
Smaller labs without the revenue to justify ASIC development face a structural cost disadvantage that widens as each incumbent's inference silicon becomes more efficient.
For Broadcom's Business
Broadcom CEO Hock Tan called this "a fundamental commitment to scaling the physical infrastructure required for the next decade of AI." The multi-generation chip platform agreement with OpenAI is a new, high-volume revenue stream diversified from Broadcom's networking switch business.
Broadcom also works with Google on TPU silicon and reportedly has similar custom-silicon relationships with Apple and several hyperscalers. Jalapeño adds OpenAI to its AI accelerator customer list.
Technical Deep Dive: Why Inference Silicon Differs from Training Silicon
Understanding why inference chips look different from training chips is useful context for anyone building on AI APIs.
Training vs. Inference: Different Bottlenecks
Training is largely compute-bound: the goal is to maximize FLOPS for gradient computation across billions of parameters.
Inference, especially the token-by-token decode phase, is memory-bandwidth-bound: the goal is to stream model weights (often tens to hundreds of gigabytes) through compute units as fast as possible to generate tokens one at a time.
A simple illustration for a 70B parameter model:
- Weight size at FP8: 70B × 1 byte ≈ 70 GB
- Generating 1 token at batch size 1 requires reading all 70 GB (one forward pass)
- Generating 100 tokens/second therefore requires 70 GB × 100 = 7 TB/s of memory bandwidth
That is roughly twice the HBM bandwidth of an H100 (~3.35 TB/s), and it serves only a single request stream. This forces batching: combining multiple users' requests so each weight load serves multiple tokens. Larger batches use compute more efficiently, but they increase latency for individual users.
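The arithmetic above, and the way batching amortizes weight loads, can be sketched as follows. This simplified model assumes weight streaming is the only cost and ignores KV cache traffic and compute time, both of which grow with batch size and are where the latency penalty comes from. The function names are illustrative.

```python
def required_bandwidth_tbps(params_billion, bytes_per_param, tokens_per_s):
    """Bandwidth needed to stream all weights once per generated token."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return weight_gb * tokens_per_s / 1000.0      # GB/s -> TB/s

def decode_throughput(weight_gb, mem_bw_tbps, batch):
    """Upper bound on decode rate when weight streaming is the only cost."""
    steps_per_s = mem_bw_tbps * 1000.0 / weight_gb  # full weight passes per second
    return {
        "per_user_tok_s": steps_per_s,             # one token per user per pass
        "aggregate_tok_s": steps_per_s * batch,    # each pass serves the whole batch
    }

print(required_bandwidth_tbps(70, 1, 100))  # 7.0 TB/s, as in the text
# 70 GB of weights on ~3.35 TB/s of HBM, batch of 32:
print(decode_throughput(70.0, 3.35, 32))
```

Under these assumptions a single user sees about 48 tokens/second regardless of batch size, while aggregate throughput scales with the batch.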
Jalapeño's design presumably addresses this tradeoff by:
- Maximizing HBM bandwidth per chip
- Maximizing on-chip SRAM to cache frequently-used weights and KV tensors
- Minimizing data movement by keeping activations on-chip across layers
- Optimizing batch scheduling to maximize realized throughput per watt
The KV Cache Problem
The KV cache stores attention keys and values from earlier tokens in a conversation. It grows linearly with context length. For a 128K context window with a large model:
- KV cache size per user ≈ 32 GB (approximate for a 70B model at 128K context)
- For 1,000 concurrent users: 32 TB of KV cache storage
This means inference at scale requires either:
- Very large on-chip or near-chip memory
- Efficient KV cache compression (quantization, eviction policies)
- Disaggregated memory tiers (fast NVMe or CXL-attached memory for older context)
Jalapeño's blank-slate design plausibly includes dedicated hardware support for KV cache management, a feature absent in general-purpose GPUs.
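The per-user KV cache estimate follows from a simple formula. The sketch below assumes a model shape typical of publicly documented 70B-class models with grouped-query attention (80 layers, 8 KV heads, head dimension 128). These values are assumptions chosen for illustration, not figures from the announcement.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Size of one sequence's KV cache.

    The leading factor of 2 accounts for storing both keys and values
    in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value

CONTEXT = 128 * 1024  # 128K tokens

fp16 = kv_cache_bytes(80, 8, 128, CONTEXT, bytes_per_value=2)
fp8 = kv_cache_bytes(80, 8, 128, CONTEXT, bytes_per_value=1)
print(fp16 / 1e9)  # about 42.9 GB
print(fp8 / 1e9)   # about 21.5 GB
```

Both results sit in the same tens-of-gigabytes range as the ~32 GB estimate above. The exact figure depends on the attention layout and on the precision used for cached values.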
Networking for Disaggregated Inference
Modern serving systems separate prefill (processing the input prompt—compute-heavy) from decode (generating output tokens one at a time—memory-bandwidth-heavy). Disaggregating these onto separate hardware pools improves utilization.
Jalapeño + Tomahawk enables fast interconnects between prefill nodes and decode nodes. This is similar to what Google described for TPU 8i's Boardfly topology and Collectives Acceleration Engine, suggesting the industry is converging on disaggregated inference as the serving paradigm for large-scale LLM products.
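One reason the interconnect matters for disaggregated serving is that the prefill node has to ship the prompt's KV cache to the decode node. The sketch below estimates that handoff time under stated assumptions: 327,680 bytes of KV cache per token (an assumed 70B-class shape at FP16) and an idealized link with no protocol overhead. The link speeds are illustrative and are not Tomahawk specifications.

```python
# Assumed shape: 2 (K and V) * 80 layers * 8 KV heads * 128 head dim * 2 bytes
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # 327,680 bytes

def kv_handoff_ms(prompt_tokens, kv_bytes_per_token, link_gbps):
    """Time to move a prompt's KV cache from a prefill node to a decode node."""
    bits = prompt_tokens * kv_bytes_per_token * 8
    return bits / (link_gbps * 1e9) * 1000.0  # seconds -> milliseconds

for gbps in (100, 400, 800):
    ms = kv_handoff_ms(8192, KV_BYTES_PER_TOKEN, gbps)
    print(f"{gbps} Gb/s: {ms:.1f} ms for an 8K-token prompt")
```

At an assumed 800 Gb/s the handoff for an 8K-token prompt takes roughly 27 ms; at 100 Gb/s it takes over 200 ms, which would be noticeable in time to first token.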
What This Means for AI Education and Skill Building
The Jalapeño announcement is easy to read as purely an infrastructure story. But it has clear implications for practitioners building on AI:
API Pricing Will Change
If Jalapeño delivers the efficiency gains OpenAI claims, downstream API pricing will likely fall—continuing the trend of rapidly declining per-token costs. This matters for:
- Agent pipelines with high token volumes (multi-turn reasoning, long document processing)
- Real-time applications where latency is the current blocker
- Startups that couldn't afford frontier-model API costs at their required scale
Budget your roadmap knowing inference costs are likely to fall over 2026–2027.
Infrastructure Ownership Is a Moat
For anyone building on OpenAI APIs: the company's vertical integration, from silicon to serving to product, creates a compounding advantage. Its inference cost structure is likely to improve faster than that of labs relying on third-party hardware. Build on APIs while understanding that the underlying economics will shift, and position your product on differentiation above the infrastructure layer.
The Full-Stack Flywheel Is Real
OpenAI using its own models to design the chip that will run future models is not a press release flourish—it's a concrete demonstration of the AI capability flywheel. Labs with deployed products and real revenue can invest those proceeds into infrastructure improvements that make future products better. This dynamic accelerates as the loop tightens.
Learn the Stack, Not Just the API
Understanding how inference hardware works—memory bandwidth, batching, KV cache—makes you a better AI engineer even if you never touch hardware. The constraints of inference silicon explain why context windows have pricing tiers, why latency varies with load, and why some model sizes are more cost-effective than others. These are skills that transfer across providers.
Courses and bootcamps at ExplainX cover these fundamentals—not just "how to call the API" but "why the API behaves the way it does and how to build systems that use it efficiently."
What to Watch For
- Technical report: OpenAI promises a detailed performance report "in the coming months." The headline claim—"performance per watt substantially better than current state-of-the-art"—will need numbers against a defined baseline (H100? H200? B200?) to be meaningful.
- API price changes: Watch for pricing updates to GPT-5.3 and Codex models in H2 2026 as Jalapeño comes online.
- Second generation: This is described as a multi-generation platform. Architectural lessons from Jalapeño v1 (measured production behavior, observed bottlenecks) will inform v2—likely on a faster development cadence given the nine-month precedent.
- Training silicon: Jalapeño is inference-only. If OpenAI eventually targets training with custom silicon, it would represent a more fundamental break from Nvidia dependency.
- Competitive response: Google has TPU 8t/8i. Amazon has Trainium/Inferentia. Microsoft/AMD partnerships are evolving. The AI chip competitive landscape is moving fast; Jalapeño is a signal that OpenAI intends to compete on infrastructure, not just models.
Performance claims are based on OpenAI's June 24, 2026 announcement. Final benchmark numbers, pricing, and availability details will be in the forthcoming technical report. This article is not financial advice.