Cloudflare Workers AI
Status: Production Ready
Last Updated: 2026-01-21
Dependencies: cloudflare-worker-base (for Worker setup)
Latest Versions: [email protected], @cloudflare/[email protected], [email protected]
Recent Updates (2025):
- April 2025 - Performance: Llama 3.3 70B 2-4x faster (speculative decoding, prefix caching), BGE embeddings 2x faster
- April 2025 - Breaking Changes: max_tokens now correctly defaults to 256 (was not respected), BGE pooling parameter (cls NOT backwards compatible with mean)
- 2025 - New Models (14): Mistral 3.1 24B (vision+tools), Gemma 3 12B (128K context), EmbeddingGemma 300M, Llama 4 Scout, GPT-OSS 120B/20B, Qwen models (QwQ 32B, Coder 32B), Leonardo image gen, Deepgram Aura 2, Whisper v3 Turbo, IBM Granite, Nova 3
- 2025 - Platform: Context windows API change (tokens not chars), unit-based pricing with per-model granularity, workers-ai-provider v3.0.2 (AI SDK v5), LoRA rank up to 32 (was 8), 100 adapters per account
- October 2025: Model deprecations (use Llama 4, GPT-OSS instead)
Quick Start (5 Minutes)
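// wrangler.jsonc
{ "ai": { "binding": "AI" } }

// Worker entry point
interface Env {
  AI: Ai; // Workers AI binding type from @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // With stream: true the result is a ReadableStream of server-sent events
    const stream = await env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
      messages: [{ role: 'user', content: 'Tell me a story' }],
      stream: true,
    });
    return new Response(stream, {
      headers: { 'content-type': 'text/event-stream' },
    });
  },
};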
Why streaming? Prevents buffering in memory, faster time-to-first-token, avoids Worker timeout issues.
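On the consuming side, the streamed body has to be split back into tokens. The following is a minimal sketch of that step, assuming each event arrives as a single `data:` line carrying JSON with a `response` string and that the stream ends with `data: [DONE]`. The helper name `extractTokens` is hypothetical and not part of any Cloudflare API.

```typescript
// Extracts text tokens from a chunk of complete SSE lines.
// Assumption: one JSON payload per `data:` line, terminated by `data: [DONE]`.
function extractTokens(sseText: string): { tokens: string[]; done: boolean } {
  const tokens: string[] = [];
  let done = false;
  for (const line of sseText.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.startsWith('data:')) continue; // skip blank lines and comments
    const payload = trimmed.slice('data:'.length).trim();
    if (payload === '[DONE]') {
      done = true;
      break;
    }
    try {
      const parsed = JSON.parse(payload) as { response?: unknown };
      if (typeof parsed.response === 'string') tokens.push(parsed.response);
    } catch {
      // Ignore payloads that are not valid JSON (e.g. a partially received event).
    }
  }
  return { tokens, done };
}
```

A real client would buffer incoming bytes until a full line is available before calling a helper like this, since network chunks can split an event in the middle.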
Known Issues Prevention
This skill prevents 7 documented issues:
Issue #1: Context Window Validation Changed to Tokens (February 2025)
Error: "Exceeded character limit" despite model supporting larger context
Source: Cloudflare Changelog
Why It Happens: Before February 2025, Workers AI validated prompts using a hard 6144 character limit, even for models with larger token-based context windows (e.g., Mistral with 32K tokens). After the update, validation switched to token-based counting.
Prevention: Calculate tokens (not characters) when checking context window limits.
import { encode } from 'gpt-tokenizer';

// gpt-tokenizer uses OpenAI encodings, so the count is only an approximation
// for Mistral models; leave some headroom.
const tokens = encode(prompt);
const contextWindow = 32768; // Model context window, in tokens
const maxResponseTokens = 2048; // Tokens reserved for the response
if (tokens.length + maxResponseTokens > contextWindow) {
  throw new Error(`Prompt exceeds context window: ${tokens.length} tokens`);
}
const response = await env.AI.run('@cf/mistral/mistral-7b-instruct-v0.2', {
  messages: [{ role: 'user', content: prompt }],
  max_tokens: maxResponseTokens,
});
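The same budget check can be factored into a small pure helper so it is easy to unit test. This is a sketch only: `checkTokenBudget` and `estimateTokens` are hypothetical names, and the four-characters-per-token ratio is a rough heuristic for English text, not a property of any particular model's tokenizer.

```typescript
interface TokenBudget {
  promptTokens: number;
  remaining: number; // tokens left after the prompt and the reserved response
  fits: boolean;
}

// Mirrors the check above: the prompt plus the reserved response tokens
// must not exceed the context window.
function checkTokenBudget(
  promptTokens: number,
  maxResponseTokens: number,
  contextWindow: number
): TokenBudget {
  const remaining = contextWindow - promptTokens - maxResponseTokens;
  return { promptTokens, remaining, fits: remaining >= 0 };
}

// Fallback when no tokenizer is available.
// Assumption: roughly 4 characters per token, which is only a heuristic.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}
```

For example, `checkTokenBudget(estimateTokens(prompt), 2048, 32768).fits` gives a quick pre-flight answer before calling the model, at the cost of some accuracy compared with a real tokenizer.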
Issue #2: Neuron Consumption Discrepancies in Dashboard
Error: Dashboard neuron usage significantly exceeds expected token-based calculations
Source: Cloudflare Community Discussion
Why It Happens: Users report the dashboard showing neuron consumption in the hundreds of millions for token usage in the thousands, particularly with AutoRAG features and certain models. The gap between expected neuron consumption (based on the pricing docs) and the dashboard metrics is not fully documented.
Prevention: Monitor neuron usage via AI Gateway logs and correlate it with requests. File a support ticket if consumption significantly exceeds expectations.
const response = await env.AI.run(
  '@cf/meta/llama-3.1-8b-instruct',
  { messages: [{ role: 'user', content: query }] },
  { gateway: { id: 'my-gateway' } }
);
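To decide whether a dashboard figure is worth a support ticket, it can help to compute an expected neuron count from logged token totals. The sketch below assumes per-model neuron rates expressed per million input and output tokens; the `NeuronRates` shape, the function names, and the 10x tolerance are all hypothetical, and the actual rates must be taken from the pricing documentation for the model in use.

```typescript
// Hypothetical shape: fill in the values from the pricing docs for your model.
interface NeuronRates {
  inputPerMillionTokens: number;
  outputPerMillionTokens: number;
}

// Expected neurons for a given volume of input and output tokens.
function expectedNeurons(
  inputTokens: number,
  outputTokens: number,
  rates: NeuronRates
): number {
  return (
    (inputTokens / 1e6) * rates.inputPerMillionTokens +
    (outputTokens / 1e6) * rates.outputPerMillionTokens
  );
}

// Flags usage that exceeds the expectation by more than `tolerance` times.
// The default of 10 is an arbitrary illustration, not a recommended threshold.
function isSuspicious(observed: number, expected: number, tolerance = 10): boolean {
  return expected > 0 && observed / expected > tolerance;
}
```

Feeding this with token counts taken from AI Gateway logs gives a concrete number to quote when reporting a discrepancy.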
Issue #3: AI Binding Requires Remote or Latest Tooling in Local Dev
Error: "MiniflareCoreError: wrapped binding module can't be resolved (internal modules only)"
Source: GitHub Issue #6796
Why It Happens: When using Workers AI bindings with Miniflare in local development (particularly with custom Vite plugins), the AI binding depends on external workers that older versions of unstable_getMiniflareWorkerOptions do not expose correctly. The error occurs when Miniflare cannot resolve the internal AI worker module.
Prevention: Use remote bindings for AI in local dev, or update to the latest @cloudflare/vite-plugin.
// wrangler.jsonc - Option 1: Use remote AI binding in local dev
{
  "ai": {
    "binding": "AI",
    "remote": true // Route local calls to the deployed Workers AI service
  }
}
# Option 2: Update the Vite plugin, then restart the dev server
npm install -D @cloudflare/vite-plugin@latest
npm run dev
Issue #4: Flux Image Generation NSFW Filter False Positives
Error: "AiError: Input prompt contains NSFW content (code 3030)" for innocent prompts
Source: Cloudflare Community Discussion
Why It Happens: Flux image generation models (@cf/black-forest-labs/flux-1-schnell) sometimes trigger false positive NSFW content errors even with innocent single-word prompts like "hamburger". The NSFW filter can be overly sensitive without context.
Prevention: Add descriptive context around potential trigger words instead of using single-word prompts.
// Avoid: a single-word prompt may trigger a false positive
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'hamburger',
});

// Prefer: a descriptive prompt that gives the filter context
const response = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A photo of a delicious large hamburger on a plate with lettuce and tomato',
  num_steps: 4,
});
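Where prompts come from users and cannot be rewritten by hand, one possible approach is to retry once with added context when error code 3030 is returned. This is a sketch under assumptions: the `RunImage` callback stands in for a call to `env.AI.run`, the helper names are hypothetical, the error is detected by looking for "3030" in the message as quoted above, and the wording of the fallback prompt is an arbitrary illustration.

```typescript
// Stand-in for a call such as env.AI.run(model, { prompt, num_steps: 4 }).
type RunImage = (prompt: string) => Promise<unknown>;

// Detects the NSFW error by its code, as it appears in the error message.
function isNsfwError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return message.includes('3030');
}

// Tries the prompt as given, then retries once with descriptive context
// if the NSFW filter rejected it. Other errors are rethrown unchanged.
async function generateWithContext(run: RunImage, subject: string): Promise<unknown> {
  try {
    return await run(subject);
  } catch (err) {
    if (!isNsfwError(err)) throw err;
    return run(`A photo of ${subject} on a plain background, natural lighting`);
  }
}
```

The retry is limited to a single attempt so that a prompt the filter genuinely rejects still surfaces the original kind of error to the caller.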
Issue #5: Image Generation Error 1000 - Missing num_steps Parameter
Error: "Error: unexpected type 'int32' with value 'undefined' (code 1000)"
Source: Cloudflare Community Discussion
Why It Happens: Image generation API calls return error code 1000 when the num_steps parameter is not provided, even though documentation suggests it's optional. The parameter is actually required for most Flux models.
Prevention: Always pass num_steps explicitly for image generation models (4 is the typical value for Flux Schnell).
const image = await env.AI.run('@cf/black-forest-labs/flux-1-schnell', {
  prompt: 'A beautiful sunset over mountains',
  num_steps: 4,
});
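To avoid forgetting the parameter at individual call sites, the default can be applied in one place. A minimal sketch, assuming a hypothetical `ImageInputs` shape and helper name; the default of 4 follows the value used above for Flux Schnell.

```typescript
// Hypothetical input shape for illustration; real models accept more fields.
interface ImageInputs {
  prompt: string;
  num_steps?: number;
}

// Returns a copy of the inputs with num_steps always set.
function withNumSteps(inputs: ImageInputs, defaultSteps = 4): Required<ImageInputs> {
  return { ...inputs, num_steps: inputs.num_steps ?? defaultSteps };
}
```

A call site would then pass `withNumSteps({ prompt: 'A beautiful sunset over mountains' })` as the inputs argument, and an explicitly supplied `num_steps` is left untouched.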
Issue #6: Zod v4 Incompatibility with Structured Output Tools
Error: Syntax errors and failed transpilation when using Stagehand with Zod v4
Source: GitHub Issue #10798
Why It Happens: Stagehand (browser automation) and some structured output examples in Workers AI fail with Zod v4 (now default). The underlying zod-to-json-schema library doesn't yet support Zod v4, causing transpilation failures.
Prevention: Pin Zod to v3 until zod-to-json-schema supports v4.
npm install zod@3
{
  "dependencies": {
    "zod": "~3.23.8"
  }
}
Issue #7: AI Gateway Cache Headers for Per-Request Control
Not an error, but an important feature: AI Gateway supports per-request cache control via HTTP headers, covering custom TTLs, cache bypass, and custom cache keys beyond the dashboard defaults.
Source: AI Gateway Caching Documentation
Use When: You need different caching behavior for different requests (e.g., 1 hour for expensive queries, skip cache for real-time data).
Implementation: See AI Gateway Integration section below for header usage.
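As a rough illustration of per-request control, the helper below builds the header set for a single request. It assumes the AI Gateway header names `cf-aig-cache-ttl`, `cf-aig-skip-cache`, and `cf-aig-cache-key`, which should be checked against the AI Gateway caching documentation; the `CacheOptions` shape and the function name are hypothetical.

```typescript
// Hypothetical options shape for illustration.
interface CacheOptions {
  ttlSeconds?: number; // how long the gateway may cache the response
  skipCache?: boolean; // bypass the cache for this request
  cacheKey?: string; // custom key to group equivalent requests
}

// Builds per-request cache headers. Header names are assumptions to verify
// against the AI Gateway caching documentation.
function buildCacheHeaders(options: CacheOptions): Record<string, string> {
  const headers: Record<string, string> = {};
  if (options.skipCache) {
    // Skipping the cache makes TTL and key irrelevant for this request.
    headers['cf-aig-skip-cache'] = 'true';
    return headers;
  }
  if (options.ttlSeconds !== undefined) {
    headers['cf-aig-cache-ttl'] = String(options.ttlSeconds);
  }
  if (options.cacheKey) {
    headers['cf-aig-cache-key'] = options.cacheKey;
  }
  return headers;
}
```

For the two cases mentioned above, `buildCacheHeaders({ ttlSeconds: 3600 })` would suit an expensive query and `buildCacheHeaders({ skipCache: true })` would suit real-time data.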
API Reference
env.AI.run(
  model: string,
  inputs: ModelInputs,
  options?: { gateway?: { id: string; skipCache?: boolean } }
): Prom