What is the Caveman skill?
The Caveman skill is an open agent skill (JuliusBrussee/caveman) that constrains how much filler prose assistants emit (lite, full, and ultra modes) while keeping code blocks and technical payloads intact. Install it like any other skill from the registry; it complements prompt caching, batch APIs, and model routing rather than replacing them.
Bottom line (April 2026): public API rate cards from OpenAI and Anthropic still charge more per token for output than for input on flagship coding models, and agent pipelines multiply every wasted completion across later turns. The Caveman skill targets low-value prose, not semantics. Recent preprint work (MD Azizul Hakim, arXiv:2604.00025, 11 Mar 2026) ties scale-dependent verbosity to benchmark errors and shows brevity constraints can recover large-model advantages.
Why this post exists
Most writing about LLM cost optimization stops at:
- "Use a cheaper model."
- "Make prompts shorter."
Both help, but they skip the systems view:
- how per-token economics evolved from early GPT-4-class APIs to 2026 frontier listings
- why output and carried conversation state dominate many coding and agent bills
- when shorter answers are a reliability lever, not only a budget lever
Caveman is the concrete example; the through-line is token economics and measurement.
First principles: what you are actually paying for
For commercial APIs, cost is usually a weighted sum of:
- Input tokens (full-price vs cached input where the provider supports reuse)
- Output tokens (often 1.25–6× input price on comparable tiers—exact ratio depends on model and vendor)
- Tool charges (hosted search, code execution, retrieval)
- Tier modifiers (batch/async discounts, flex vs “priority,” data residency uplifts)
The expensive mistake is treating “tokens” as one scalar. Buckets and multipliers differ; optimizations that trim output help most when output is priced highest.
Token cost history: anchors that still matter
Public archives and current listings tell a three-act story:
- Early frontier (2023): strongest general models shipped at tens of dollars per million input tokens; OpenAI's archived tables include gpt-4-0613 at $30 / 1M input and $60 / 1M output.
- Efficiency wave (2024–2025): multimodal and "mini" classes pushed routine work toward sub-$5 / 1M input territory, e.g. gpt-4o-2024-05-13 at $5 / $15 per 1M and gpt-4o-mini announced at $0.15 / $0.60 (OpenAI, July 18, 2024).
- 2026 frontier snapshot: flagship SKUs remain output-heavy even as quality improves. As of April 2026, OpenAI's published API pricing shows GPT-5.4 at $2.50 / 1M input, $0.25 / 1M cached input, and $15.00 / 1M output (standard rates under 270K context); GPT-5.4 mini at $0.75 / $0.075 / $4.50; and GPT-5.4 nano at $0.20 / $0.02 / $1.25. The same page notes that the Batch API saves 50% on eligible input and output, web search costs $10.00 per 1,000 calls (search content tokens listed as free), and a +10% uplift applies to certain data-residency / regional endpoints for models released after March 5, 2026.
On Anthropic’s side, the April 2026 model pricing table lists, for example, Claude Sonnet 4.6 at $3 / 1M input and $15 / 1M output, Claude Opus 4.6 at $5 / $25, and Claude Haiku 4.5 at $1 / $5, with cache reads at 0.1× base input after a cache write and Batch API at 50% off both input and output for supported workloads—so the same “shrink repeated context + cut output” playbook applies across vendors.
Net: unit costs fell, but output‑token weight and tooled workflows keep waste material.
Why verbosity still hurts after price drops
Agentic coding stacks often chain:
- planner call
- router / tool-selection call
- patch generation
- explanation or review
- retry loops
If each hop adds 20–40% conversational padding, you pay repeatedly in:
- downstream input: prior verbose turns become context on the next call
- latency and review drag
- error surface: filler correlates with contradiction and “helpful” hedging
Tokenization: why “word count” misleads
OpenAI’s consumer docs still use handy English heuristics: ~4 characters ≈ 1 token and ~75 words ≈ 100 tokens (see What are tokens?). Production caveats:
- Language and script change token efficiency.
- JSON, markdown fences, stack traces, and tool envelopes inflate tokens versus what humans “see.”
So a “short” visible answer can still be a large billed payload.
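The heuristics above can be turned into a quick estimator. A minimal sketch, assuming the ~4-characters-per-token English rule of thumb from OpenAI's token docs; the sample strings are illustrative, and a real tokenizer such as tiktoken will give different counts by model, language, and script:

```python
# Rough billed-token estimate from raw character count, using the
# ~4 characters ≈ 1 token heuristic (English-only approximation).
def approx_tokens(text: str) -> int:
    """Estimate token count; real BPE tokenizers will differ."""
    return max(1, round(len(text) / 4))

# The same visible instruction, bare vs. wrapped in a tool/message envelope.
plain = "Refactor the parser to handle nested brackets."
wrapped = '{"role": "assistant", "content": "Refactor the parser to handle nested brackets."}'

for label, text in [("plain", plain), ("json-wrapped", wrapped)]:
    print(f"{label}: {len(text.split())} words -> ~{approx_tokens(text)} tokens")
```

The JSON envelope adds no visible words yet inflates the estimate materially, which is exactly why agent pipelines bill more than their transcripts suggest.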
Research note: brevity as an intervention, not just aesthetics
In “Brevity Constraints Reverse Performance Hierarchies in Language Models” (MD Azizul Hakim; submitted 11 Mar 2026; arXiv:2604.00025), the author evaluates 31 models (~0.5B–405B) on 1,485 problems and reports that on 7.7% of items across five datasets, larger models trail smaller ones by 28.4 percentage points, a pattern attributed in part to verbosity-induced overelaboration. Causal interventions with brevity constraints raise large-model accuracy by ~26 percentage points and invert prior hierarchies on math and science subsets, with 7.7–15.9 point swings—supporting the deployment idea that prompt shape is a first-class control, not cosmetic.
Where Caveman fits
Caveman (see the Caveman skill on ExplainX and the project site) is a response-style constraint layer for agentic CLIs: modes like lite, full, ultra, add-ons for terse commits/reviews, and caveman-compress for shrinking session memory-style inputs. Architecturally it targets communication overhead, not reasoning capability—compress surface language, preserve semantic payload, measure quality.
On ExplainX: explore the full skills registry; for content-facing agent playbooks see SEO + GEO agent skills; to list your own skill, register and use the submission flow.
Cost math: a sanity model
Use:
monthly_cost ≈ Σ (input_tokens × input_rate + output_tokens × output_rate + tool_fees)
If style changes cut output tokens by fraction r without harming task success:
output_savings ≈ monthly_output_tokens × output_rate × r
Example: 200M output tokens/month at $10 / 1M output with r = 0.35 → $700/month saved on that slice alone—before counting downstream input shrinkage.
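The sanity model is trivial to encode. A sketch of the two formulas above; rates are per 1M tokens and the numbers simply replay the worked example:

```python
# Sanity model: monthly_cost ≈ Σ (input + output token spend + tool fees).
def monthly_cost(input_tokens: float, output_tokens: float,
                 input_rate: float, output_rate: float,
                 tool_fees: float = 0.0) -> float:
    """Rates are USD per 1M tokens; tool_fees is a flat monthly charge."""
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate + tool_fees

def output_savings(monthly_output_tokens: float, output_rate: float, r: float) -> float:
    """Savings from cutting output tokens by fraction r at unchanged task success."""
    return (monthly_output_tokens / 1e6) * output_rate * r

# Worked example from the text: 200M output tokens/month, $10 / 1M output, r = 0.35.
savings = output_savings(200e6, 10.0, 0.35)
print(f"monthly savings on output slice: ${savings:,.0f}")
```

Note the model deliberately excludes second-order gains: terser completions also shrink the input side of every later turn that re-ingests them.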
Platform mechanics teams overlook
Three levers compound with terse defaults:
- Cached / repeated system or document context (OpenAI cached-input rows; Anthropic cache hits at 0.1× base input after writes).
- Batch / async lanes (both vendors advertise 50% token discounts for eligible batch workloads in their public pricing docs as of April 2026).
- Model routing—frontier models only on high-ambiguity steps; mini / nano / Haiku-class for transforms and scaffolding.
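The routing lever can be sketched as a small policy function. The tier names, task families, and ambiguity threshold below are hypothetical illustrations, not real SKU identifiers or a vendor API:

```python
# Hypothetical routing policy: reserve the frontier tier for high-ambiguity
# steps, and send mechanical transforms to cheaper lanes.
def route(task_kind: str, ambiguity: float) -> str:
    """Return a pricing tier for one agent step (names are illustrative)."""
    if task_kind in {"architecture", "debug"} or ambiguity > 0.7:
        return "frontier"   # highest quality, highest output rate
    if task_kind in {"refactor", "review"}:
        return "mini"       # mid-tier for structured transforms
    return "nano"           # cheapest lane for scaffolding and boilerplate

print(route("architecture", 0.2))
print(route("format", 0.1))
```

In practice the ambiguity signal might come from a cheap classifier call or from retry history; the point is that routing composes with caching and terse defaults rather than replacing them.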
Deployment playbook
- Baseline three regimes: default prompting; manual “be concise”; Caveman-style (or equivalent system policy).
- Jointly track cost, latency, and task success—not cost alone.
- Slice metrics by task family (debug, refactor, architecture, review).
- Keep an “expand” escape hatch (explain more, verbose sub-agent).
- Default terse where safe; escalate detail when confidence is low or stakeholders require auditability.
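The baseline step above amounts to logging three metrics per regime and task family. A minimal sketch with made-up numbers, showing the joint cost / latency / success view the playbook calls for:

```python
# Track (cost, latency, success) per prompting regime and task family,
# so brevity changes are judged jointly rather than on cost alone.
from collections import defaultdict

runs = defaultdict(list)  # (regime, family) -> list of (cost_usd, latency_s, success)

def record(regime: str, family: str, cost_usd: float, latency_s: float, success: int):
    """Log one task attempt (success is 0 or 1)."""
    runs[(regime, family)].append((cost_usd, latency_s, success))

def summary(regime: str, family: str) -> dict:
    """Mean cost, mean latency, and success rate for one slice."""
    rows = runs[(regime, family)]
    n = len(rows)
    return {
        "mean_cost_usd": sum(r[0] for r in rows) / n,
        "mean_latency_s": sum(r[1] for r in rows) / n,
        "success_rate": sum(r[2] for r in rows) / n,
    }

# Hypothetical numbers for two regimes on the "debug" slice.
record("default", "debug", 0.042, 11.0, 1)
record("default", "debug", 0.050, 12.5, 0)
record("caveman", "debug", 0.019, 6.5, 1)
record("caveman", "debug", 0.021, 7.0, 1)
```

If the terse regime's success rate holds while cost and latency drop, the style change pays for itself; if success degrades on a slice, that slice keeps verbose defaults.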
Failure modes
Brevity-first defaults fail when legal / compliance language must be explicit, learners need expository depth, or traceability belongs inside the reply. Use selective verbosity, not universal.
Caveman as pattern, not meme
Narrowly: “funny terse mode.” Properly: measured surface compression paired with routing, caching, and evaluation—now part of cost-aware agent design.
FAQ (same topics as structured data above)
These answers mirror the FAQ block in this page’s metadata (for search and AI overviews). Caveman skill install: explainx.ai/skills/JuliusBrussee/caveman/caveman.
- Cost drivers (2026): input + output tokens (output often higher per token), cached input discounts, batch/flex tiers, tools (e.g. OpenAI web search $10 / 1,000 calls), regional uplifts (+10% on some OpenAI endpoints for post–Mar 5, 2026 models).
- Why verbosity still hurts: chained agents re-ingest prior completions as context; Hakim (arXiv:2604.00025) shows brevity constraints can raise large-model accuracy on part of the benchmark set by ~26 points.
- When not to default terse: compliance narrative, training depth, or audit text that must live in the reply—use route- or audience-specific verbosity.
Related links
- Caveman on ExplainX: explainx.ai/skills/JuliusBrussee/caveman/caveman
- Caveman repository: github.com/JuliusBrussee/caveman
- Caveman documentation site: juliusbrussee.github.io/caveman
Sources
- OpenAI API pricing (GPT-5.4 family, Batch API, web search, regional surcharge note; retrieved Apr 2026)
- OpenAI platform pricing documentation (historical and versioned tables)
- OpenAI: GPT-4o mini announcement (July 18, 2024)
- OpenAI: what are tokens
- Anthropic Claude pricing (model table, caching, batch; retrieved Apr 2026)
- Anthropic token counting
- tiktoken (BPE reference)
- MD Azizul Hakim, Brevity Constraints Reverse Performance Hierarchies in Language Models, arXiv:2604.00025, submitted 11 Mar 2026. Abstract