← Blog
explainx / blog

What are tokens? A plain guide to how LLMs count (and charge for) text

Tokens are the standard units large language models use to read and generate text. Here is what they are, how they differ from words, why input and output are billed separately, and how they connect to context limits, subscriptions, and API pricing—without the jargon pile-on.

17 min readYash Thakker
LLM basicsTokensAI pricingPrompt engineeringAnthropicOpenAI

MDX restores the committed source plus an HTML comment attribution; plain text bundles the rendered markdown body with the explainx.ai attribution footer.

What are tokens? A plain guide to how LLMs count (and charge for) text

If you have ever read a doc that says "32k context" or "$2.50 per million input tokens" and only half-trusted your mental model, this article is the missing layer: what a token is, why providers count them, and how that connects to limits, bills, and rate limits.

Scope: this is a concepts guide. For dollar math, prompt caching, and agent pipelines, read Caveman skill: token economics and API pricing next.


Tokens are not the same as words

In daily language we count words. Under the hood, a large language model consumes a sequence of tokens: integer IDs from a fixed vocabulary, produced by a tokenizer (families you will see in papers include BPE, WordPiece, and vendor-specific schemes).

  • A token can be a short whole word (e.g. hello might be one token).
  • A token can be a subword — long or rare strings are often split into several pieces.
  • Punctuation, spaces, and code are also encoded as one or more tokens. Code and JSON are often longer in token count than a casual glance suggests, because braces, semicolons, and indentation are all billed like anything else.
Live Bootcamp6 weeks

Complete AI Builder Bootcamp

Claude, Python automation & full-stack — 12 live sessions with Yash Thakker.

View bootcamp

The Complete AI Builder Bootcamp is the best AI development course for learning Claude AI, prompt engineering, Python automation, and full-stack web development. This intensive 6-week live bootcamp teaches you how to build AI-powered applications using Claude Projects, Claude Artifacts, Claude Code, and the complete Claude ecosystem. You'll master prompt engineering techniques, learn to create custom Claude connectors and MCP integrations, build Python automation workflows, develop full-stack websites with AI assistance, and create AI marketing agents.

The bootcamp includes 12 live Zoom sessions with Yash Thakker, founder of AISOLO Technologies and instructor to 350,000+ students. You'll build 8+ portfolio projects including AI playbooks, full-stack note-taking applications, Python automation scripts, marketing agents, and personal portfolio websites. The curriculum covers AI fundamentals, Claude Projects and Artifacts, Claude Co-work, Claude plugins and skills, Claude Code for Python development, full-stack development, AI marketing, and capstone projects.

Students receive 1-year access to all recordings, permanent Discord community access, a certificate of completion, and personalized career guidance. All enrollments include a 7-day money-back guarantee. This is the most comprehensive Claude AI bootcamp available, taking students from zero AI knowledge to expert AI builder in 6 weeks.

Why it matters: a "short" line in the editor can still be thousands of tokens once the app attaches system instructions, open files, tool schemas, and prior turns.

Heuristics (English prose, ballpark only): people often use ~4 characters per token, or on the order of one token per ¾ of a word. Do not use heuristics for billing—use the provider's tokenizer or usage dashboard for the model you run.

How Tokenization Really Works

When you send text to an LLM, the tokenizer breaks it down using Byte Pair Encoding (BPE) or similar algorithms:

  1. Start with a vocabulary of common subword units learned during training
  2. Match the longest possible sequences from the vocabulary
  3. Convert to integer IDs that the model processes
  4. Each token typically represents 3-4 characters in English

For example, the sentence "Understanding tokenization" might become:

  • ["Under", "stand", "ing", " token", "ization"] (5 tokens)
  • Or [8640, 1302, 287, 11241, 1634] as the model sees it

Different models use different tokenizers, which is why the same text might be:

  • 100 tokens in GPT-4
  • 105 tokens in Claude
  • 98 tokens in Llama

Interactive Token Visualizer

Want to see how your text is tokenized? Try our interactive visualizer below to understand token boundaries and compare costs across different models:


Input vs output tokens

KindWhat countsIntuition
Input (prompt) tokensSystem prompt, your message, full chat history the client sends, retrieved documents, tool parameters and tool results, images (often a separate budget), etc.Everything the model must read to respond.
Output (completion) tokensThe model's generated text (and sometimes separate billed fields, depending on product).Everything the model writes.

Two common surprises:

  1. "I only typed one sentence." The service may still include all prior turns and in-scope files in the request—input can be huge compared to your last line.
  2. Long replies compound: output tokens in turn become input on the next turn, so verbosity in chat and agent loops can inflate both sides of the ledger.

On frontier models, output is often priced higher per token than input—see each vendor's rate card (e.g. OpenAI, Anthropic).

Why Output Costs More

Output tokens are typically 3-5x more expensive than input tokens:

  • GPT-4 Turbo: $10/1M input, $30/1M output (3x)
  • Claude 3.5 Sonnet: $3/1M input, $15/1M output (5x)
  • GPT-3.5 Turbo: $0.50/1M input, $1.50/1M output (3x)

Reasons for the price difference:

  1. Generation is computationally different from reading—each output token requires a full forward pass through the model
  2. Prevents abuse of long-winded responses that would otherwise be "free"
  3. Encourages concise outputs which improve user experience and reduce latency
  4. Reflects actual compute costs of autoregressive generation

Context window: how many tokens fit in one go

The context window (e.g. 128k or 1M in marketing tables) is the maximum combined budget the model is built to process in a single request: your input plus the room reserved for the reply (how the split is defined depends on the API—read the spec for your model).

  • If you exceed the limit, the system may error, truncate early content, or summarizebehavior is not uniform across products.
  • A larger window is not a free pass: it means bigger prompts are possible, which can mean higher API cost or faster burn through subscription credits if the app sends whole trees or long histories by default.

Context Window Comparison (2026)

ModelContext WindowUse Case
GPT-4 Turbo128k tokens~96,000 words or ~300 pages
Claude 3.5 Sonnet200k tokens~150,000 words or ~470 pages
Gemini 1.5 Pro2M tokens~1.5M words or ~4,700 pages
GPT-3.5 Turbo16k tokens~12,000 words or ~37 pages
Llama 3 70B8k tokens~6,000 words or ~18 pages

Important: Just because a model can handle 2M tokens doesn't mean you should use them all:

  • Latency increases with context size
  • Costs scale linearly with input tokens
  • Attention degradation can occur with very long contexts (the "lost in the middle" problem)

Why billing uses tokens (not pages or words)

  1. The model is literally trained and served as a function over token sequences—that is the native interface to the stack.
  2. Token count tracks compute and memory use more consistently than "words" across languages, markup, and code.
  3. Vendors can publish a single table—$/million input and $/million output—that scales with workload size.

You can still plan in paragraphs and files; the invoice will still speak in tokens.

Token Counts by Content Type

Different types of content have wildly different token densities:

Content TypeCharacters per TokenExample
English prose~4 chars"The quick brown fox" = ~4-5 tokens
Code (Python)~3 charsdef hello(): = ~5-6 tokens
JSON data~2.5 chars{"name":"John"} = ~8-10 tokens
Chinese text~1.5 chars"你好世界" = ~6-8 tokens
Compressed/Base64~1.5 charsVery token-heavy

Takeaway: Code and structured data consume tokens faster than you might expect. A 1000-character JSON payload might use 400+ tokens.


"Cached" input (one paragraph)

Some APIs discount long unchanged prefixes of a prompt when they qualify for cached or reused input (rules differ by provider). The idea: if most of an agent's prompt is a stable system block plus tool definitions, you pay less for that slice on the next call when caching hits. For production patterns, see the Caveman post and your vendor's prompt caching documentation.

How Prompt Caching Works

Anthropic's Claude offers prompt caching with dramatic savings:

  • First request: Full input cost ($3/1M tokens for Claude 3.5 Sonnet)
  • Cached requests: $0.30/1M tokens (10x cheaper!) for cached portion
  • Cache duration: ~5 minutes of inactivity

Example savings:

Without caching:
- 50,000 token system prompt + tools = $0.15 per request
- 100 requests = $15.00

With caching:
- First request: $0.15
- Next 99 requests: $0.015 each = $1.485
- Total: $1.635 (89% savings!)

Requirements:

  • Cached prefix must be ≥1024 tokens
  • Must be sent in the same order each time
  • Cache expires after ~5 minutes of inactivity

Subscriptions vs APIs

  • Chat and IDE products often show "messages" or a single usage meter. Underneath, that still maps to model calls and token-like budgets you may not see line by line.
  • API usage pages usually show per-request or per-month token totals, which is closer to marginal cost modeling for an app you ship.

Either way, the scarce resource in aggregate is tokens over time (and provider capacity), which is where rate limits and plan tiers come from.

Subscription vs API: Cost Comparison

Plan TypeExampleToken BudgetBest For
ChatGPT Plus$20/month"Unlimited" with capsCasual users, learning
Claude Pro$20/month5x more usage than freePower users, research
API Pay-as-you-goVariableUnlimited, billed per tokenProduction apps
EnterpriseCustomCustom quotas + SLATeams, mission-critical

Hidden truth: Subscription plans have soft limits enforced by:

  • Rate limits (e.g., 40 messages per 3 hours)
  • Usage caps that reset monthly
  • Throttling during peak hours
  • Different model access tiers

For developers: If you're building an app, API access gives you:

  • Transparent per-token pricing
  • Higher rate limits
  • Programmatic access
  • Fine-grained usage tracking

Practical habits

  1. Measure with your real stack: provider usage APIs, IDE panels, or token counters in CI.
  2. Trim what you add to every turn—large readmes and logs belong behind retrieval or on-demand file reads, not by default in global context, unless you truly need them every time.
  3. Prefer structured, reusable instructions (agent skills and templates) over pasting the same long preamble each session.

Advanced Token Optimization Strategies

1. System Prompt Compression

  • Use abbreviations in internal instructions (the model understands)
  • Remove redundant examples (2-3 good examples > 10 mediocre ones)
  • Leverage few-shot learning sparingly

2. Context Management

  • Implement sliding window for chat history (keep last N turns)
  • Use summarization for old conversations
  • Store embeddings instead of raw text for retrieval

3. Response Control

  • Set max_tokens limits to prevent rambling
  • Use stop sequences to end generation early
  • Request structured outputs (JSON, bullets) which are often shorter

4. Caching Everything You Can

  • Cache system prompts that don't change
  • Cache tool definitions across calls
  • Cache few-shot examples and references

Token Counting Tools

Before deploying, test your actual token usage:

OpenAI Models (GPT-3.5, GPT-4):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode("Your text here")
print(f"Token count: {len(tokens)}")

Claude Models:

from anthropic import Anthropic

client = Anthropic(api_key="your-key")
count = client.count_tokens("Your text here")
print(f"Token count: {count}")

Online Tools:


Real-World Token Economics

Case Study: Chat Application

Let's calculate costs for a typical customer support chatbot:

Assumptions:

  • 1,000 conversations per day
  • Average: 500 input tokens + 200 output tokens per message
  • 5 messages per conversation
  • Using GPT-4 Turbo ($10 input, $30 output per 1M tokens)

Daily costs:

Input: 1,000 × 5 × 500 = 2.5M tokens
Output: 1,000 × 5 × 200 = 1M tokens

Input cost: 2.5M × $10/1M = $25
Output cost: 1M × $30/1M = $30
Total per day: $55

Monthly: $55 × 30 = $1,650

With optimizations:

  • Caching system prompt (500 tokens) → Save ~$12.50/day
  • Compress to 3 messages history → Save ~$10/day
  • Use GPT-3.5 for simple queries (70%) → Save ~$28/day

Optimized monthly cost: ~$140 (91% savings)

Case Study: Code Documentation Generator

Scenario: Generate docs for 100 repositories

Per repo:

  • Input: 50k tokens (code) + 5k tokens (instructions) = 55k tokens
  • Output: 10k tokens (docs)

Using Claude 3.5 Sonnet ($3 input, $15 output):

Input: 100 × 55k = 5.5M tokens → $16.50
Output: 100 × 10k = 1M tokens → $15.00
Total: $31.50 for all 100 repos

Alternative with Haiku ($0.25 input, $1.25 output):

Total: $2.63 for all 100 repos

Lesson: Match model power to task complexity. Documentation generation doesn't need Sonnet's reasoning.


Common Token Mistakes (and How to Avoid Them)

Mistake #1: Not counting system prompts

Problem: "My prompt is only 50 tokens but I'm billed for 1,500!"

Reality: Your app likely includes:

  • System prompt: 800 tokens
  • Tool definitions: 600 tokens
  • Your message: 50 tokens
  • Total input: 1,450 tokens

Fix: Use console.log or debug mode to see full prompts sent to API.

Mistake #2: Exponential growth in chat apps

Problem: Each message adds both input AND output from previous turn.

What happens:

  • Turn 1: 100 input + 80 output = 180 tokens
  • Turn 2: 100 + 180 (history) + 80 output = 360 tokens
  • Turn 3: 100 + 360 (history) + 80 output = 540 tokens
  • Turn 10: 4,500+ tokens per request

Fix: Implement sliding window or summarization after N turns.

Mistake #3: Sending code files without chunking

Problem: A 10,000-line Python file = ~40,000 tokens

Reality: Most models can't meaningfully process files that large. Attention degrades.

Fix: Use retrieval, chunking, or selective file reading.

Mistake #4: Ignoring cached pricing

Problem: Paying full price when 90% of prompt is identical across calls.

Fix: Structure prompts to put stable content (system prompt, tools) in cacheable prefix.


Deep Dive: The Tokenization Algorithm

Understanding how tokenization actually works helps you write more token-efficient prompts.

Byte Pair Encoding (BPE) Explained

Most modern LLMs use Byte Pair Encoding, invented for text compression and adapted for NLP:

How BPE builds a vocabulary:

  1. Start with all bytes (256 base symbols)
  2. Find the most common byte pair in training data
  3. Merge it into a new token and add to vocabulary
  4. Repeat for N iterations (typically 50k-100k merges)

Example of BPE learning:

Initial: ["t", "h", "e", " ", "q", "u", "i", "c", "k"]
Most common pair: "t" + "h" → merge to "th"
Next: "th" + "e" → merge to "the"
Result: Common words become single tokens

Why this matters:

  • Common words (the, and, is) → 1 token
  • Common subwords (-ing, -tion, un-) → 1 token
  • Rare words → split into multiple tokens
  • Code patterns (def, import, //) → often 1 token

Vocabulary Size and Its Impact

ModelVocabulary SizeImplications
GPT-250,257 tokensSmaller vocab = more splits = longer sequences
GPT-3/4~100,000 tokensBalanced for multilingual use
Claude~100,000 tokensOptimized for code and reasoning
Llama 232,000 tokensSmaller = faster, but more tokens per text

Larger vocabularies:

  • ✅ Fewer tokens per text (cheaper)
  • ✅ Better rare word handling
  • ❌ Larger embedding tables (more memory)
  • ❌ Slower generation (more vocab to sample from)

Smaller vocabularies:

  • ✅ Faster inference
  • ✅ Smaller model files
  • ❌ More tokens per text (more expensive)
  • ❌ Worse rare word handling

Cross-Language Token Efficiency

Token efficiency varies dramatically by language:

Token Cost by Language (relative to English)

LanguageTokens per WordExample Cost Multiplier
English1.0x baseline$10 per 1M words
Spanish1.2x$12 per 1M words
French1.3x$13 per 1M words
German1.4x$14 per 1M words (compound words split more)
Russian1.5x$15 per 1M words (Cyrillic less common in training)
Arabic1.7x$17 per 1M words
Chinese2.0x$20 per 1M words (each character often 1+ tokens)
Japanese2.2x$22 per 1M words (mixing scripts compounds issue)
Korean2.5x$25 per 1M words
Thai3.0x$30 per 1M words (no spaces = poor tokenization)

Why this happens:

  1. Training data bias: Models trained predominantly on English develop English-optimized vocabularies
  2. Character density: Languages using non-Latin scripts get fewer characters per token
  3. Morphology: Agglutinative languages (Turkish, Finnish) create longer word forms
  4. Writing systems: Languages without spaces (Thai, Chinese) split poorly

Real-world impact:

A Thai company using Claude for customer support might pay 3x more per conversation than a US company with identical usage patterns.

Code vs Natural Language

Token efficiency also varies by programming language:

LanguageChars per TokenWhy
Python~3.2Concise syntax, common in training
JavaScript~3.5Similar to Python
Java~2.8Verbose syntax, many keywords
C++~2.6Template syntax, operators
JSON~2.2Braces, quotes, commas each add tokens
YAML~3.0Indentation and colons
SQL~3.5Keywords well-represented

Optimization tip: When sending structured data to LLMs:

  • Prefer JSON for machine parsing (even if token-heavy)
  • Use markdown tables for small datasets the model should read
  • Use CSV for token efficiency with tabular data
  • Avoid XML (most token-inefficient format)

Advanced Token Optimization Playbook

Strategy 1: Prompt Compression Techniques

Before compression (expensive):

You are a helpful AI assistant. Please analyze the following
customer feedback and extract key themes, sentiment, and
actionable insights. Be thorough and detailed in your analysis.
Provide specific examples from the feedback to support your
findings. Format your response with clear headings and bullet points.

Customer feedback: [5000 words of feedback]

Tokens: ~1,400

After compression (cheap):

Analyze feedback. Extract: themes, sentiment, actions.
Use examples. Format: headings, bullets.

[5000 words of feedback]

Tokens: ~1,280 (9% savings)

Aggressive compression:

Extract themes+sentiment+actions from feedback below.
Examples+bullets.

[5000 words of feedback]

Tokens: ~1,260 (10% savings)

Key insight: LLMs understand abbreviated instructions. Save verbose explanations for end users.

Strategy 2: Dynamic Context Windowing

Instead of sending full chat history, implement intelligent windowing:

def get_relevant_context(messages, max_tokens=4000):
    """Keep most recent + most relevant messages within budget"""

    # Always keep system prompt + last 2 messages
    core_messages = [messages[0], messages[-2], messages[-1]]
    core_tokens = count_tokens(core_messages)

    remaining_budget = max_tokens - core_tokens

    # Add older messages by relevance score
    relevant_old = rank_by_relevance(
        messages[1:-2],
        query=messages[-1]
    )

    for msg in relevant_old:
        msg_tokens = count_tokens(msg)
        if msg_tokens <= remaining_budget:
            core_messages.insert(-2, msg)
            remaining_budget -= msg_tokens
        else:
            break

    return core_messages

Savings: 40-70% on long conversations while maintaining quality.

Strategy 3: Streaming and Truncation

Set aggressive max_tokens limits and use streaming to stop early:

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=messages,
    max_tokens=500,  # Hard cap
    stream=True,
    stop=["\n\n\n", "---", "In summary"]  # Stop sequences
)

for chunk in response:
    content = chunk.choices[0].delta.content
    if should_stop_early(content):
        break  # Stop streaming
    print(content, end="")

Savings: 20-50% on output costs by preventing rambling.

Strategy 4: Model Routing

Route requests to cheapest capable model:

def route_request(query, context):
    complexity = assess_complexity(query)

    if complexity == "simple":
        # Use cheapest model
        return call_gpt35(query, context)
    elif complexity == "medium":
        # Use mid-tier
        return call_claude_haiku(query, context)
    else:
        # Use premium only when needed
        return call_gpt4(query, context)

Example routing rules:

  • FAQ/simple questions → GPT-3.5 ($0.50 input)
  • Analysis/summarization → Claude Haiku ($0.25 input)
  • Complex reasoning → GPT-4 ($10 input)
  • Code generation → Claude Sonnet ($3 input)

Savings: 60-80% on mixed workloads.


Token Efficiency Patterns by Use Case

Use Case 1: Customer Support Chatbot

Anti-pattern:

# Sends entire knowledge base every time
system_prompt = load_entire_kb()  # 50k tokens

Optimized pattern:

# Semantic search for relevant docs only
relevant_docs = vector_search(query, top_k=3)  # ~2k tokens
system_prompt = f"Use these docs:\n{relevant_docs}"

Savings: 96% on input tokens

Use Case 2: Code Documentation Generator

Anti-pattern:

# Send entire codebase
prompt = f"Document this:\n{entire_repo}"  # 500k+ tokens

Optimized pattern:

# Process file by file with caching
cached_prefix = """You are a code documentor.
Style: concise, examples, JSDoc format."""

for file in files:
    response = generate(
        cached_prefix=cached_prefix,  # Cached!
        prompt=f"Document:\n{file}"   # Only new content
    )

Savings: 90%+ with prompt caching

Use Case 3: Content Moderation

Anti-pattern:

# Use GPT-4 for every message
result = gpt4_moderate(message)  # $10 per 1M tokens

Optimized pattern:

# Fast filter + selective GPT-4
if simple_filter(message):  # Regex, keyword lists
    return "safe"
elif likely_violation(message):  # ML classifier
    return gpt4_moderate(message)  # Only edge cases

Savings: 95%+ by filtering obvious cases


Measuring and Monitoring Token Usage

Essential Metrics to Track

  1. Tokens per Request (TPR)

    • Input TPR: avg input tokens across all requests
    • Output TPR: avg output tokens
    • Target: Establish baseline, reduce by 20% over 3 months
  2. Token Efficiency Ratio (TER)

    TER = Useful Output Characters / Total Tokens
    
    • Measures information density
    • Higher = more efficient
    • Target: TER > 3.0 for production apps
  3. Cost per User Interaction (CPUI)

    CPUI = Total Token Cost / Number of Interactions
    
    • Normalizes across varying conversation lengths
    • Target: CPUI < $0.01 for most B2C apps
  4. Cache Hit Rate

    Hit Rate = Cached Tokens / Total Input Tokens
    
    • Only relevant if using prompt caching
    • Target: >70% for stable system prompts

Monitoring Dashboard Example

import anthropic
from datetime import datetime, timedelta

client = anthropic.Anthropic()

def get_token_analytics(days=7):
    end = datetime.now()
    start = end - timedelta(days=days)

    # Fetch usage data
    usage = client.usage.list(
        start_date=start.isoformat(),
        end_date=end.isoformat()
    )

    total_input = sum(u.input_tokens for u in usage)
    total_output = sum(u.output_tokens for u in usage)
    total_cached = sum(u.cached_tokens for u in usage)

    input_cost = total_input * 3.00 / 1_000_000
    output_cost = total_output * 15.00 / 1_000_000
    cache_savings = total_cached * 2.70 / 1_000_000

    return {
        "total_tokens": total_input + total_output,
        "input_tokens": total_input,
        "output_tokens": total_output,
        "cached_tokens": total_cached,
        "total_cost": input_cost + output_cost,
        "cache_savings": cache_savings,
        "cache_hit_rate": total_cached / total_input if total_input > 0 else 0
    }

Alert Rules

Set up alerts for:

  • Spike alerts: Token usage > 3x daily average
  • Cost alerts: Daily spend > $X threshold
  • Efficiency alerts: TER drops below threshold
  • Cache alerts: Hit rate drops below 50%

Future of Tokenization

Emerging Trends (2026 and Beyond)

1. Character-level models

  • Eliminate tokenization entirely
  • Process raw bytes
  • More expensive but more flexible
  • Example: Google's ByT5

2. Multimodal tokenization

  • Unified tokens for text, images, audio, video
  • Example: GPT-4o uses ~170 tokens per image
  • Future: More efficient image encoding

3. Adaptive tokenization

  • Model learns optimal tokenization per language/domain
  • Reduces multilingual tax
  • Research stage, not production yet

4. Token-free billing

  • Some providers experimenting with time-based pricing
  • Example: "GPU-seconds" instead of tokens
  • Removes optimization incentives (good or bad?)

Best Practices That Will Last

Even as tokenization evolves, these principles remain:

  1. Measure everything - You can't optimize what you don't measure
  2. Cache aggressively - Stable content should be cached
  3. Match model to task - Don't use GPT-4 for simple tasks
  4. Compress prompts - Models understand terse instructions
  5. Monitor costs - Set alerts before bills surprise you

Frequently Asked Questions (Expanded)

Does whitespace count as tokens?

Yes. Spaces, tabs, and newlines are tokenized:

  • Single space: usually part of next word's token
  • Multiple spaces: can be 1-2 tokens
  • Newlines: typically 1 token each
  • Indentation: multiple tokens in code

Optimization tip: Minimize unnecessary whitespace in prompts:

# Bad: 145 tokens
prompt = """
Please analyze this text:

    [Text here]

And provide insights.
"""

# Good: 138 tokens
prompt = "Analyze this text:\n[Text here]\nProvide insights."

Can I save tokens by using abbreviations?

Yes, but carefully:

  • ✅ Common abbreviations: LLM understands "doc", "msg", "txt"
  • ✅ Domain-specific: "API", "SQL", "HTTP" are fine
  • ⚠️ Custom abbreviations: May confuse model or hurt quality
  • ❌ Over-compression: "extr snmnt frm fb" hurts understanding

Test before deploying. Quality matters more than token savings.

Why does my token count differ from the API?

Common causes:

  1. Different tokenizers: GPT-3.5 vs GPT-4 vs Claude
  2. Special tokens: System markers, message boundaries
  3. Invisible formatting: Chat format wrapping
  4. Tool calls: Function definitions add hidden tokens

Solution: Always use official tokenizer for your model:

  • OpenAI: tiktoken library
  • Anthropic: client.count_tokens() method
  • Don't rely on word counts or estimates

What happens if I exceed the context window?

Behavior varies by provider:

OpenAI:

  • Returns error: maximum context length exceeded
  • Request fails, you're not charged
  • Must reduce input or increase max_output

Anthropic:

  • Truncates input (usually oldest messages)
  • Request succeeds but quality degrades
  • You're charged for what was processed

Best practice: Track token counts before sending requests.


Read next


Quick Reference: Token Math Cheat Sheet

1 token ≈ 4 characters (English)
1 token ≈ 0.75 words (English)
1 page (500 words) ≈ 650-750 tokens
1,000 tokens ≈ 3-4 paragraphs

Common costs (2026):
GPT-4 Turbo: $10 in / $30 out per 1M
Claude Sonnet: $3 in / $15 out per 1M
GPT-3.5: $0.50 in / $1.50 out per 1M
Claude Haiku: $0.25 in / $1.25 out per 1M

1M tokens ≈ 750k words ≈ 1,500 pages

Tokenizer behavior and plan limits are vendor- and model-specific; always read the current documentation for the product you use.

Related posts