TL;DR

  • Tokens are the unit of cost. A rough estimate is 1 token ~ 4 English characters or about 0.75 words.
  • Claude Haiku 3.5: low-cost entry point for high-volume classification and summarization
  • Claude Sonnet 4: balanced reasoning and speed for production workflows
  • Claude Opus 4.1: premium reasoning tier for the hardest work
  • GPT-5.5 / GPT-5.2 / GPT-5 mini: current OpenAI reasoning and routing tiers
  • Gemini 3.1 Pro / Flash-Lite: Google's current long-context and high-throughput options
  • Grok 4.3: xAI's current flagship reasoning tier
  • Pricing last verified: May 2026. Rates change frequently. This tool is for estimation, not billing reconciliation.
  • Best practice: use the Ledger Lab below to estimate your costs against your own workload.

Why Token Economics Matter

LLMs charge by the token, not the word or the request. If you're building at scale, token efficiency becomes a core part of your product architecture. Small differences in tokenization, model tier, and output budget can change the cost profile of an entire platform.

This guide starts with the fundamentals: tokens, embeddings, and tokenizers. It then shows how the current Claude, OpenAI, Gemini, and Grok model ladders affect cost, and ends with a link to the standalone TokenOps lab for live estimation.

What Are Tokens

LLMs do not read characters or words directly. They read tokens, which are subword units produced by a tokenizer. You pay per token for input and output, so tokenization is the first layer of model economics.

English prose is usually compact, but code, JSON, logs, and structured prompts often expand into more tokens because punctuation, separators, and symbols get split differently. That is why a short-looking prompt can still be expensive.

Token density by content type (approx):
English prose: 25 tokens per 100 chars
Source code: 30-40 tokens per 100 chars
JSON/YAML: 35-45 tokens per 100 chars
Chinese/Korean: 50-80 tokens per 100 chars

The practical implication: if you're processing logs, code snippets, or API responses, your token cost is often higher than plain English text. Token awareness is a design discipline, not an afterthought.
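The density figures above can be turned into a slightly better estimator than a single characters-per-token rule. The sketch below is a planning heuristic, not a tokenizer: the `DENSITY` table uses the midpoints of the ranges listed above, and the function and key names are illustrative assumptions.

```python
# Approximate tokens per 100 characters, taken as midpoints of the ranges
# listed above. These are rough planning figures, not tokenizer output.
DENSITY = {
    "prose": 25.0,
    "code": 35.0,  # midpoint of 30-40
    "json": 40.0,  # midpoint of 35-45
    "cjk": 65.0,   # midpoint of 50-80
}


def estimate_tokens(text: str, content_type: str = "prose") -> int:
    """Estimate a token count from character length and content type."""
    per_100_chars = DENSITY[content_type]
    return round(len(text) * per_100_chars / 100)


payload = '{"user_id": 1234, "status": "active", "roles": ["admin", "dev"]}'
# The same string is estimated higher when treated as JSON than as prose.
print(estimate_tokens(payload, "prose"))
print(estimate_tokens(payload, "json"))
```

An estimator like this is only useful for budgeting; real counts still depend on the provider's tokenizer.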

What Are Embeddings

Once tokenized, each token is converted into an embedding - a numerical vector that captures meaning in context. Embeddings help the model understand relationships between concepts, such as similarity, intent, and dependency.

The context window is the amount of text the model can reason over at one time. Bigger context windows reduce truncation risk, but they also raise the maximum cost of a single request if you fill them.

The cost ceiling for any request is still simple: Input Cost = Input Tokens x Input Price per Token. Output is billed separately, usually at a higher per-token price, because generated tokens are produced one by one.
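As a worked instance of the ceiling formula, the sketch below prices a request that completely fills the context window. The 200,000-token window and the $3.00 per million input tokens rate are placeholder assumptions for illustration, not quoted figures for any model.

```python
def input_cost_ceiling(context_window_tokens: int, input_price_per_mtok: float) -> float:
    """Worst-case input cost in USD for one request that fills the context window."""
    return context_window_tokens * input_price_per_mtok / 1_000_000


# Hypothetical figures: a 200k-token window at $3.00 per million input tokens.
ceiling = input_cost_ceiling(200_000, 3.00)
print(f"Ceiling per request: ${ceiling:.2f}")  # Ceiling per request: $0.60
```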

For production systems, use provider token-count APIs rather than rough character estimates. Content type, tool calls, files, and schemas can all shift the real count.

How Tokenizers Work & Why They Matter

Each provider uses its own tokenizer. OpenAI models use the tiktoken family of Byte Pair Encoding tokenizers, while Anthropic models use their own proprietary tokenizer stack. Gemini and Grok each have their own provider-specific token handling as well. These systems are optimized for efficient text representation, but they do not always split the same prompt the same way.

That means the same English text can land in slightly different token counts across providers. The difference is usually small; the larger factor is the model tier and the output you ask it to generate.

Key insight: compare costs, not just token counts. Tokens x price per token is the real equation. The lab below uses a rough English heuristic, but actual token counts vary by tokenizer, content type, tools, files, and model behavior.

All four providers follow a tiered pricing strategy: cheap models for high-volume simple tasks, balanced models for day-to-day production work, and premium models for complex reasoning. Current routing patterns usually begin with Claude Haiku 3.5, GPT-5 mini, Gemini 3.1 Flash-Lite, or Grok 4.3, then escalate only when needed.

Claude's approach: Haiku 3.5 is the low-cost entry point, Sonnet 4 is the balanced tier, and Opus 4.1 is the premium tier with the highest output price.
OpenAI's approach (May 2026): GPT-5.5, GPT-5.2, GPT-5 mini, and GPT-4.1 cover the main reasoning and lower-cost routing tiers.
Gemini and Grok: Gemini 3.1 Pro handles deep long-context analysis, Gemini 3.1 Flash-Lite handles balanced throughput, and Grok 4.3 covers xAI's flagship route.

The key economic insight: route by task, not by model prestige. Compare your real workload rather than assuming a model label tells the full cost story.

The Token Cost Formula

Cost is the sum of input and output token costs. Output tokens are typically more expensive than input tokens because the model must generate them token-by-token.

Total Cost = (Input Tokens x Input Price) + (Output Tokens x Output Price)

Use the standalone TokenOps lab for live calculations with the current shared pricing object.

Live pricing example is rendered from the shared pricing object.

The ratio of output price to input price varies by provider and model tier. The practical implication: output budget is precious. If you don't need 500 tokens of output, ask for 200.
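To make the output-to-input ratio concrete, here is a small sketch that computes it per model and then prices the effect of trimming output length. The rates are copied from this article's cost calculator example and should be treated as a snapshot; the function names are illustrative.

```python
# Rates in USD per million tokens, copied from this article's cost
# calculator example; verify against current provider pricing before use.
RATES = {
    "claude-opus-4-1": {"input": 15.00, "output": 75.00},
    "gpt-5-2": {"input": 1.75, "output": 14.00},
    "gemini-3.1-pro": {"input": 2.00, "output": 12.00},
    "grok-4.3": {"input": 1.25, "output": 2.50},
}


def output_to_input_ratio(model: str) -> float:
    """How many times more an output token costs than an input token."""
    rates = RATES[model]
    return rates["output"] / rates["input"]


def output_savings(model: str, requests: int, tokens_before: int, tokens_after: int) -> float:
    """USD saved by trimming the average output length across `requests` calls."""
    saved_tokens = (tokens_before - tokens_after) * requests
    return saved_tokens * RATES[model]["output"] / 1_000_000


for name in RATES:
    print(f"{name}: output costs {output_to_input_ratio(name):.1f}x input")

# Trimming 500 output tokens to 200 over 100,000 requests.
print(f"${output_savings('claude-opus-4-1', 100_000, 500, 200):,.2f} saved")
```

With these rates the ratio ranges from 2x to 8x, which is why the same output cap saves very different amounts depending on the model.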

At Scale, Model Choice Multiplies

Single-request costs are small. At scale, they compound. Process 100,000 documents monthly, and the model you pick determines the shape of your bill.

Example: 100,000 requests, each with 500 input tokens and 50 output tokens

Batch totals are rendered from the shared pricing object.

The spread is wide enough that model selection is a product engineering decision, not just an API choice. A team can save meaningful budget simply by routing simple workloads to lower tiers.
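The batch example above (100,000 requests at 500 input and 50 output tokens each) can be worked through offline with a short sketch. It reuses the per-million-token rates from this article's cost calculator example, which may be out of date by the time you read this; `batch_cost` is an illustrative name.

```python
# Rates in USD per million tokens, copied from this article's cost
# calculator example; treat them as a snapshot, not current pricing.
RATES = {
    "claude-opus-4-1": {"input": 15.00, "output": 75.00},
    "gpt-5-2": {"input": 1.75, "output": 14.00},
    "gemini-3.1-pro": {"input": 2.00, "output": 12.00},
    "grok-4.3": {"input": 1.25, "output": 2.50},
}


def batch_cost(model: str, requests: int, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for `requests` calls with fixed per-request token counts."""
    rates = RATES[model]
    total_input = requests * input_tokens
    total_output = requests * output_tokens
    return (total_input * rates["input"] + total_output * rates["output"]) / 1_000_000


# Print models from cheapest to most expensive for the example workload.
for name in sorted(RATES, key=lambda m: batch_cost(m, 100_000, 500, 50)):
    print(f"{name:<16} ${batch_cost(name, 100_000, 500, 50):>9,.2f}")
```

Under these assumed rates the monthly total ranges from $75.00 to $1,125.00 for the same workload, a 15x spread.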

Try the Standalone Lab

For browser-local calculations, model comparison, and planning without leaving the article, use the dedicated TokenOps lab.

Open TokenOps Lab: Launch the standalone cost engineering lab.

Cost Optimization Levers

  • Model Routing: Not all requests need your most expensive model. Try Claude Haiku 3.5, GPT-5 mini, Gemini 3.1 Flash-Lite, or Grok 4.3 first; escalate only if confidence is low.
  • Prompt Caching: If the same system prompt or context is reused across requests, enable prompt caching where supported. Cached tokens cost far less than fresh input tokens, so caching should be on by default for any reused prefix.
  • Output Budget: You pay for every output token. Don't ask for 500 tokens if 100 will do. Constrain `max_tokens` to what you actually need.
  • Batch APIs: All major providers now offer batch-style or async workflows for non-urgent tasks. For lower-priority workloads, batching is often the most economical route.
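The model routing lever can be sketched as a simple cascade: call the cheapest tier first and escalate only when confidence falls below a threshold. This is a minimal illustration under stated assumptions. The model functions are stubs standing in for real API calls, and how a confidence score is obtained (a classifier probability, a self-check prompt, a validator) is left open.

```python
from typing import Callable, List, Tuple

# A model call is represented as a function returning (answer, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


def route(prompt: str, tiers: List[Tuple[str, ModelFn]],
          min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try tiers from cheapest to most expensive; stop at the first confident answer.

    Returns (tier_name, answer). The last tier's answer is returned even if its
    confidence is below the threshold, since there is nothing left to escalate to.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, call in tiers:
        answer, confidence = call(prompt)
        if confidence >= min_confidence:
            break
    return name, answer


# Stub models standing in for real API calls (hypothetical behaviour).
def cheap_model(prompt: str) -> Tuple[str, float]:
    # Pretend the cheap tier is only confident on short prompts.
    confidence = 0.9 if len(prompt) < 40 else 0.4
    return "cheap answer", confidence


def premium_model(prompt: str) -> Tuple[str, float]:
    return "premium answer", 0.95


tiers = [("cheap", cheap_model), ("premium", premium_model)]
print(route("Classify: great product!", tiers))
print(route("Analyze the contractual implications of clause 14(b) in detail.", tiers))
```

Note that an escalated request pays for both calls, so the cascade only saves money when most traffic is resolved at the cheap tier.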

Code Examples: Token Counting & Cost Calculation

Basic token counter (JavaScript):

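function estimateTokens(text) {
  // Rough heuristic: 1 token ~ 0.75 words in English text
  const trimmed = text.trim();
  // Guard: "".split(/\s+/) returns [""], which would count as one word
  if (trimmed === "") return 0;
  const words = trimmed.split(/\s+/).length;
  return Math.round(words / 0.75);
}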

const prompt = "Explain machine learning in 100 words";
const tokens = estimateTokens(prompt);
console.log(`~${tokens} tokens`); // Output: ~8 tokens (6 words / 0.75)

Cost calculator (Python):

# Rates are in USD per million tokens
MODELS = {
  "claude-opus-4-1": {"input": 15.00, "output": 75.00},
  "gpt-5-2": {"input": 1.75, "output": 14.00},
  "gemini-3.1-pro": {"input": 2.00, "output": 12.00},
  "grok-4.3": {"input": 1.25, "output": 2.50},
}

def calculate_cost(model, input_tokens, output_tokens):
  rates = MODELS[model]
  cost = (input_tokens * rates["input"] +
        output_tokens * rates["output"]) / 1_000_000
  return round(cost, 6)

cost = calculate_cost("claude-opus-4-1", 2000, 500)
print(f"Cost: ${cost:.4f}") # Output: Cost: $0.0675