The invisible cost of wasted tokens and how to stop burning money with AI
Most developers spend 30-50% more tokens than needed. Understand why this happens, what it actually costs and which techniques reduce waste without losing quality.
How much do you spend per month on AI for coding? If you use Claude, GPT-4 or any frontier model via API, the answer is probably “more than you should”. And the worst part: most of that cost is invisible.
I’m not talking about fixed monthly plans. I’m talking about those using the API directly, whether via Claude Code, autonomous agents or CI/CD pipelines integrated with LLMs. In that scenario, every token counts, and most people are wasting between 30% and 50% of them.
Anatomy of waste
Wasted tokens come from four main sources:
1. Repeated context between sessions
You open a new session with your AI agent and explain: “I’m working on project X, which uses TypeScript, Prisma, the database is PostgreSQL, the API follows REST with JWT authentication…”. Next session, you repeat everything. And the next. And the next.
This repetitive briefing consumes between 500 and 2,000 tokens per session. Over 50 sessions per month, that’s 25,000 to 100,000 tokens spent saying the same thing.
2. Bloated tool responses
When an agent calls list_directory, it receives the entire project tree. When it calls read_file, it gets the complete file. Most of the time, the agent only needed a fraction of that information.
In a typical session with 20 tool calls, irrelevant response content can account for 40,000 wasted tokens. The agent processes everything but uses only a fraction.
3. Context rot
Research from Chroma demonstrated that LLM performance drops significantly when the context window exceeds 32K tokens. The model doesn’t become “dumb”, it becomes distracted. With too much information in the window, attention dilutes and responses become less precise.
The paradox: you pay more tokens to get worse responses. Each irrelevant token doesn’t just cost money, it reduces the quality of the response.
4. Non-existent caching
If you ask “how does the authentication module work?” and then “explain the auth flow”, those are two semantically identical queries processed from scratch. Without semantic caching, the agent duplicates search, synthesis and response.
What it actually costs (real numbers)
Let’s do the math with current pricing:
| Model | Price (input) | Price (output) |
|---|---|---|
| Claude Opus 4 | $15/1M tokens | $75/1M tokens |
| Claude Sonnet 4 | $3/1M tokens | $15/1M tokens |
| GPT-4o | $2.50/1M tokens | $10/1M tokens |
An active developer using Claude Sonnet via API with 50 interactions/day, ~5K tokens of input and ~5K of output per interaction:
- Monthly consumption: ~7.5M input tokens + ~7.5M output tokens
- Cost without waste: ~$22.50 (input) + ~$112.50 (output) = ~$135/month
- With 40% waste: ~$189/month
- Difference: ~$54/month per person
In a team of 5 developers, that’s $270/month thrown away. In a year, $3,240. And that’s with Sonnet — with Opus, multiply by 5x.
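The arithmetic above is easy to reproduce. A minimal sketch, assuming ~7.5M input and ~7.5M output tokens per month at Sonnet list pricing:

```python
# Back-of-the-envelope cost model for the scenario above.
# Assumes ~7.5M input and ~7.5M output tokens/month (Claude Sonnet pricing).
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token

def monthly_cost(input_tokens: int, output_tokens: int, waste: float = 0.0) -> float:
    """Cost in dollars; `waste` inflates both sides proportionally."""
    base = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
    return base * (1 + waste)

lean = monthly_cost(7_500_000, 7_500_000)           # no waste
wasteful = monthly_cost(7_500_000, 7_500_000, 0.4)  # 40% waste
print(f"${lean:.2f} vs ${wasteful:.2f} -> ${wasteful - lean:.2f}/month wasted")
```

Swap in the Opus prices from the table and the same 40% waste costs five times as much.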
For autonomous agents that make dozens of tool calls per task, waste is even higher. An agent solving a bug can easily consume 50K tokens in tool calls that return irrelevant information.
Techniques that reduce waste
1. Persistent memory
Instead of repeating context every session, use a tool that persists decisions and patterns. MCP Context Hub, for example, stores project information and delivers only relevant chunks when the agent needs them. The 2,000-token briefing becomes a 500-token optimized context_pack.
Estimated savings: 50-70% reduction in repetitive context tokens.
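The idea can be sketched in a few lines: persist the project brief once, then load only the slices a given question needs. This is a toy illustration, not MCP Context Hub's implementation; the file path and field names are made up.

```python
import json
from pathlib import Path

BRIEF_PATH = Path(".agent/project_brief.json")  # hypothetical location

def save_brief(brief: dict) -> None:
    """Persist project facts once, instead of re-typing them every session."""
    BRIEF_PATH.parent.mkdir(parents=True, exist_ok=True)
    BRIEF_PATH.write_text(json.dumps(brief, indent=2))

def load_context_pack(topics: list[str]) -> str:
    """Return only the chunks relevant to the requested topics."""
    brief = json.loads(BRIEF_PATH.read_text())
    relevant = {k: v for k, v in brief.items() if k in topics}
    return "\n".join(f"{k}: {v}" for k, v in relevant.items())

save_brief({
    "language": "TypeScript",
    "orm": "Prisma",
    "database": "PostgreSQL",
    "auth": "REST API with JWT",
})
print(load_context_pack(["auth", "database"]))
```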
2. Context compression
Instead of sending raw documentation, compress while keeping essential information. The technique isn’t new — the “LLMLingua” paper from Microsoft Research (EMNLP 2023) showed prompts can be compressed up to 20x without significant quality loss.
In practice, dual-path compression works well: rules for 80% of cases (fast, under 50ms) and local LLM for the complex 20%.
Estimated savings: 30-60% reduction in sent context size.
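The fast rules path can be as simple as stripping comments, blank lines and redundant whitespace before the text reaches the model. A toy sketch of that path (the local-LLM fallback for the complex 20% is not shown):

```python
import re

def compress_context(text: str) -> str:
    """Cheap rule-based compression: drop blank lines and '#' line
    comments, collapse runs of whitespace. Covers the easy cases;
    a local LLM would handle semantic summarization (not shown)."""
    lines = []
    for line in text.splitlines():
        line = re.sub(r"#.*$", "", line)        # strip line comments
        line = re.sub(r"\s+", " ", line).strip()  # collapse whitespace
        if line:
            lines.append(line)
    return "\n".join(lines)

doc = """
# Setup notes
The   API    uses JWT tokens.

# TODO: clean up
Sessions expire after 24h.
"""
print(compress_context(doc))
```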
3. Progressive disclosure for tools
Instead of letting raw tool responses go directly into the agent’s context, use progressive detail levels. Start with a summary (100 tokens). If the agent needs more, it asks and receives the next level.
MCP Context Hub’s proxy_call implements this with 4 levels (L0-L3) and automatic expand-on-use. In practice, it saves between 60% and 90% of tokens spent on external tools.
Estimated savings: 60-90% reduction in tool response tokens.
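The mechanism looks roughly like this: the full tool result is parked behind a handle, and only a summary enters the context. This is a two-level sketch under my own naming, not the Hub's actual 4-level implementation.

```python
import uuid

_store: dict[str, str] = {}  # full tool responses, keyed by handle

def summarize(text: str, limit: int = 120) -> str:
    """Toy summarizer: first `limit` characters. A real proxy would
    produce a structural summary with deeper levels on demand."""
    return text[:limit] + ("…" if len(text) > limit else "")

def proxy_call(tool_result: str) -> dict:
    """Return a small summary plus a handle the agent can expand later."""
    handle = str(uuid.uuid4())
    _store[handle] = tool_result
    return {"summary": summarize(tool_result), "handle": handle}

def expand(handle: str) -> str:
    """Fetch the full payload only when the agent actually asks for it."""
    return _store[handle]

result = proxy_call("line 1\n" * 500)  # a bloated tool response
print(len(result["summary"]), "chars sent instead of", len(expand(result["handle"])))
```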
4. Semantic caching
Store responses and recognize when a new question is semantically equivalent to one already answered. Exact matching isn’t required: cosine similarity above 0.88 is sufficient for most cases.
Semantic caching transforms a 1.3-second search into a 1.6ms response. Beyond saving tokens, it saves time and API calls.
Estimated savings: 20-40% reduction in repetitive queries.
5. Session delta tracking
Track what information the agent has already received in the current session. If it already has context about authentication, don’t send it again when a related question comes up. Send only the delta.
Estimated savings: 30-50% reduction in long sessions.
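Delta tracking amounts to keeping a set of already-sent chunk identifiers per session. A minimal sketch, with made-up chunk IDs:

```python
class SessionDelta:
    """Track which context chunks were already sent this session,
    and send only what's new."""

    def __init__(self):
        self.sent: set[str] = set()

    def delta(self, chunk_ids: list[str]) -> list[str]:
        """Return only the chunks not yet sent, and mark them as sent."""
        new = [c for c in chunk_ids if c not in self.sent]
        self.sent.update(new)
        return new

session = SessionDelta()
print(session.delta(["auth-overview", "jwt-config"]))  # first ask: both sent
print(session.delta(["auth-overview", "db-schema"]))   # later ask: only the delta
```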
6. Prompt engineering for token efficiency
The way you write prompts has a direct impact on token consumption. Consider these two prompts for the same task:
Verbose prompt (~90 tokens):

```
I need you to look at my codebase and find all the places where we're
using the old authentication method. The old method uses session-based
auth with cookies. I want you to identify each file, show me the
relevant code, explain what it does, and then suggest how to migrate
it to JWT-based authentication. Please be thorough and check
everywhere including tests, middleware, and utility functions.
```

Optimized prompt (~40 tokens):

```
Find all session-based auth usage in the codebase (cookies, session
stores, session middleware). For each: file path, relevant code
snippet, migration path to JWT. Include tests and middleware.
```

Both produce equivalent results, but the second uses less than half the input tokens. Over hundreds of interactions, this adds up. The key is removing filler words, redundant instructions, and over-explanation. The model doesn’t need politeness or narrative context to execute a technical task.
7. Model routing by task complexity
Not every task needs the most expensive model. A simple file rename or formatting fix doesn’t require Opus-level reasoning. In practice, I route tasks across three tiers:
- Simple tasks (formatting, renaming, boilerplate): Haiku or GPT-4o-mini. Cost per task: ~$0.001
- Medium tasks (bug fixes, test generation, code review): Sonnet or GPT-4o. Cost per task: ~$0.02
- Complex tasks (architectural analysis, multi-file refactors): Opus. Cost per task: ~$0.15
Routing 60% of tasks to cheaper models reduces overall API spend by 40-50% with minimal quality impact. Several agent frameworks now support automatic model routing based on task classification.
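A router can start as something very simple. This sketch classifies by keyword; the model names, tier costs and keyword lists are illustrative (a production router would use a trained classifier or a cheap LLM for this step):

```python
# Hypothetical tier table based on the breakdown above; model names
# and per-task costs are illustrative, not official pricing.
TIERS = {
    "simple":  {"model": "claude-haiku",  "est_cost": 0.001},
    "medium":  {"model": "claude-sonnet", "est_cost": 0.02},
    "complex": {"model": "claude-opus",   "est_cost": 0.15},
}

SIMPLE_KEYWORDS = {"rename", "format", "boilerplate"}
COMPLEX_KEYWORDS = {"architecture", "refactor", "design"}

def route(task_description: str) -> str:
    """Naive keyword classifier: pick the cheapest tier that fits."""
    words = set(task_description.lower().split())
    if words & COMPLEX_KEYWORDS:
        return TIERS["complex"]["model"]
    if words & SIMPLE_KEYWORDS:
        return TIERS["simple"]["model"]
    return TIERS["medium"]["model"]  # default: the middle tier
```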
Monitoring token usage in practice
You can’t optimize what you don’t measure. Here’s how I track token usage across different setups.
API-level tracking
If you’re calling the API directly, every response includes a usage object:
```json
{
  "usage": {
    "input_tokens": 3847,
    "output_tokens": 1293,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 2100
  }
}
```
I log these to a simple SQLite database with timestamps and task labels. A weekly script generates a report showing: total tokens by day, average tokens per task type, cache hit rates, and cost breakdowns. This takes about 30 minutes to set up and saves thousands of dollars over time.
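A sketch of that logging setup, using only the standard library (the table schema and task labels are my own; adapt the column names to the usage fields your provider returns):

```python
import sqlite3
import time

def init_db(path: str = "usage.db") -> sqlite3.Connection:
    """Create the usage table if it doesn't exist yet."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS usage (
        ts REAL, task TEXT, input_tokens INT, output_tokens INT,
        cache_read_tokens INT)""")
    return conn

def log_usage(conn: sqlite3.Connection, task: str, usage: dict) -> None:
    """Record the `usage` object from an API response."""
    conn.execute(
        "INSERT INTO usage VALUES (?, ?, ?, ?, ?)",
        (time.time(), task,
         usage.get("input_tokens", 0),
         usage.get("output_tokens", 0),
         usage.get("cache_read_input_tokens", 0)))
    conn.commit()

def totals(conn: sqlite3.Connection) -> tuple:
    """Aggregate input/output tokens; filter by date for weekly reports."""
    return conn.execute(
        "SELECT SUM(input_tokens), SUM(output_tokens) FROM usage").fetchone()
```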
Agent-level tracking
For Claude Code sessions, I track usage through the session metadata. After each session, I record: task description, total tokens consumed, session duration, and whether the task succeeded. Over time, this data reveals which types of tasks are token-efficient and which ones consistently burn through context.
Team dashboards
For teams, I recommend aggregating token usage by developer and task type. Not for surveillance, but for identifying optimization opportunities. When one developer consistently uses 3x more tokens than another for similar tasks, there’s usually a prompt engineering or workflow difference that can be shared.
Setting budgets and alerts
Most API providers support usage limits, but they’re coarse-grained. For fine-grained control, I set soft limits at the application level: if a single agent session exceeds 100K tokens, log a warning. If it exceeds 200K, pause and ask for human confirmation. This prevents runaway sessions where the agent gets stuck in a loop and burns through tokens trying the same failing approach repeatedly.
A real example: I once had an agent stuck trying to fix a test that depended on a timezone-specific behavior. It tried 14 different approaches, consuming 180K tokens, before I noticed. A 100K token budget cap would have saved $12 and 45 minutes.
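The soft-limit logic described above fits in a few lines. A sketch using the 100K/200K thresholds mentioned (what "pause" does in practice depends on your agent harness):

```python
SOFT_LIMIT = 100_000  # log a warning past this
HARD_LIMIT = 200_000  # pause and ask for human confirmation past this

class TokenBudget:
    """Running per-session token total with soft and hard limits."""

    def __init__(self):
        self.used = 0

    def record(self, tokens: int) -> str:
        """Return 'ok', 'warn', or 'pause' for the running total."""
        self.used += tokens
        if self.used > HARD_LIMIT:
            return "pause"  # stop the session, require a human
        if self.used > SOFT_LIMIT:
            return "warn"   # keep going, but log it
        return "ok"
```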
The compound effect
These techniques aren’t mutually exclusive. In practice, they compound:
- Semantic cache intercepts repetitive queries (20-40% savings)
- Session tracking removes already-sent chunks (30-50% more)
- Compression reduces remaining chunk sizes (30-60% more)
- Progressive disclosure optimizes tool responses (60-90%)
The compound effect reaches 80%+ total token savings. That’s not a theoretical number: it’s what MCP Context Hub benchmarks demonstrate with a real test corpus and an open methodology.
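One rough way to sanity-check the compound figure: take the midpoint of each range and stack the reductions multiplicatively. This overestimates, because the techniques partially overlap (each targets a different slice of the token stream), but it shows why the total lands well above any single technique:

```python
# Midpoint savings rates from the list above. Treating them as
# sequential multiplicative reductions is a simplification.
rates = {
    "semantic_cache": 0.30,
    "session_delta": 0.40,
    "compression": 0.45,
    "progressive_disclosure": 0.75,
}

remaining = 1.0
for rate in rates.values():
    remaining *= (1 - rate)  # tokens left after each technique

print(f"Compound savings (upper bound): {1 - remaining:.0%}")
```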
A real cost comparison: before and after optimization
To illustrate the compound effect with concrete numbers, here’s data from a real project where I tracked token usage for a team of 3 developers over 4 weeks.
Before optimization (weeks 1-2):
- Average daily tokens per developer: 280K
- Average cost per developer per day: $4.20 (Sonnet)
- Weekly team cost: $63
- Common waste patterns: repeated project briefings (18%), bloated file reads (25%), redundant queries (12%)
After optimization (weeks 3-4, using Context Hub + prompt guidelines):
- Average daily tokens per developer: 145K
- Average cost per developer per day: $2.17
- Weekly team cost: $32.55
- Reduction: 48% fewer tokens, 48% lower cost
The quality metrics also improved. Task completion rate went from 73% (first attempt) to 84%. The agents were making better decisions with less but more relevant context. Fewer tokens, better results, lower cost. That’s the trifecta.
For teams using Opus for complex tasks, the savings are even more dramatic. One Opus-heavy workflow I optimized went from $890/month to $340/month just by implementing progressive disclosure and semantic caching. The developer didn’t change how they worked. The context layer changed how the agent consumed information.
Beyond cost: quality
The economic argument is compelling, but the impact on quality is even more important.
Context rot is real. A model receiving 50K tokens of context where only 10K are relevant produces worse responses than the same model receiving only the relevant 10K. This is documented in multiple papers and observable in daily practice.
Optimizing tokens isn’t just about saving money. It’s about giving the model ideal conditions to respond well. Less noise, more signal. Less distraction, more focus.
And for autonomous agents, this is critical. An agent with optimized context makes fewer mistakes, needs fewer corrections and completes tasks faster. The ROI isn’t just in saved tokens — it’s in time not spent fixing AI errors.
Start with the basics
If you don’t want to adopt a full optimization stack, start simple:
- CLAUDE.md / .cursorrules: document project patterns and decisions. The agent reads it once and you don’t need to repeat yourself.
- Be specific in prompts: “fix the bug in the authentication middleware at line 47 of auth.ts” uses fewer tokens than “there’s a bug in login, take a look”.
- Break down large tasks: instead of “refactor the entire module”, split into sub-tasks. Each sub-task uses less context and produces better results.
- Review before sending: if you’re pasting a 500-line file into the prompt, ask: “does the agent need all of this or just a specific section?”
These practices already significantly reduce waste. And when you want to go further, tools like MCP Context Hub automate the entire process.
The point is simple: tokens are finite, paid resources. Treating them with the same care you give API calls or database queries isn’t premature optimization. It’s basic software engineering.