
10 strategies I use to slash token usage without compromising quality and reliability


May 7, 2026 · By Nitin Motgi · 14 min read

After months of building complex, production-grade systems with Claude Code, I've accumulated a set of hard-won practices that have dramatically reduced my token consumption—often by 60–80%—while improving output quality, reliability, and development speed. These aren't theoretical tips; they're daily drivers that let me tackle ambitious projects without burning through credits or hitting frustrating context limits.

Here are the key lessons, complete with explanations, exact configurations, and real-world examples.

Ditch the 1M context window—stick to 200K

The biggest mistake I see (and made myself) is defaulting to the full 1M context. Don't do it.

Set these environment variables immediately:

bash
# add these to your shell profile (e.g. ~/.zshrc) so every session picks them up
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=80

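If you'd rather manage these in one place than in shell exports, Claude Code can also read environment variables from its settings file. A minimal sketch, assuming the standard .claude/settings.json env block:

json
{
  "env": {
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "80"
  }
}
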
Why this matters

200K is plenty. For almost every coding task—debugging, feature implementation, refactoring—200K provides more than enough relevant history without the noise and dilution that comes with 1M. The larger window invites problems: it encourages the model to "remember" irrelevant details from hours ago, leading to context drift, repetitive suggestions, and higher hallucination rates. It also costs significantly more per request.

With the autocompact override set to 80%, Claude automatically compacts around the 75–80% mark of your 200K window, keeping things running smoothly without sudden "context full" errors.

The power of manual /compact

Don't wait for autocompact. Proactively run /compact when your context hits 50–60%. It forces a clean summary of what actually matters right now, removes conversational dead weight (old failed attempts, tangential discussions), keeps the model's attention laser-focused on the current sub-problem, and prevents the slow degradation in reasoning quality that happens as context fills with cruft.

After a successful /compact, immediately follow up with a one-sentence recap of your goal. This re-anchors the session perfectly.

workflow
/compact
Current goal: Implement user authentication with JWT refresh tokens in the Express backend.

Use /clear religiously when starting a new problem

This is non-negotiable. The moment you shift to a meaningfully different problem—new feature, different bug, architecture decision—type /clear.

Previous context from an unrelated task pollutes the new one. The model starts thinking in the wrong frame of reference, and you waste tokens making it unlearn old assumptions. Breaking work into small, focused problems with /clear between them is the foundation of efficient Claude Code usage. It turns one giant, expensive session into many crisp, high-signal ones.
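In practice the hand-off is one command plus a fresh one-sentence framing (the tasks here are illustrative):

workflow
# payments bug fixed and committed
/clear
New problem: design the webhook retry queue for order events.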

Break work into parallel phases and spawn sub-agents liberally

Once you've decomposed a feature into independent pieces, don't do everything sequentially in the parent session. Spawn sub-agents for each parallel phase—one for the API layer, one for the frontend components, one for tests and validation, one for documentation.

Sub-agents run with their own context windows, which massively reduces load on your main (parent) context. The parent only needs high-level coordination and final integration review. In practice, running 4–6 in parallel yields enormous token savings and speed gains.

example decomposition
Main task: Build real-time collaborative whiteboard
- Sub-agent 1: WebSocket server + Redis pub/sub
- Sub-agent 2: Canvas drawing logic + CRDT conflict resolution
- Sub-agent 3: User presence and cursor synchronization
- Sub-agent 4: Permission system and access control

Each sub-agent gets a tight, focused prompt and a /clear at the start.
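To make that concrete, here is what a tight prompt for the first sub-agent might look like (the wording is illustrative):

example sub-agent prompt
You own the WebSocket server and Redis pub/sub layer only.
Scope: connection lifecycle, room channels, and message fan-out.
Do not touch canvas logic, presence, or permissions.
Deliverable: the server module plus an integration test; report a short summary back to the parent.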

Delegate aggressively with /ask [model]

Don't be a hero—distribute the load across different LLMs. With custom skills you can do things like:

delegation examples
/ask cursor Implement the React hook for optimistic updates
/ask gemini Write the SQL migration for the new audit log table
/ask codex Generate the OpenAPI spec from the Express routes

This spreads token usage across providers (avoiding rate limits on any one model) and plays to each model's strengths. It also keeps your main Claude session clean and focused on orchestration. The more precise the delegated request, the better the result.
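These /ask targets aren't built in; they come from custom skills or commands you define yourself. As a sketch, here is one way to approximate a single-target variant like /ask-gemini with a custom slash command, assuming the .claude/commands layout and a gemini CLI on your PATH (both are assumptions about your setup):

example (.claude/commands/ask-gemini.md)
---
allowed-tools: Bash(gemini:*)
description: Delegate a focused task to the Gemini CLI
---
<!-- hypothetical sketch; adjust the CLI call to your environment -->
Summarize this delegated result for the current session:
!`gemini -p "$ARGUMENTS"`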

Never change models mid-session

This seems obvious but is easy to forget. Once you start a session with Sonnet (or whatever model), stay with it until the task is complete or you /clear. Model switches can reset subtle behavioral patterns, cause the new model to misinterpret accumulated context, and waste tokens on re-explaining things. Consistency wins.

Vague requests are token vampires—be ruthlessly specific

This is the difference between mediocre and exceptional results. Compare these two requests:

Vague: "Fix the login bug"

Specific: "In auth/login.ts, the handleLogin function (lines 42–67) fails silently when the user enters an email with uppercase letters. The backend expects lowercase. Add client-side normalization using .toLowerCase().trim(), show a clear error toast 'Email must be lowercase', and ensure the request still succeeds after normalization. Also add a unit test for mixed-case emails."

Specificity wins because the model doesn't waste tokens guessing what you want, you get exactly the behavior you need on the first try, and fewer follow-up correction rounds mean massive token savings. If your request could be interpreted multiple ways, rewrite it until there's only one possible interpretation.
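As a sanity check on that prompt, here is roughly the code it should produce. A sketch only: the real handleLogin and toast wiring are omitted, and the helper name is mine, not from the actual project:

typescript
// Sketch of the normalization the specific prompt asks for (helper name is illustrative).
export function normalizeEmail(raw: string): string {
  // The backend expects lowercase; trim stray whitespace while we're at it.
  return raw.trim().toLowerCase();
}

// Jest-style unit test for mixed-case emails, as the prompt requires.
test("mixed-case emails are normalized", () => {
  expect(normalizeEmail("  User@Example.COM ")).toBe("user@example.com");
});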

Install Caveman to make Claude talk like a caveman (65%+ token savings)

This one changed everything. The Caveman plugin forces Claude (and many other agents) to respond in ultra-terse language—removing articles, filler words, pleasantries, and hedging while keeping 100% technical accuracy.

Benchmarks show roughly 65% average token reduction, up to 87% on React debugging tasks, and approximately 3x faster responses with dramatically cleaner, more actionable output.

install
claude plugin marketplace add JuliusBrussee/caveman
claude plugin install caveman@caveman

# Then activate
/caveman
# or
/caveman ultra

You also get specialized commands: /caveman-commit for perfect conventional commit messages, /caveman-review for one-line PR feedback, and /caveman:compress CLAUDE.md, which shrinks your rules file by roughly 46%.

The transformation in practice

Here's what the difference looks like on a real response about unnecessary re-renders:

normal mode
"I notice that you're creating a new object reference on every render in your
component. This is causing unnecessary re-renders because React sees it as a
new prop. You should wrap it in useMemo to stabilize the reference..."
caveman mode
"New object ref each render. Inline object prop = new ref = re-render. Wrap in useMemo."

Same information. 80% fewer tokens. Faster. Better. Running Caveman in full mode for almost everything is worth it—the output is addictive once you get used to it.

Keep Sonnet effort at medium—rarely use xhigh or max

For Sonnet specifically, medium effort is the sweet spot for most coding work. Higher effort modes often produce more verbose output, over-engineering, diminishing returns on quality, and higher token burn. Medium gives excellent reasoning with crisp, focused responses. Save the higher settings for truly novel architectural problems—and even then, staying on medium often produces better results.

Stop pasting screenshots—describe the problem or use agent-browser

This was a huge shift in my workflow. Instead of pasting a screenshot with "What's wrong here?", either describe the problem precisely in text or use a purpose-built tool like vercel-labs/agent-browser.

The text-first approach forces you to actually diagnose the problem before asking, which often means you find it yourself. When you do need AI help, a precise description beats an image every time—no visual ambiguity, no token cost for image processing, and Claude can reason about the exact state described.

The agent-browser approach takes this further. It uses accessibility tree snapshots with stable element references (@e1, @e2) instead of screenshots, deterministic interaction for reliable click/type/inspect without visual guessing, and a full debugging toolkit—console logs, network requests, React component tree, styles—all queryable via text commands. This makes frontend debugging with Claude dramatically more reliable and token-efficient.

example
npx skills add vercel-labs/agent-browser
# Then in Claude:
/agent-browser chat "Navigate to the dashboard and tell me what the console errors are when I click 'Export CSV'"

You get structured, machine-readable output instead of "it looks like there's a red error somewhere."

Use code-review-graph for AST-powered, minimal-context reviews

The code-review-graph tool builds a persistent, local knowledge graph of your entire codebase using Tree-sitter AST parsing. Instead of Claude reading 5–10 full files (thousands of tokens) for every review, it:

- parses your code into a graph of functions, classes, imports, calls, and inheritance
- tracks blast radius: exactly which functions and tests are affected by a change
- delivers minimal, targeted context (roughly 100 tokens instead of 5,000+), updating incrementally in under two seconds on every file change

Real results from using this: 6.8x fewer tokens on code reviews, up to 49x fewer tokens on daily coding tasks, and 100% recall on impact analysis—it never misses affected code.

setup
pip install code-review-graph
code-review-graph install --platform claude-code
code-review-graph build

Then use commands like /code-review-graph:review-pr and code-review-graph detect-changes, plus semantic search across your architecture.
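A typical review pass then looks like this, using only the commands above (the exact output shape will vary):

example review
code-review-graph detect-changes
# Then in Claude:
/code-review-graph:review-pr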

When you ask Claude to review a PR, it no longer needs to re-read the entire monorepo. It queries the graph and gets precisely the relevant slices. Context window stays clean, costs plummet, and reviews become faster and more accurate. This is now mandatory infrastructure for any serious Claude Code project.

Final thoughts: measure, iterate, and build systems that last

These ten practices have transformed how I work with Claude Code. Token burn is no longer the limiting factor; creativity and problem decomposition are. That said, despite all this care, I still run out of tokens. I last longer before falling back to on-demand burn than I used to, but I am not yet happy with where I am, and I will keep figuring out how to work more efficiently.

The current daily ritual: start with /clear and Caveman mode; decompose and spawn sub-agents where possible; delegate aggressively with /ask; compact proactively at 50–60%; use code-review-graph for any code inspection; describe problems or use agent-browser instead of images; stay on Sonnet at medium effort; and be obsessively specific in every prompt.

The result: features that used to take 3–4 expensive sessions now complete in a single focused, low-cost session—and the code is more reliable because the context was always clean and targeted.

If you're serious about scaling your Claude Code usage, start with the context settings and Caveman installation today. The ROI is immediate.

What are your favorite Claude Code optimizations? Drop them in the comments—I'm always stealing good ideas.