- AI
- tooling
- Claude-Code
- tokens
- context-window
- MCP
Saving Tokens in LLMs: Graphify, Caveman & Context-Mode
The biggest LLM cost isn't the model — it's wasted tokens. Graphify, caveman, and context-mode cut token usage 5–10× in real agentic workflows.
The 200K-token context window was supposed to fix this. Then the 1M window. Then the 2M window. And yet anyone running a real agentic workflow knows the same thing happens every time: forty minutes in, the model starts forgetting what file it was editing, repeating tool calls it already made, and writing summaries that ignore the part you actually asked about.
Bigger windows didn’t make agents cheaper. They made agents lazier with context — and lazier context loses you correctness, latency, and money in roughly that order.
The interesting question isn’t “how big is the window.” It’s “how much of what’s in the window is actually load-bearing.” On most agent runs the answer is depressing: 30–50% raw tool output, 10–20% ceremonial prose, 10% re-reads of files the agent has already seen, 10% stale conversation history. The signal-to-noise ratio collapses and the model starts hallucinating to fill the gaps.
Three open-source tools address this directly: graphify compresses codebase reads 71×, caveman cuts prose output 75%, and context-mode sandboxes tool I/O for a 98% reduction. Used together, they deliver 5–10× total token savings on real agentic workflows without changing the underlying model.
Where the tokens actually go
A few representative offenders, measured on real sessions:
| Source | Typical size | What it should be |
|---|---|---|
| Playwright DOM snapshot | 56 KB | A summary + selector you actually need |
| `npm install` log tail | 45 KB | Last 20 lines, or just the exit code |
| `ls -R` on a mid-size repo | 30–80 KB | A scoped tree, or a `find` with a glob |
| GitHub issue dump (20 issues) | 59 KB | Title + status + 1-line summary |
| Re-read of a 600-line file the agent already saw | 12 KB | Nothing. Use the previous read. |
| “Sure! I’d be happy to help with that…” prose | 50–200 tokens per turn | Drop it. |
None of this is the model’s fault. Each individual decision — read the file, list the directory, run the test — is correct. The bug is architectural: every byte that any tool emits ends up in the conversation forever, with no compression boundary anywhere.
The eight levers
Industry research and benchmarks (Vercel AI SDK evals, Claude prompt-cache reports, LLMLingua paper, agent-trace analyses) converge on roughly the same eight techniques. Most working agent setups use two or three. The teams getting real cost wins use most of them.
| Lever | Mechanism | Typical savings |
|---|---|---|
| Prompt caching | cache_control on stable system prompts and large repo indexes | ~90% on repeat turns |
| Sub-agent handoff | Strong model writes a 2K-token plan; cheap model executes | 50–70% |
| GrepRAG over vector RAG | Lexical search beats embeddings for code; iterative discovery | Avoids ~15% accuracy drop in saturated windows |
| Tool output sandboxing | Run heavy tools in a subprocess; return summary, not stdout | 10K–50K tokens per heavy call |
| On-demand skill loading | Skills load only when the trigger matches | Keeps base prompt < 1K tokens |
| Sliding-window history | Last 5–10 turns raw; older turns compressed to a “context bridge” | 50–70% history weight |
| Structured outputs | XML / JSON tags wrap edits; model stops narrating | 20–30% on output |
| LLMLingua-style compression | Small model prunes low-perplexity tokens from prompts | Up to 20× on long documents |
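The first lever is also the cheapest to try. Below is a minimal sketch using the Anthropic Python SDK, assuming a large, stable system block (tool instructions plus a serialized repo index) that repeats across turns; the file name, model string, and prompt contents are placeholders, not taken from any of the tools below.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for the large, stable blob you otherwise pay for on every turn:
# tool instructions, style guide, a serialized repo map, etc.
REPO_INDEX = open("repo_index.md").read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # any model that supports prompt caching
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a coding agent."},
        {
            "type": "text",
            "text": REPO_INDEX,
            # Marks the prefix up to this block as cacheable; later turns that
            # share the same prefix read it back at a fraction of the input cost.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Where is the retry logic defined?"}],
)

# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens on the repeats; that gap is the ~90% saving.
print(response.usage)
```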
A practical note on the RAG question. For codebases under ~200K tokens, full-context loading is fine and often more accurate than chunking — agentic search with grep plus targeted reads beats vector retrieval almost every time. Above 200K, chunking becomes mandatory, but switch to lexical-first RAG and load “working sets” (the files you actually edit) into context. Vector embeddings have a real “lost in the middle” problem with code that pure grep doesn’t.
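The “agentic search” half of that claim needs very little machinery. Here is a sketch of a lexical-first working-set loader, assuming ripgrep (`rg`) is on the PATH; the function names are mine, not from any library.

```python
import subprocess
from pathlib import Path

def grep_candidates(pattern: str, repo: str, max_files: int = 5) -> list[Path]:
    """Lexical search: files matching the pattern, ranked by match count."""
    out = subprocess.run(
        ["rg", "--count-matches", pattern, repo],
        capture_output=True, text=True,
    ).stdout
    ranked = []
    for line in out.splitlines():  # each line looks like "path/to/file.py:7"
        path, _, count = line.rpartition(":")
        if path:
            ranked.append((int(count), Path(path)))
    return [p for _, p in sorted(ranked, reverse=True)[:max_files]]

def working_set(pattern: str, repo: str) -> str:
    """Load whole candidate files into context instead of embedding chunks."""
    parts = []
    for path in grep_candidates(pattern, repo):
        parts.append(f"=== {path} ===\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

# Example: pull only the files that actually mention the symbol under discussion.
# print(working_set(r"fetch_with_retry", "./src"))
```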
The next three sections are the tools I actually use. Each one attacks a different lever from the table.
graphify — the “stop re-reading files” lever
Graphify is a Claude Code skill that takes any folder — code, PDFs, markdown, screenshots, whiteboard photos — and builds a persistent knowledge graph you query instead of re-reading the raw files.
How it works internally is the interesting part. For code, it uses tree-sitter to extract an AST and a call-graph pass for symbols and dependencies. For documents and papers, it asks Claude to extract concepts and relationships. For images, it uses Claude vision — a screenshot of a whiteboard becomes nodes and edges like any other input. All inputs feed into a NetworkX graph, which is partitioned into communities via the Leiden algorithm (graspologic implementation), then serialized to graph.json plus a vis.js HTML visualizer.
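To make the shape concrete, here is a toy sketch of that pipeline, not graphify’s actual code: extracted symbols and concepts become NetworkX nodes and edges, graspologic’s Leiden implementation assigns communities, and the result is written to graph.json. The example nodes, edges, and provenance tags are invented for illustration.

```python
import json
import networkx as nx
from graspologic.partition import leiden  # Leiden community detection

# Stand-ins for graphify's extraction passes: tree-sitter symbols for code,
# Claude-extracted concepts for docs, vision-extracted entities for images.
edges = [
    ("cli.main", "parser.parse_args", {"kind": "calls", "provenance": "EXTRACTED"}),
    ("parser.parse_args", "config.load", {"kind": "calls", "provenance": "EXTRACTED"}),
    ("paper:backoff-strategies", "retry.fetch_with_retry",
     {"kind": "describes", "provenance": "INFERRED"}),
]

graph = nx.Graph()
graph.add_edges_from(edges)

# Partition into communities; each community becomes one summary unit
# (or one wiki article) the agent can read instead of the raw files.
communities = leiden(graph)  # {node_name: community_id}

payload = {
    "nodes": [{"id": n, "community": communities[n]} for n in graph.nodes],
    "edges": [{"source": u, "target": v, **d} for u, v, d in graph.edges(data=True)],
}
with open("graph.json", "w") as f:
    json.dump(payload, f, indent=2)
```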
The headline number is 71.5× fewer tokens per query versus reading the raw files, measured on a mixed corpus of Karpathy’s repos plus five papers and four images. Token reduction scales with corpus size — at 6 files it’s roughly 1×, at 52 files it’s 71×, at hundreds it’s higher.
Three details worth knowing:
- `--watchmode` keeps the graph in sync as files change. Code edits trigger an instant AST-only rebuild (no LLM call); doc and image changes notify you to run `--update`. Useful when you have multiple agents editing in parallel.
- `--wikimode` generates Wikipedia-style markdown articles per community plus an `index.md` entry point. Point any agent at `index.md` and it can navigate the knowledge base by reading files instead of parsing JSON. This is the part that makes graphify agent-native rather than just a visualization tool.
- Edge honesty. Every relationship is tagged `EXTRACTED`, `INFERRED`, or `AMBIGUOUS`. You always know what was found in the source vs. what the LLM guessed. This is rarer than it should be.
The trap: `INFERRED` edges are guesses. Treat them like a junior dev’s reading notes — useful for navigation, not for correctness claims.
caveman — the “stop writing ceremonial prose” lever
Caveman is a Claude Code skill that compresses the agent’s own output prose by roughly 75% with no loss of technical content. The skill itself is about 1.6 KB of markdown; it overrides the model’s default chat register for the rest of the session.
The rules are short. Drop articles (a/an/the). Drop filler (“just”, “really”, “basically”, “actually”). Drop pleasantries (“Sure!”, “Of course”, “Happy to help”). Drop hedging. Fragments are fine. Use short synonyms (“big” not “extensive”, “fix” not “implement a solution for”). Technical terms stay exact. Code blocks, error messages, function names: never abbreviated.
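A constructed before/after at the default `full` level (my example, not one from the skill’s docs):

Before: “Sure! I looked into this and the problem is basically that the config loader doesn’t actually validate the port value before it tries to bind, so we should probably add a validation step first.”

After: “Config loader binds without validating port. Fix: validate first.”

Same technical content, roughly a quarter of the tokens.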
Six intensity levels:
| Level | Behavior |
|---|---|
| `lite` | No filler or hedging; full sentences kept |
| `full` | Drop articles, fragments OK, short synonyms — the default |
| `ultra` | Abbreviate prose words (DB/auth/config/req/res), arrows for causality |
| `wenyan-lite` | Semi-classical Chinese register, drops filler |
| `wenyan-full` | Full 文言文 — 80–90% character reduction |
| `wenyan-ultra` | Maximum classical-Chinese terseness |
The auto-clarity guard is what makes it actually shippable. The skill explicitly drops compression for security warnings, irreversible action confirmations, and multi-step sequences where omitted conjunctions risk a misreading. Code, commits, and PRs are written in normal English regardless of mode. Compression is for prose only, never for instructions where ordering matters.
The trap: even in full mode, fragment-heavy multi-step instructions can introduce ambiguity (“migrate table drop column backup first” — order unclear without conjunctions). The skill catches the obvious cases; you still need to read your own output before pasting it into a destructive workflow.
context-mode — the “stop dumping raw tool output” lever
Context-mode is an MCP server (ELv2 license) that solves the loudest token leak in modern agent setups: every tool call dumping its full stdout into the conversation forever.
The mechanism: PreToolUse hooks intercept calls to shell, file reads, and web-fetch. The work runs in a sandboxed subprocess. Stdout is indexed into a SQLite FTS5 store. The model gets a structured summary plus section titles back in context — not the raw bytes. To retrieve specifics, the model issues a ctx_search query and gets BM25-scored snippets. The headline measurement: 315 KB of raw output collapses to 5.4 KB in context. 98% reduction.
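A stripped-down sketch of that loop, my illustration rather than context-mode’s source: execute out of band, index stdout into SQLite FTS5, return only a summary, and answer later queries with BM25. The size threshold is where you let genuinely small outputs skip the indirection entirely; standard CPython builds ship SQLite with FTS5 enabled.

```python
import sqlite3
import subprocess

PASSTHROUGH_LIMIT = 1_000  # small outputs go straight to context, unindexed

db = sqlite3.connect("ctx.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(source, body)")

def sandboxed_run(cmd: list[str]) -> str:
    """Run a command, index its output, return only a short summary to context."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    out = proc.stdout + proc.stderr
    if len(out) <= PASSTHROUGH_LIMIT:
        return out  # a one-line error should not require a retrieval round trip
    source = " ".join(cmd)
    for chunk in out.split("\n\n"):  # paragraph-sized chunks for BM25 to rank
        db.execute("INSERT INTO chunks VALUES (?, ?)", (source, chunk))
    db.commit()
    lines = out.splitlines()
    return (f"[{source}] exit={proc.returncode}, {len(lines)} lines indexed. "
            f"First line: {lines[0][:120]!r}. Use ctx_search for specifics.")

def ctx_search(query: str, k: int = 5) -> list[str]:
    """BM25-ranked snippets from everything indexed so far."""
    rows = db.execute(
        "SELECT source, snippet(chunks, 1, '[', ']', '…', 20) "
        "FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (query, k),
    ).fetchall()
    return [f"{src}: {snip}" for src, snip in rows]
```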
Six tools ship with the server:
- `ctx_execute` — run a single command, indexed
- `ctx_batch_execute` — run many commands plus search queries in one round trip (the workhorse — replaces 30+ individual tool calls in a typical research phase)
- `ctx_search` — query the indexed knowledge base
- `ctx_fetch_and_index` — pull a URL, convert HTML to markdown, index it
- `ctx_index` — index arbitrary content
- `ctx_execute_file` — execute a script file in the sandbox
The session-continuity layer is the part most people miss. A PreCompact hook persists every file edit, git operation, task, and user decision to the same SQLite store. When the conversation compacts, context-mode doesn’t dump that data back into the prompt — it leaves it in the index and lets BM25 retrieve only what’s relevant for the next turn. The model picks up exactly where it left off, without paying the full-history token cost.
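Sketching that continuity layer on top of the same toy store from the previous block (again an illustration, not the tool’s code): events go into the index as they happen, and after compaction the next turn retrieves only what it needs.

```python
def persist_event(kind: str, detail: str) -> None:
    """PreCompact-style bookkeeping: edits, git ops, and decisions go to the index."""
    db.execute("INSERT INTO chunks VALUES (?, ?)", (f"session:{kind}", detail))
    db.commit()

# During the session:
persist_event("edit", "src/retry.py: added exponential backoff to fetch_with_retry")
persist_event("decision", "user chose SQLite over Postgres for the cache layer")

# After compaction, the model queries instead of replaying the full history:
print(ctx_search("retry backoff", k=3))
```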
The trap: sandboxing is the right default, but a one-line error message that should hit context now requires an extra retrieval step. Tune the matcher so genuinely small outputs pass through directly.
Stack them
Each of these tools attacks a different lever. They compose:
| Tool | Lever | Ballpark savings |
|---|---|---|
| graphify | Re-reads of files / docs / images | 71.5× per query on mixed corpus |
| caveman | Ceremonial prose in agent output | ~75% on prose tokens |
| context-mode | Raw tool output in conversation | ~98% on tool I/O |
There’s no overlap. Run all three together and the savings multiply: tool I/O gets sandboxed before it ever touches context, codebase memory lives in a graph the agent queries on demand, and the prose the model writes is compressed by default. Most teams I work with hit a 5–10× total token reduction without changing the underlying model or workflow.
The next discipline
Type safety became table-stakes once the cost of `undefined is not a function` got too high to ignore. Test coverage became table-stakes once “ship and pray” stopped scaling. Token economy is the next one — once you’ve watched a `pnpm install` log eat 40% of your context window mid-session, you stop trusting setups that don’t have a sandbox boundary.
The teams who internalize this ship faster agents, at lower cost, with fewer “the model forgot” incidents. The tools to do it are open-source, MCP-compatible, and installable in under five minutes each.
If you’re running an agent setup that feels expensive, slow, or forgetful — and you’re not sure which of the eight levers is the one to pull first — that’s a conversation I’m happy to have.