Saving Tokens in LLMs: Graphify, Caveman & Context-Mode

The biggest LLM cost isn't the model — it's wasted tokens. Graphify, caveman, and context-mode cut token usage 5–10× in real agentic workflows.

sky4me

The 200K-token context window was supposed to fix this. Then the 1M window. Then the 2M window. And yet anyone running a real agentic workflow knows the same thing happens every time: forty minutes in, the model starts forgetting what file it was editing, repeating tool calls it already made, and writing summaries that ignore the part you actually asked about.

Bigger windows didn’t make agents cheaper. They made agents lazier with context — and lazy context costs you correctness, latency, and money, in roughly that order.

The interesting question isn’t “how big is the window.” It’s “how much of what’s in the window is actually load-bearing.” On most agent runs the answer is depressing: 30–50% raw tool output, 10–20% ceremonial prose, 10% re-reads of files the agent has already seen, 10% stale conversation history. The signal-to-noise ratio collapses and the model starts hallucinating to fill the gaps.

Three open-source tools address this directly: graphify compresses codebase reads 71×, caveman cuts prose output 75%, and context-mode sandboxes tool I/O for a 98% reduction. Used together, they deliver 5–10× total token savings on real agentic workflows without changing the underlying model.

Where the tokens actually go

A few representative offenders, measured on real sessions:

| Source | Typical size | What it should be |
| --- | --- | --- |
| Playwright DOM snapshot | 56 KB | A summary + the selector you actually need |
| npm install log tail | 45 KB | Last 20 lines, or just the exit code |
| `ls -R` on a mid-size repo | 30–80 KB | A scoped tree, or a `find` with a glob |
| GitHub issue dump (20 issues) | 59 KB | Title + status + 1-line summary |
| Re-read of a 600-line file the agent already saw | 12 KB | Nothing. Use the previous read. |
| "Sure! I’d be happy to help with that…" prose | 50–200 tokens per turn | Drop it. |

None of this is the model’s fault. Each individual decision — read the file, list the directory, run the test — is correct. The bug is architectural: every byte that any tool emits ends up in the conversation forever, with no compression boundary anywhere.

The eight levers

Industry research and benchmarks (Vercel AI SDK evals, Claude prompt-cache reports, LLMLingua paper, agent-trace analyses) converge on roughly the same eight techniques. Most working agent setups use two or three. The teams getting real cost wins use most of them.

| Lever | Mechanism | Typical savings |
| --- | --- | --- |
| Prompt caching | `cache_control` on stable system prompts and large repo indexes | ~90% on repeat turns |
| Grep RAG over vector RAG | Lexical search beats embeddings for code; iterative discovery | Avoids ~15% accuracy drop in saturated windows |
| Sub-agent handoff | Strong model writes a 2K-token plan; cheap model executes | 50–70% |
| Tool output sandboxing | Run heavy tools in a subprocess; return summary, not stdout | 10K–50K tokens per heavy call |
| On-demand skill loading | Skills load only when the trigger matches | Keeps base prompt < 1K tokens |
| Sliding-window history | Last 5–10 turns raw; older turns compressed to a "context bridge" | 50–70% of history weight |
| Structured outputs | XML / JSON tags wrap edits; model stops narrating | 20–30% on output |
| LLMLingua-style compression | Small model prunes low-perplexity tokens from prompts | Up to 20× on long documents |
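The sliding-window lever is the easiest to retrofit. A minimal sketch, assuming turns are plain strings; the summarizer here is a deliberately trivial stand-in (first sentence of each turn), where a real setup would call a cheap model to write the bridge:

```python
# Sliding-window history: keep the last N turns verbatim, collapse older
# turns into one compact "context bridge" line. first_sentence() is a
# placeholder summarizer for illustration only.

def first_sentence(turn: str) -> str:
    return turn.split(". ")[0][:80]

def compress_history(turns: list[str], window: int = 6) -> list[str]:
    if len(turns) <= window:
        return list(turns)
    old, recent = turns[:-window], turns[-window:]
    bridge = "[context bridge] " + " | ".join(first_sentence(t) for t in old)
    return [bridge] + recent

turns = [f"Turn {i}: did step {i}. Details follow." for i in range(10)]
compressed = compress_history(turns, window=6)
# 10 turns collapse to 1 bridge line + 6 raw turns
```

The bridge line is the only thing older turns cost you from then on, which is where the 50–70% history savings comes from.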

A practical note on the RAG question. For codebases under ~200K tokens, full-context loading is fine and often more accurate than chunking — agentic search with grep plus targeted reads beats vector retrieval almost every time. Above 200K, chunking becomes mandatory, but switch to lexical-first RAG and load “working sets” (the files you actually edit) into context. Vector embeddings have a real “lost in the middle” problem with code that pure grep doesn’t.
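A sketch of the lexical-first pattern, using a pure-stdlib stand-in for ripgrep (the file layout and scoring are illustrative; a real agent would grep, rank, then read only the matched files whole):

```python
# Lexical-first retrieval: grep the repo for the query, rank files by hit
# count, and load only the top matches (the "working set") into context.
import re
from pathlib import Path

def working_set(root: str, query: str, top_n: int = 3) -> list[str]:
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    scores = {}
    for path in Path(root).rglob("*.py"):
        hits = len(pattern.findall(path.read_text(errors="ignore")))
        if hits:
            scores[str(path)] = hits
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The agent then reads only the files `working_set` returns, instead of embedding and chunking the whole repo.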

The next three sections are the tools I actually use. Each one attacks a different lever from the table.

graphify — the “stop re-reading files” lever

Graphify is a Claude Code skill that takes any folder — code, PDFs, markdown, screenshots, whiteboard photos — and builds a persistent knowledge graph you query instead of re-reading the raw files.

How it works internally is the interesting part. For code, it uses tree-sitter to extract an AST and a call-graph pass for symbols and dependencies. For documents and papers, it asks Claude to extract concepts and relationships. For images, it uses Claude vision — a screenshot of a whiteboard becomes nodes and edges like any other input. All inputs feed into a NetworkX graph, which is partitioned into communities via the Leiden algorithm (graspologic implementation), then serialized to graph.json plus a vis.js HTML visualizer.
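You can approximate the code path with nothing but the standard library: Python's ast module plays the role of tree-sitter, and the call graph serializes to a graph.json-like structure. This is an illustrative sketch, not graphify's actual code — graphify uses tree-sitter and NetworkX — but the shape of the output is the same:

```python
# Minimal graphify-style pass: parse a module, extract function defs as
# nodes and call sites as edges, serialize to a graph.json-like dict.
# Edges found directly in source are tagged EXTRACTED, mirroring
# graphify's edge-honesty labels.
import ast, json

def build_call_graph(source: str) -> dict:
    tree = ast.parse(source)
    nodes, edges = [], []
    for fn in [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]:
        nodes.append(fn.name)
        for call in ast.walk(fn):
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                edges.append({"from": fn.name, "to": call.func.id,
                              "tag": "EXTRACTED"})
    return {"nodes": nodes, "edges": edges}

src = "def a():\n    b()\n\ndef b():\n    pass\n"
graph = build_call_graph(src)
print(json.dumps(graph))
```

Once the graph exists, a query reads a handful of nodes and edges instead of the raw source — that delta is where the 71.5× comes from.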

The headline number is 71.5× fewer tokens per query versus reading the raw files, measured on a mixed corpus of Karpathy’s repos plus five papers and four images. Token reduction scales with corpus size — at 6 files it’s roughly 1×, at 52 files it’s 71×, at hundreds it’s higher.

Three details worth knowing:

  • --watch mode keeps the graph in sync as files change. Code edits trigger an instant AST-only rebuild (no LLM call); doc and image changes notify you to run --update. Useful when you have multiple agents editing in parallel.
  • --wiki mode generates Wikipedia-style markdown articles per community plus an index.md entry point. Point any agent at index.md and it can navigate the knowledge base by reading files instead of parsing JSON. This is the part that makes graphify agent-native rather than just a visualization tool.
  • Edge honesty. Every relationship is tagged EXTRACTED, INFERRED, or AMBIGUOUS. You always know what was found in the source vs. what the LLM guessed. This is rarer than it should be.

The trap: INFERRED edges are guesses. Treat them like a junior dev’s reading notes — useful for navigation, not for correctness claims.

caveman — the “stop writing ceremonial prose” lever

Caveman is a Claude Code skill that compresses the agent’s own output prose by roughly 75% with no loss of technical content. The skill itself is about 1.6 KB of markdown; it overrides the model’s default chat register for the rest of the session.

The rules are short. Drop articles (a/an/the). Drop filler (“just”, “really”, “basically”, “actually”). Drop pleasantries (“Sure!”, “Of course”, “Happy to help”). Drop hedging. Fragments are fine. Use short synonyms (big not extensive, fix not implement a solution for). Technical terms stay exact. Code blocks, error messages, function names: never abbreviated.
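The mechanics are simple enough to sketch. Assuming the drop-lists from the rules above (illustrative only — this is not the skill's actual implementation, which works via prompt instructions rather than regex):

```python
# Caveman-style prose compression: strip articles, filler, and pleasantry
# openers. Runs on prose strings only, never on code blocks, so technical
# content is untouched.
import re

FILLER = r"\b(just|really|basically|actually)\b"
ARTICLES = r"\b(a|an|the)\b"
PLEASANTRIES = r"^(Sure!|Of course[,.]?|Happy to help[,.]?)\s*"

def caveman(prose: str) -> str:
    out = re.sub(PLEASANTRIES, "", prose, flags=re.IGNORECASE)
    out = re.sub(FILLER, "", out, flags=re.IGNORECASE)
    out = re.sub(ARTICLES, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

caveman("Sure! I just really need to update the config file.")
# → "I need to update config file."
```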

Six intensity levels:

| Level | Behavior |
| --- | --- |
| lite | No filler or hedging; full sentences kept |
| full | Drop articles, fragments OK, short synonyms — the default |
| ultra | Abbreviate prose words (DB/auth/config/req/res), arrows for causality |
| wenyan-lite | Semi-classical Chinese register, drops filler |
| wenyan-full | Full Classical Chinese (文言文) — 80–90% character reduction |
| wenyan-ultra | Maximum classical-Chinese terseness |

The auto-clarity guard is what makes it actually shippable. The skill explicitly drops compression for security warnings, confirmations of irreversible actions, and multi-step sequences where omitted conjunctions risk a misread. Code, commits, and PRs are written in normal English regardless of mode. Compression is for prose only, never for instructions where ordering matters.

The trap: even in full mode, fragment-heavy multi-step instructions can introduce ambiguity (“migrate table drop column backup first” — order unclear without conjunctions). The skill catches the obvious cases; you still need to read your own output before pasting it into a destructive workflow.

context-mode — the “stop dumping raw tool output” lever

Context-mode is an MCP server (ELv2 license) that solves the loudest token leak in modern agent setups: every tool call dumping its full stdout into the conversation forever.

The mechanism: PreToolUse hooks intercept calls to shell, file reads, and web-fetch. The work runs in a sandboxed subprocess. Stdout is indexed into a SQLite FTS5 store. The model gets a structured summary plus section titles back in context — not the raw bytes. To retrieve specifics, the model issues a ctx_search query and gets BM25-scored snippets. The headline measurement: 315 KB of raw output collapses to 5.4 KB in context. 98% reduction.
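The index-then-retrieve loop is easy to reproduce with stdlib sqlite3, since FTS5 ships with Python's bundled SQLite in most builds. A sketch of the pattern — the schema and summarizer are illustrative, not context-mode's actual internals:

```python
# Context-mode-style pattern: index raw tool output into SQLite FTS5,
# return only a short summary to the caller, and retrieve specifics later
# with a BM25-ranked query instead of replaying the raw bytes.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE chunks USING fts5(title, body)")

def index_output(title: str, stdout: str, width: int = 80) -> str:
    db.execute("INSERT INTO chunks VALUES (?, ?)", (title, stdout))
    return f"[indexed: {title}] {stdout[:width]}..."  # summary, not raw bytes

def ctx_search(query: str, k: int = 3) -> list[tuple[str, str]]:
    # bm25() scores better matches lower, so ORDER BY ascending
    return db.execute(
        "SELECT title, snippet(chunks, 1, '[', ']', '…', 8) "
        "FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (query, k)).fetchall()

index_output("npm-install", "added 1284 packages in 32s\nnpm warn deprecated glob@7")
index_output("pytest", "3 passed, 1 failed: test_auth assertion error")
hits = ctx_search("deprecated")
```

Only the one-line summary enters context at call time; the model pays for the full output only on the rare turn where it actually needs a specific line back.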

Six tools ship with the server:

  • ctx_execute — run a single command, indexed
  • ctx_batch_execute — run many commands plus search queries in one round trip (the workhorse — replaces 30+ individual tool calls in a typical research phase)
  • ctx_search — query the indexed knowledge base
  • ctx_fetch_and_index — pull a URL, convert HTML to markdown, index it
  • ctx_index — index arbitrary content
  • ctx_execute_file — execute a script file in the sandbox

The session-continuity layer is the part most people miss. A PreCompact hook persists every file edit, git operation, task, and user decision to the same SQLite store. When the conversation compacts, context-mode doesn’t dump that data back into the prompt — it leaves it in the index and lets BM25 retrieve only what’s relevant for the next turn. The model picks up exactly where it left off, without paying the full-history token cost.

The trap: sandboxing is the right default, but a one-line error message that should hit context now requires an extra retrieval step. Tune the matcher so genuinely small outputs pass through directly.

Stack them

Each of these tools attacks a different lever. They compose:

| Tool | Lever | Ballpark savings |
| --- | --- | --- |
| graphify | Re-reads of files / docs / images | 71.5× per query on a mixed corpus |
| caveman | Ceremonial prose in agent output | ~75% on prose tokens |
| context-mode | Raw tool output in conversation | ~98% on tool I/O |

There’s no overlap. Run all three together and the savings multiply: tool I/O gets sandboxed before it ever touches context, codebase memory lives in a graph the agent queries on demand, and the prose the model writes is compressed by default. Most teams I work with hit a 5–10× total token reduction without changing the underlying model or workflow.

The next discipline

Type safety became table-stakes once the cost of `undefined is not a function` got too high to ignore. Test coverage became table-stakes once “ship and pray” stopped scaling. Token economy is the next one — once you’ve watched a `pnpm install` log eat 40% of your context window mid-session, you stop trusting setups that don’t have a sandbox boundary.

The teams that internalize this ship faster agents, at lower cost, with fewer “the model forgot” incidents. The tools to do it are open-source, MCP-compatible, and installable in under five minutes each.


If you’re running an agent setup that feels expensive, slow, or forgetful — and you’re not sure which of the eight levers is the one to pull first — that’s a conversation I’m happy to have.