- AI
- tooling
- Claude-Code
- tokens
- context-window
- MCP
Saving Tokens in LLMs: Graphify, Caveman & Context-Mode
The biggest LLM cost isn't the model — it's wasted tokens. Graphify, caveman, and context-mode cut token usage 5–10× in real agentic workflows.
The 200K-token context window was supposed to fix this. Then the 1M window. Then the 2M window. And yet anyone running a real agentic workflow knows the same thing happens every time: forty minutes in, the model starts forgetting what file it was editing, repeating tool calls it already made, and writing summaries that ignore the part you actually asked about.
Bigger windows didn’t make agents cheaper. They made agents lazier with context — and lazier context loses you correctness, latency, and money in roughly that order.
The interesting question isn’t “how big is the window.” It’s “how much of what’s in the window is actually load-bearing.” On most agent runs the answer is depressing: 30–50% raw tool output, 10–20% ceremonial prose, 10% re-reads of files the agent has already seen, 10% stale conversation history. The signal-to-noise ratio collapses and the model starts hallucinating to fill the gaps.
Three open-source tools address this directly: graphify compresses codebase reads 71×, caveman cuts prose output 75%, and context-mode sandboxes tool I/O for a 98% reduction. Used together, they deliver 5–10× total token savings on real agentic workflows without changing the underlying model.
Where the tokens actually go
A few representative offenders, measured on real sessions:
| Source | Typical size | What it should be |
|---|---|---|
| Playwright DOM snapshot | 56 KB | A summary + selector you actually need |
| `npm install` log tail | 45 KB | Last 20 lines, or just the exit code |
| `ls -R` on a mid-size repo | 30–80 KB | A scoped tree, or a `find` with a glob |
| GitHub issue dump (20 issues) | 59 KB | Title + status + 1-line summary |
| Re-read of a 600-line file the agent already saw | 12 KB | Nothing. Use the previous read. |
| “Sure! I’d be happy to help with that…” prose | 50–200 tokens per turn | Drop it. |
None of this is the model’s fault. Each individual decision — read the file, list the directory, run the test — is correct. The bug is architectural: every byte that any tool emits ends up in the conversation forever, with no compression boundary anywhere.
The eight levers
Industry research and benchmarks (Vercel AI SDK evals, Claude prompt-cache reports, LLMLingua paper, agent-trace analyses) converge on roughly the same eight techniques. Most working agent setups use two or three. The teams getting real cost wins use most of them.
| Lever | Mechanism | Typical savings |
|---|---|---|
| Prompt caching | cache_control on stable system prompts and large repo indexes | ~90% on repeat turns |
| Sub-agent handoff | Strong model writes a 2K-token plan; cheap model executes | 50–70% |
| GrepRAG over vector RAG | Lexical search beats embeddings for code; iterative discovery | Avoids ~15% accuracy drop in saturated windows |
| Tool output sandboxing | Run heavy tools in a subprocess; return summary, not stdout | 10K–50K tokens per heavy call |
| On-demand skill loading | Skills load only when the trigger matches | Keeps base prompt < 1K tokens |
| Sliding-window history | Last 5–10 turns raw; older turns compressed to a “context bridge” | 50–70% history weight |
| Structured outputs | XML / JSON tags wrap edits; model stops narrating | 20–30% on output |
| LLMLingua-style compression | Small model prunes low-perplexity tokens from prompts | Up to 20× on long documents |
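The first lever is also the cheapest to try. Below is a minimal sketch using the Anthropic Python SDK, assuming a large, stable system block (tool instructions plus a serialized repo index) that repeats across turns; the file name, model string, and prompt contents are placeholders, not taken from any of the tools below.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for the large, stable blob you otherwise pay for on every turn:
# tool instructions, style guide, a serialized repo map, etc.
REPO_INDEX = open("repo_index.md").read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # any model that supports prompt caching
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a coding agent."},
        {
            "type": "text",
            "text": REPO_INDEX,
            # Marks the prefix up to this block as cacheable; later turns that
            # share the same prefix read it back at a fraction of the input cost.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Where is the retry logic defined?"}],
)

# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens on the repeats; that gap is the ~90% saving.
print(response.usage)
```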
A practical note on the RAG question. For codebases under ~200K tokens, full-context loading is fine and often more accurate than chunking — agentic search with grep plus targeted reads beats vector retrieval almost every time. Above 200K, chunking becomes mandatory, but switch to lexical-first RAG and load “working sets” (the files you actually edit) into context. Vector embeddings have a real “lost in the middle” problem with code that pure grep doesn’t.
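The “agentic search” half of that claim needs very little machinery. Here is a sketch of a lexical-first working-set loader, assuming ripgrep (`rg`) is on the PATH; the function names are mine, not from any library.

```python
import subprocess
from pathlib import Path

def grep_candidates(pattern: str, repo: str, max_files: int = 5) -> list[Path]:
    """Lexical search: files matching the pattern, ranked by match count."""
    out = subprocess.run(
        ["rg", "--count-matches", pattern, repo],
        capture_output=True, text=True,
    ).stdout
    ranked = []
    for line in out.splitlines():  # each line looks like "path/to/file.py:7"
        path, _, count = line.rpartition(":")
        if path:
            ranked.append((int(count), Path(path)))
    return [p for _, p in sorted(ranked, reverse=True)[:max_files]]

def working_set(pattern: str, repo: str) -> str:
    """Load whole candidate files into context instead of embedding chunks."""
    parts = []
    for path in grep_candidates(pattern, repo):
        parts.append(f"=== {path} ===\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts)

# Example: pull only the files that actually mention the symbol under discussion.
# print(working_set(r"fetch_with_retry", "./src"))
```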
The next three sections are the tools I actually use. Each one attacks a different lever from the table.
graphify — the “stop re-reading files” lever
Graphify is a Claude Code skill that takes any folder — code, PDFs, markdown, screenshots, whiteboard photos — and builds a persistent knowledge graph you query instead of re-reading the raw files.
How it works internally is the interesting part. For code, it uses tree-sitter to extract an AST and a call-graph pass for symbols and dependencies. For documents and papers, it asks Claude to extract concepts and relationships. For images, it uses Claude vision — a screenshot of a whiteboard becomes nodes and edges like any other input. All inputs feed into a NetworkX graph, which is partitioned into communities via the Leiden algorithm (graspologic implementation), then serialized to graph.json plus a vis.js HTML visualizer.
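To make the shape concrete, here is a toy sketch of that pipeline, not graphify’s actual code: extracted symbols and concepts become NetworkX nodes and edges, graspologic’s Leiden implementation assigns communities, and the result is written to graph.json. The example nodes, edges, and provenance tags are invented for illustration.

```python
import json
import networkx as nx
from graspologic.partition import leiden  # Leiden community detection

# Stand-ins for graphify's extraction passes: tree-sitter symbols for code,
# Claude-extracted concepts for docs, vision-extracted entities for images.
edges = [
    ("cli.main", "parser.parse_args", {"kind": "calls", "provenance": "EXTRACTED"}),
    ("parser.parse_args", "config.load", {"kind": "calls", "provenance": "EXTRACTED"}),
    ("paper:backoff-strategies", "retry.fetch_with_retry",
     {"kind": "describes", "provenance": "INFERRED"}),
]

graph = nx.Graph()
graph.add_edges_from(edges)

# Partition into communities; each community becomes one summary unit
# (or one wiki article) the agent can read instead of the raw files.
communities = leiden(graph)  # {node_name: community_id}

payload = {
    "nodes": [{"id": n, "community": communities[n]} for n in graph.nodes],
    "edges": [{"source": u, "target": v, **d} for u, v, d in graph.edges(data=True)],
}
with open("graph.json", "w") as f:
    json.dump(payload, f, indent=2)
```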
The headline number is 71.5× fewer tokens per query versus reading the raw files, measured on a mixed corpus of Karpathy’s repos plus five papers and four images. Token reduction scales with corpus size — at 6 files it’s roughly 1×, at 52 files it’s 71×, at hundreds it’s higher.
Three details worth knowing:
- `--watchmode` keeps the graph in sync as files change. Code edits trigger an instant AST-only rebuild (no LLM call); doc and image changes notify you to run `--update`. Useful when you have multiple agents editing in parallel.
- `--wikimode` generates Wikipedia-style markdown articles per community plus an `index.md` entry point. Point any agent at `index.md` and it can navigate the knowledge base by reading files instead of parsing JSON. This is the part that makes graphify agent-native rather than just a visualization tool.
- Edge honesty. Every relationship is tagged `EXTRACTED`, `INFERRED`, or `AMBIGUOUS`. You always know what was found in the source vs. what the LLM guessed. This is rarer than it should be.
The trap: `INFERRED` edges are guesses. Treat them like a junior dev’s reading notes — useful for navigation, not for correctness claims.
caveman — the “stop writing ceremonial prose” lever
Caveman is a Claude Code skill that compresses the agent’s own output prose by roughly 75% with no loss of technical content. The skill itself is about 1.6 KB of markdown; it overrides the model’s default chat register for the rest of the session.
The rules are short. Drop articles (a/an/the). Drop filler (“just”, “really”, “basically”, “actually”). Drop pleasantries (“Sure!”, “Of course”, “Happy to help”). Drop hedging. Fragments are fine. Use short synonyms (“big” not “extensive”, “fix” not “implement a solution for”). Technical terms stay exact. Code blocks, error messages, function names: never abbreviated.
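A constructed before/after at the default `full` level (my example, not one from the skill’s docs):

Before: “Sure! I looked into this and the problem is basically that the config loader doesn’t actually validate the port value before it tries to bind, so we should probably add a validation step first.”

After: “Config loader binds without validating port. Fix: validate first.”

Same technical content, roughly a quarter of the tokens.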
Six intensity levels:
| Level | Behavior |
|---|---|
| `lite` | No filler or hedging; full sentences kept |
| `full` | Drop articles, fragments OK, short synonyms — the default |
| `ultra` | Abbreviate prose words (DB/auth/config/req/res), arrows for causality |
| `wenyan-lite` | Semi-classical Chinese register, drops filler |
| `wenyan-full` | Full 文言文 — 80–90% character reduction |
| `wenyan-ultra` | Maximum classical-Chinese terseness |
The auto-clarity guard is what makes it actually shippable. The skill explicitly drops compression for security warnings, irreversible action confirmations, and multi-step sequences where omitted conjunctions risk a misreading. Code, commits, and PRs are written in normal English regardless of mode. Compression is for prose only, never for instructions where ordering matters.
The trap: even in full mode, fragment-heavy multi-step instructions can introduce ambiguity (“migrate table drop column backup first” — order unclear without conjunctions). The skill catches the obvious cases; you still need to read your own output before pasting it into a destructive workflow.
context-mode — the “stop dumping raw tool output” lever
Context-mode is an MCP server (ELv2 license) that solves the loudest token leak in modern agent setups: every tool call dumping its full stdout into the conversation forever.
The mechanism: PreToolUse hooks intercept calls to shell, file reads, and web-fetch. The work runs in a sandboxed subprocess. Stdout is indexed into a SQLite FTS5 store. The model gets a structured summary plus section titles back in context — not the raw bytes. To retrieve specifics, the model issues a ctx_search query and gets BM25-scored snippets. The headline measurement: 315 KB of raw output collapses to 5.4 KB in context. 98% reduction.
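A stripped-down sketch of that loop, my illustration rather than context-mode’s source: execute out of band, index stdout into SQLite FTS5, return only a summary, and answer later queries with BM25. The size threshold is where you let genuinely small outputs skip the indirection entirely; standard CPython builds ship SQLite with FTS5 enabled.

```python
import sqlite3
import subprocess

PASSTHROUGH_LIMIT = 1_000  # small outputs go straight to context, unindexed

db = sqlite3.connect("ctx.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(source, body)")

def sandboxed_run(cmd: list[str]) -> str:
    """Run a command, index its output, return only a short summary to context."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    out = proc.stdout + proc.stderr
    if len(out) <= PASSTHROUGH_LIMIT:
        return out  # a one-line error should not require a retrieval round trip
    source = " ".join(cmd)
    for chunk in out.split("\n\n"):  # paragraph-sized chunks for BM25 to rank
        db.execute("INSERT INTO chunks VALUES (?, ?)", (source, chunk))
    db.commit()
    lines = out.splitlines()
    return (f"[{source}] exit={proc.returncode}, {len(lines)} lines indexed. "
            f"First line: {lines[0][:120]!r}. Use ctx_search for specifics.")

def ctx_search(query: str, k: int = 5) -> list[str]:
    """BM25-ranked snippets from everything indexed so far."""
    rows = db.execute(
        "SELECT source, snippet(chunks, 1, '[', ']', '…', 20) "
        "FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (query, k),
    ).fetchall()
    return [f"{src}: {snip}" for src, snip in rows]
```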
Six tools ship with the server:
- `ctx_execute` — run a single command, indexed
- `ctx_batch_execute` — run many commands plus search queries in one round trip (the workhorse — replaces 30+ individual tool calls in a typical research phase)
- `ctx_search` — query the indexed knowledge base
- `ctx_fetch_and_index` — pull a URL, convert HTML to markdown, index it
- `ctx_index` — index arbitrary content
- `ctx_execute_file` — execute a script file in the sandbox
The session-continuity layer is the part most people miss. A PreCompact hook persists every file edit, git operation, task, and user decision to the same SQLite store. When the conversation compacts, context-mode doesn’t dump that data back into the prompt — it leaves it in the index and lets BM25 retrieve only what’s relevant for the next turn. The model picks up exactly where it left off, without paying the full-history token cost.
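Sketching that continuity layer on top of the same toy store from the previous block (again an illustration, not the tool’s code): events go into the index as they happen, and after compaction the next turn retrieves only what it needs.

```python
def persist_event(kind: str, detail: str) -> None:
    """PreCompact-style bookkeeping: edits, git ops, and decisions go to the index."""
    db.execute("INSERT INTO chunks VALUES (?, ?)", (f"session:{kind}", detail))
    db.commit()

# During the session:
persist_event("edit", "src/retry.py: added exponential backoff to fetch_with_retry")
persist_event("decision", "user chose SQLite over Postgres for the cache layer")

# After compaction, the model queries instead of replaying the full history:
print(ctx_search("retry backoff", k=3))
```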
The trap: sandboxing is the right default, but a one-line error message that should hit context now requires an extra retrieval step. Tune the matcher so genuinely small outputs pass through directly.
Stack them
Each of these tools attacks a different lever. They compose:
| Tool | Lever | Ballpark savings |
|---|---|---|
| graphify | Re-reads of files / docs / images | 71.5× per query on mixed corpus |
| caveman | Ceremonial prose in agent output | ~75% on prose tokens |
| context-mode | Raw tool output in conversation | ~98% on tool I/O |
There’s no overlap. Run all three together and the savings multiply: tool I/O gets sandboxed before it ever touches context, codebase memory lives in a graph the agent queries on demand, and the prose the model writes is compressed by default. Most teams I work with hit a 5–10× total token reduction without changing the underlying model or workflow.
The next discipline
Type safety became table-stakes once the cost of `undefined is not a function` got too high to ignore. Test coverage became table-stakes once “ship and pray” stopped scaling. Token economy is the next one — once you’ve watched a `pnpm install` log eat 40% of your context window mid-session, you stop trusting setups that don’t have a sandbox boundary.
The teams who internalize this ship faster agents, at lower cost, with fewer “the model forgot” incidents. The tools to do it are open-source, MCP-compatible, and installable in under five minutes each.
If you’re running an agent setup that feels expensive, slow, or forgetful — and you’re not sure which of the eight levers is the one to pull first — that’s a conversation I’m happy to have.