Performance

trimwire optimizes for clean context — window headroom, not your bill. Two kinds of numbers below: live per-mode reductions from real claude -p sessions (what the home-page inspector shows), and byte-exact offline replay of deterministic /v1/messages bodies through the real strategy code. Byte figures are exact and reproducible; cost/token figures are estimates; timing is host-dependent.

0–99% lighter

Request size shrinks by session shape — nothing when there’s no redundancy, a lot on long tool-heavy sessions. trimwire never invents savings.

Headroom, not money

The reliable win is context-window headroom. Net cost is non-monotonic under prompt caching: a wash-to-slight-loss on short sessions, ≈ −55% at 256 turns. We report headroom and refuse to quote a dollar figure.

Sub-2 ms overhead

Pruning runs off the network critical path; the streamed response is forwarded byte-for-byte, unbuffered.

Safe on every corpus

Orphan-free (never splits a tool call↔result pair), system and tools[] untouched on the default path, validated across the corpus set plus a 3,000-body fuzz.

Live per-mode reduction (real `claude -p` sessions)

The interactive inspector on the home page shows these numbers. They are live, not replay: a real Claude Code (claude -p, Haiku) session driven through the actual gateway, reduction read from the preserved in=/sent= wire logs. Each mode was measured on the same two ~800 KB workloads so they’re directly comparable. The reduction depends on the session shape, so the inspector lets you pick one:

File-heavy — a session dominated by file reads / tool output (~810 KB):

Mode	largest request	sustained median
gentle	35%	24%
default (model-free)	73%	41%
summarizer (GLM-5.2)	73%	51%

Here the model-free passes already remove the bulk (re-readable file content is stubbed), so default leads and the summarizer ties on size — the summarizer’s extra value is fidelity: the agent keeps a summary of the old turns instead of losing them.

Chat-heavy — a long session that’s mostly discussion, with no tool reads (~790 KB):

Mode	largest request	sustained median
default (model-free)	15%	11%
gentle	21%	4%
summarizer (GLM-5.2)	78%	39%

Here there’s almost nothing for the model-free passes to elide (no tool output), so default caps at ~15% — while the summarizer folds the discussion itself and is the only mode that materially shrinks the request. A stronger summarizer model with a larger slice budget compresses more: a small local qwen3.5:4b reaches ~31% on the same session; GLM-5.2 reaches ~78%.

Byte-level methodology (offline replay)

The live numbers above tell you what happens on a real session; the offline replay tells you exactly what the strategies do to the bytes, deterministically. We build synthetic-but-realistic /v1/messages bodies — known session shapes (tool-heavy, chat-heavy, mixed) at known sizes — and run them through the real strategy code, the same module the gateway calls in production. Because the input is fixed, the output is byte-exact and reproducible: the same body in always yields the same body out, so the reported reduction is a measured fact, not an estimate. This is also where the safety invariants are checked — orphan-free (never splitting a tool-call↔result pair), system and tools[] untouched on the default path — across the full corpus set plus a 3,000-body fuzz. The figures and the harness live in benchmark/results/RESULTS.md.

Why “headroom, not dollars”

Prompt caching makes net cost non-monotonic: pruning can bust a cached prefix and cost more on a short session, while saving substantially on a long one. Quoting a single “% saved on your bill” would be dishonest. What trimwire reliably does is keep the request focused — the share that is the recent window you’re actually working on — and low-redundancy — less repeated tool output crowding the model. On a redundancy-heavy corpus, focus climbs (e.g. 41.6% → 67.2%) purely by shedding dead weight.

See it yourself

trimwire preview <session.jsonl> — replay the strategies over one of your own recorded transcripts, read-only, no network, and see exactly what would trim.
The full methodology, every corpus, and the cost model live in the repo: benchmark/results/RESULTS.md.