Summarizer (optional)

trimwire is model-free by default. The summarizer is an optional feature that compresses the old part of a long session with a language model. It never runs unless you configure it.

Never load-bearing. Any failure (model down, slow, bad output, missing API key) silently falls back to model-free pruning. The summarizer is strictly additive.

Quick start: run trimwire summarizer setup — an interactive wizard that asks which engine you want, which model, and (for API engines) which API-key environment variable, then writes the config block for you.

Two config concepts

engine — which backend runs the summary: "model-free" (disabled), "local" (local ollama), or a provider id you define (e.g. "anthropic").
style — the HTTP API shape of a provider: "anthropic" (Anthropic Messages API) or "openai" (OpenAI-compatible /v1/chat/completions; covers OpenRouter and others). Only used in [[summarizer.providers]] entries; local and model-free have no style.

Three engines

model-free (default)

No summarizer, no model calls. This is the starting state for every install. The eight deterministic pruning strategies still run.

local

Sends the prunable slice to a local ollama server on your own machine. No API key, no cloud call, no data leaves your machine.

Default model: qwen3.5:4b (~3.4 GB RAM with keep_alive=0). It and the other approved local tags (qwen3.5:4b-q8_0 and qwen3.5:9b, plus qwen3.5:2b as a warned RAM opt-down) have passed the cost-replay gate and the harm gate. The harm gate is an offline benchmark that checks that a model refuses to overstate completed work (“false-done” claims) and retains load-bearing facts from the source slice.

Validated tiers:

Model	RAM	Notes
`qwen3.5:4b`	~3.4 GB	Default — the baseline
`qwen3.5:4b-q8_0`	~4.8 GB	Higher-fidelity quant; fewer false-dones
`qwen3.5:9b`	~6.1 GB	Cleanest on diverse real slices
`qwen3.5:2b`	~2.7 GB	Lighter opt-down; drops more facts (guard warns)

The runtime guard refuses known-bad models (qwen2.5-coder:3b, granite4.1:3b, granite4.1:8b, ministral-3:3b, gemma3:4b, qwen3.5:0.8b) — they hallucinate or overstate completed work.
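The guard's three outcomes can be pictured with a small lookup. This is a minimal sketch, not trimwire's code: the tag lists are copied from this page, while the function name, the outcome labels, and the "unvetted" result for a tag that appears in no list are assumptions made for illustration.

```python
# Tags copied from the lists above. Everything else here is hypothetical.
APPROVED = {"qwen3.5:4b", "qwen3.5:4b-q8_0", "qwen3.5:9b"}
WARNED = {"qwen3.5:2b"}  # allowed, but the guard warns (drops more facts)
REFUSED = {
    "qwen2.5-coder:3b", "granite4.1:3b", "granite4.1:8b",
    "ministral-3:3b", "gemma3:4b", "qwen3.5:0.8b",
}


def classify_local_model(tag: str) -> str:
    """Return 'refuse', 'warn', 'allow', or 'unvetted' for an ollama tag."""
    if tag in REFUSED:
        return "refuse"
    if tag in WARNED:
        return "warn"
    if tag in APPROVED:
        return "allow"
    # Assumption: this page does not say how unlisted tags are treated.
    return "unvetted"


for tag in ("qwen3.5:4b", "qwen3.5:2b", "gemma3:4b"):
    print(tag, "->", classify_local_model(tag))
```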

Cloud API (engine = a provider id)

Sends the prunable slice to a cloud API provider you choose: Anthropic, OpenAI, or any OpenAI-compatible endpoint such as OpenRouter. You supply your own API key via an environment variable. The engine value is your provider’s id (e.g. "anthropic", "openrouter") — not the literal string "api".

See the Privacy section below before enabling this engine.

Manual config

Prefer the wizard: trimwire summarizer setup writes this block for you interactively. To hand-edit, add a [summarizer] block to ~/.config/trimwire.toml (or run trimwire config to open it). The default is engine = "model-free" (no summarizer).

Local engine

[summarizer]
engine = "local"

[summarizer.local]
endpoint = "http://localhost:11434"   # ollama default
model    = "qwen3.5:4b"

Pull the model first:

ollama pull qwen3.5:4b

API engine(s)

Each cloud provider is a [[summarizer.providers]] entry with a short id. The engine (and any fallback entry) references a provider by that id:

[summarizer]
engine = "anthropic"               # a provider id below (or "local" / "model-free")

[[summarizer.providers]]
id          = "anthropic"          # short name; what engine/fallback reference
style       = "anthropic"          # "anthropic" | "openai" (OpenAI-compatible)
base_url    = "https://api.anthropic.com"
model       = "claude-haiku-4-5"
api_key_env = "ANTHROPIC_API_KEY"  # NAME of the env var holding your key (never the key)

You can list several providers and reference each by id. Provider ids must be unique and can’t be local or model-free (reserved).

style picks the wire protocol — anthropic (x-api-key, /v1/messages) or openai (Bearer, /v1/chat/completions). trimwire appends that path to base_url. For a provider whose path isn’t the standard one, set full_url (the exact POST URL); base_url is then ignored and style still selects the auth header + payload shape.
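As a rough illustration of the rule above, the sketch below derives the POST URL and the auth header from style, base_url, and full_url. The function name is hypothetical, stripping a trailing slash from base_url is an assumption, and other headers a provider may require (for example an API version header) are left out.

```python
# Paths and auth schemes as described above; the helper itself is illustrative.
STYLE_PATHS = {
    "anthropic": "/v1/messages",
    "openai": "/v1/chat/completions",
}


def build_request(style, api_key, base_url=None, full_url=None):
    """Return (url, headers) for a provider entry."""
    if style not in STYLE_PATHS:
        raise ValueError(f"unknown style: {style!r}")
    if full_url:
        # full_url is the exact POST URL; base_url is ignored.
        url = full_url
    elif base_url:
        # Assumption: a trailing slash on base_url is tolerated.
        url = base_url.rstrip("/") + STYLE_PATHS[style]
    else:
        raise ValueError("set base_url or full_url")
    # style still selects the auth header, even when full_url is used.
    if style == "anthropic":
        headers = {"x-api-key": api_key}
    else:
        headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


url, headers = build_request("openai", "sk-example", base_url="https://openrouter.ai/api")
print(url)
```

This also shows why a base_url that already ends in /v1 would produce a doubled /v1/v1/... path for a standard provider.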

Giving the summarizer its key — use a key file

trimwire never stores the key itself, only where to find it. There are two sources, but for most people the choice is made for you:

trimwire install runs trimwire as an always-up background service (systemd user service on Linux, launchd agent on macOS). A service does not inherit the exports in your ~/.zshrc/~/.bashrc — so an export ANTHROPIC_API_KEY=… that works in your shell is invisible to the gateway, and the summarizer logs env var … is not set and falls back to model-free pruning. Use a key file.

Recommended — api_key_file. A path to a file whose contents are the key (whitespace trimmed; a leading ~/ expands to $HOME). Read at runtime, so it works both as a service and in foreground trimwire run:

[[summarizer.providers]]
id           = "zai"
style        = "anthropic"
base_url     = "https://api.z.ai/api/anthropic"
model        = "glm-5.2"
api_key_file = "~/.zai_key"        # read at runtime — the daemon-safe way

Create it and lock it down (trimwire doctor warns if it’s group/world-readable):

printf '%s' "your-api-key" > ~/.zai_key && chmod 600 ~/.zai_key

Alternative — api_key_env. The NAME of an environment variable (e.g. ANTHROPIC_API_KEY) read from the gateway’s process environment. Fine for foreground trimwire run, but a background service won’t see it unless you import it into the service environment yourself. When both are set, the env var wins; otherwise trimwire falls back to the file.
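The lookup order described above (env var first, then key file) can be sketched as follows. This is an assumption-laden illustration rather than trimwire's implementation: the function name is made up, and treating an empty env var or an unreadable file as "no key" is a guess.

```python
import os
from pathlib import Path


def resolve_api_key(api_key_env=None, api_key_file=None, environ=None):
    """Return the key string, or None when neither source yields one."""
    env = os.environ if environ is None else environ
    if api_key_env:
        value = env.get(api_key_env)
        if value:  # assumption: an empty variable counts as not set
            return value
    if api_key_file:
        path = Path(api_key_file).expanduser()  # expands a leading ~/
        try:
            key = path.read_text().strip()  # surrounding whitespace trimmed
        except OSError:
            return None
        return key or None
    return None


# A service process without the export falls through to the file.
print(resolve_api_key("ANTHROPIC_API_KEY", None, environ={}))
```

When the function returns None, the behaviour described on this page is a fallback to model-free pruning.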

Provider recipes

These show each provider’s style/base_url/full_url shape with api_key_env for brevity — for a service install, swap in api_key_file (see above); it’s the same key, just read from a file the daemon can reach.

# OpenAI
[[summarizer.providers]]
id = "openai"
style = "openai"
base_url = "https://api.openai.com"
model = "gpt-5.4-mini"
api_key_env = "OPENAI_API_KEY"

# OpenRouter (300+ models behind one OpenAI-compatible endpoint)
[[summarizer.providers]]
id = "openrouter"
style = "openai"
base_url = "https://openrouter.ai/api"      # NOT ".../api/v1" (trimwire adds /v1/...)
model = "minimax/minimax-m3"                # see MODEL-COMPATIBILITY.md for ceilings
api_key_env = "OPENROUTER_API_KEY"

# Anthropic
[[summarizer.providers]]
id = "anthropic"
style = "anthropic"
base_url = "https://api.anthropic.com"
model = "claude-haiku-4-5"
api_key_env = "ANTHROPIC_API_KEY"

# Z.ai — Anthropic-compatible endpoint (simplest)
[[summarizer.providers]]
id = "zai"
style = "anthropic"
base_url = "https://api.z.ai/api/anthropic"
model = "glm-5.2"   # GLM-5.x is reliable; GLM-4.x FAILS @128 KB (see MODEL-COMPATIBILITY.md)
api_key_env = "ZAI_API_KEY"

# Z.ai — OpenAI-compatible endpoint (non-standard /paas/v4 path → use full_url)
[[summarizer.providers]]
id = "zai-openai"
style = "openai"
full_url = "https://api.z.ai/api/paas/v4/chat/completions"
model = "glm-5.2"   # GLM-5.x is reliable; GLM-4.x FAILS @128 KB (see MODEL-COMPATIBILITY.md)
api_key_env = "ZAI_API_KEY"

# Azure OpenAI (deployment URL with api-version → use full_url)
[[summarizer.providers]]
id = "azure"
style = "openai"
full_url = "https://YOUR-RESOURCE.openai.azure.com/openai/deployments/YOUR-DEPLOYMENT/chat/completions?api-version=2024-10-21"
model = "gpt-4o-mini"
api_key_env = "AZURE_OPENAI_API_KEY"

# Self-hosted OpenAI-compatible (vLLM / LM Studio / llama.cpp server)
[[summarizer.providers]]
id = "vllm"
style = "openai"
base_url = "http://localhost:8000"
model = "Qwen/Qwen3-30B"
api_key_env = "VLLM_API_KEY"   # set to any non-empty value if the server ignores it

Fallback

If the primary engine is unavailable, trimwire tries each entry in fallback in order — each is a provider id or "local". model-free is the implicit terminal and never needs to be listed:

[summarizer]
engine   = "local"
fallback = ["anthropic"]    # fall back to the "anthropic" provider if ollama is down
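The chain behaviour can be sketched like this. It is a simplified model under stated assumptions: run_engine is a hypothetical callable standing in for the real engine call, and any exception or empty result is treated as "unavailable".

```python
def summarize_with_fallback(slice_text, engine, fallback, run_engine):
    """Try engine, then each fallback id in order.

    Returns (engine_id, summary). ("model-free", None) means no engine
    produced a summary and deterministic pruning is used instead.
    """
    for engine_id in [engine, *fallback]:
        if engine_id == "model-free":
            break  # implicit terminal: nothing to call
        try:
            summary = run_engine(engine_id, slice_text)
        except Exception:
            continue  # any failure moves on to the next entry
        if summary:
            return engine_id, summary
    return "model-free", None


def fake_engine(engine_id, text):
    # Hypothetical stand-in: pretend the local ollama server is down.
    if engine_id == "local":
        raise ConnectionError("ollama is not reachable")
    return f"[{engine_id}] summary of {len(text)} chars"


print(summarize_with_fallback("old slice", "local", ["anthropic"], fake_engine))
```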

Commands

Command	What it does
`trimwire summarizer setup`	Interactive wizard: configures engine, model, and API key
`trimwire summarizer status`	Show the current summarizer config and whether it is reachable
`trimwire summarizer benchmark [--model <tag|provider-id>]`	Score a local ollama model or a configured API provider for fact retention and compression. API providers require `--yes` to make real (paid) calls; without it, prints a dry-run cost warning. See `BENCHMARK.md`.
`trimwire summarizer probe [--model <tag|provider-id>] [--bytes N] [--runs N] [--yes]`	Slice-ceiling fact gate: plant distinctive facts across a synthetic OLD slice at your `slice_char_budget` (or `--bytes`), summarize it with your configured model (or `--model`), and report fact retention by position. Model summaries are non-deterministic, so use `--runs 10` to see the distribution (pass-rate / p50 / min); PASS requires ALL runs ≥90%. A single run near the gate is a coin flip. API providers need `--yes` (cost scales with `--runs`); `local`/an ollama tag runs locally.

trimwire doctor also reports the summarizer configuration when engine is not model-free.
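To make the probe's pass rule from the table above concrete, here is a small sketch of how per-run retention scores could be aggregated. The function name and the dictionary layout are assumptions; only the 90% gate, the "all runs must pass" rule, and the pass-rate / p50 / min statistics come from this page.

```python
from statistics import median


def probe_verdict(retentions, gate=0.90):
    """Aggregate per-run fact-retention fractions (0.0 to 1.0)."""
    if not retentions:
        raise ValueError("need at least one run")
    passed = [r >= gate for r in retentions]
    return {
        "pass_rate": sum(passed) / len(passed),
        "p50": median(retentions),
        "min": min(retentions),
        # PASS only when every run clears the gate.
        "verdict": "PASS" if all(passed) else "FAIL",
    }


# Hypothetical scores: a good-looking median can still hide a failing run.
print(probe_verdict([1.0, 0.92, 0.83]))
```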

How it works

At a reprune checkpoint (every ~8 messages on a large session), trimwire sends the old prunable slice to the summarizer in the background and caches the result. On later turns, the cached summary is replayed verbatim in place of the old slice — so the prompt cache stays warm after the initial one-time bust.

The accumulator is on by default and helps most on long sessions. Instead of replacing the whole summary on each re-summarization, it appends new frozen delta segments, so older segments stay byte-frozen and the cache only busts on the new delta. Savings are a range that grows with session length: short sessions are a wash, while one long real session (981 turns) reached up to roughly −65% cache-weighted cost vs baseline (observed best case, not a guarantee; the exact figure and method are in the benchmark artifacts).
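The cache benefit of appending rather than replacing can be seen in a toy model. The class and helper below are hypothetical and much simpler than the real accumulator; they only show that an append-only chain leaves the earlier text untouched as a prefix.

```python
class SummaryChain:
    """Append-only chain of frozen summary segments (illustrative only)."""

    def __init__(self) -> None:
        self.segments = []

    def append_delta(self, delta_summary: str) -> None:
        # Earlier segments are never edited, only added to.
        self.segments.append(delta_summary)

    def render(self) -> str:
        return "\n\n".join(self.segments)


def shared_prefix_len(old: str, new: str) -> int:
    """Number of leading characters two strings have in common."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n


chain = SummaryChain()
chain.append_delta("GOAL: migrate the parser. DECIDED: keep the old AST.")
before = chain.render()
chain.append_delta("NEXT: port the tokenizer tests.")
after = chain.render()

# Append-only: the whole earlier text survives as a prefix.
print(shared_prefix_len(before, after) == len(before))

# Replacing the summary wholesale breaks the prefix almost immediately.
replaced = "GOAL/DECIDED/NEXT rewritten as one fresh summary."
print(shared_prefix_len(before, replaced) < len(before))
```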

What “cost” means here. trimwire’s cost figure is cache-weighted tokens — input + 0.1·cache_read + 1.25·cache_creation — because every turn re-sends the whole conversation and Anthropic bills cache reads at ~10% and cache writes at ~125% of the input rate. On a pay-per-token API that maps straight to dollars; on a subscription/Max plan it maps to your 5-hour quota and rate limits (fewer weighted tokens = more turns before you hit the wall). “Savings” is the reduction in that figure vs sending the un-pruned body. It is non-monotonic: pruning that churns the cached prefix can cost more on short sessions — which is why reprune (cache-stable replay) is on by default. The ~−65% above is a long-session best case, not a universal one; short sessions are a wash. See BENCHMARK.md §5.
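The weighting above is easy to reproduce. The sketch below applies the stated formula to made-up token counts; the function names and the sample numbers are illustrative, not measurements.

```python
def cache_weighted_cost(input_tokens, cache_read_tokens, cache_creation_tokens):
    """input + 0.1 * cache_read + 1.25 * cache_creation, in weighted tokens."""
    return input_tokens + 0.1 * cache_read_tokens + 1.25 * cache_creation_tokens


def relative_change(baseline, pruned):
    """Negative means the pruned run was cheaper than the baseline."""
    return (pruned - baseline) / baseline


# Hypothetical single turn: 1,000 fresh input tokens, 50,000 read from
# cache, 2,000 newly written to cache: 1000 + 5000 + 2500 weighted tokens.
turn = cache_weighted_cost(1_000, 50_000, 2_000)
print(turn)

# Hypothetical session totals, chosen only to show the sign convention.
print(relative_change(baseline=10_000, pruned=3_500))
```

Because cache writes are weighted 12.5 times heavier than cache reads, a pruning step that forces the prefix to be rewritten can raise this figure even when it sends fewer tokens.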

The summarizer is built from the original, un-pruned messages (trimwire snapshots the request body before the deterministic strategies run) — so the summary always reflects the real conversation, never elision markers.

By default the summarizer only keeps a summary when it beats model-free pruning of that slice. On tool-output-heavy sessions the deterministic strategies usually win; the model earns its keep on old, reasoning-dense content they cannot compress.

Letting the summary own more of the old content (API engines)

A clean summary is far clearer to the model than lossy elision markers, so on a strong API engine you can let the summary cover a much larger fraction of old content (and prefer it even when it isn’t strictly smaller than model-free):

Knob	Default	What it does
`slice_char_budget`	unset → per-engine	Max serialized bytes of the old slice summarized per segment. When unset, local uses a num_ctx-safe cap (~60 KB from the default `max_num_ctx=25600`; ollama’s KV cache is sized for `num_ctx`, so a bigger slice risks OOM), and an API-only chain uses ~128 KB (cloud models have 100K+ windows). Set an explicit value to override. The value is capped to the local size whenever the local engine is anywhere in the chain.
`accept_ratio`	`1.0`	`1.0` = strict (keep the summary only if it’s smaller than model-free). `>1.0` (e.g. `1.5`, recommended for API) keeps a higher-fidelity summary up to `accept_ratio ×` the model-free size, bounded by a 16 KB absolute growth cap. Keep `1.0` for weak local models.

[summarizer]
engine = "myprovider"      # an API-only chain (no "local")
accept_ratio = 1.5         # prefer the clearer summary over lossy trims
# slice_char_budget = 196608  # optional: override the 128 KB API default

These are opt-in: the local-only defaults and the model-free path are unchanged. The deterministic strategies still run (and remain the full fallback when the summarizer is off or fails) — they simply own less of the old region once the summary covers it.
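The slice_char_budget defaulting rule from the table above can be sketched as below. The constants are approximations of the "~60 KB" and "~128 KB" figures on this page, and the function name and the way the chain is passed in are assumptions.

```python
LOCAL_CAP_BYTES = 60 * 1024     # approximation of the ~60 KB num_ctx-safe cap
API_DEFAULT_BYTES = 128 * 1024  # the ~128 KB API-only default


def effective_slice_budget(chain, explicit=None):
    """chain is the engine followed by any fallback ids, e.g. ["local", "anthropic"]."""
    has_local = "local" in chain
    if explicit is None:
        return LOCAL_CAP_BYTES if has_local else API_DEFAULT_BYTES
    # An explicit value overrides the default, but the local cap still
    # applies whenever the local engine is anywhere in the chain.
    return min(explicit, LOCAL_CAP_BYTES) if has_local else explicit


print(effective_slice_budget(["myprovider"]))
print(effective_slice_budget(["myprovider"], explicit=196608))
print(effective_slice_budget(["local", "myprovider"], explicit=196608))
```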

slice_char_budget has a per-MODEL fidelity ceiling — validate before raising it. The 128 KB API default is the conservative starting point, not a universal guarantee — even 128 KB isn’t safe for every model (the GLM-4.x family fails it at N=10; see below). Don’t raise it without probing. Measured retention ceilings (12 verbatim facts, examples/api_harm):

Model class	Tested / starting `slice_char_budget`	Notes
Unknown / low-tier models	128 KB (the default)	Conservative starting point; not universally safe, so probe before relying on it. The GLM-4.x family (incl. 4.5-Air) FAILS even here at N=10 (MODEL-COMPATIBILITY.md); prefer a validated GLM-5.x or another all-pass model.
GLM-5 / GLM-5-Turbo / GLM-5.1 / GLM-5.2	~768 KB (much more)	Whole GLM-5.x family (incl. the current GLM-5.2) is solid at the N=10-verified 128 KB; the ~512 KB and ~92% at 768 KB points are single-run (N=1), directional ceilings, so verify with `probe … --runs 10` before relying on them. Big coverage win: point the summarizer at a GLM-5-class model and set e.g. `slice_char_budget = 720896`.

There’s a clear capability cliff between the GLM-4.x and GLM-5 generations. Default stays 128 KB (protects weak models); raise it only on a model you’ve gated. To find a configured provider’s ceiling: trimwire summarizer probe --model <provider-id> --bytes 720896 --runs 10 --yes — keep the pass rate high (retention ≥ 90%, no false-done) before raising it.

Summarizing more often does NOT add coverage. Lowering resummarize_after_bytes just splits the same old-content delta into more (smaller) segments — total bytes summarized is unchanged, and it can hit max_summary_segments early on long sessions and collapse the frozen chain. To cover more of a very long session, raise max_summary_segments (not lower the threshold).
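A little arithmetic makes the point. The model below is deliberately simplified and rests on assumptions: each segment is taken to cover exactly resummarize_after_bytes of old content, the chain is taken to collapse once the segment count reaches max_summary_segments, and the threshold values are hypothetical rather than trimwire defaults.

```python
def segment_plan(old_bytes, resummarize_after_bytes, max_summary_segments=128):
    """Return how a fixed amount of old content splits into segments."""
    segments = old_bytes // resummarize_after_bytes
    return {
        "segments": segments,
        "bytes_summarized": segments * resummarize_after_bytes,
        "collapses": segments >= max_summary_segments,
    }


two_mib = 2 * 1024 * 1024
coarse = segment_plan(two_mib, resummarize_after_bytes=32 * 1024)
fine = segment_plan(two_mib, resummarize_after_bytes=8 * 1024)

# Same bytes covered either way; the finer threshold only multiplies segments.
print(coarse)
print(fine)
```

In this toy run the lower threshold covers no extra content but crosses the segment limit, while raising max_summary_segments would let the same chain keep growing.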


Which model can hold how big a slice? See MODEL-COMPATIBILITY.md for our directional pre-tests (e.g. MiniMax-M3 / DeepSeek-V4-pro ≈ 1 MB; GLM-5 ≈ 768 KB; GLM-4.x ≈ 128 KB, although GLM-4.x fails 128 KB at N=10 as noted above). Those are our own small-N measurements, not the live community benchmark.

Limits and long sessions

The summarizer extends how far a session can run before the context fills, and it slows the drift of facts out of the window. It is not unlimited, though, and it’s not a substitute for Claude’s own memory of the conversation.

No runtime fidelity gate — your model choice is the safeguard. At request time a summary is accepted only on a size check (it must be smaller than model-free pruning, within accept_ratio); trimwire does not re-verify the summary’s facts on the live path (a second model pass per request would be too costly). The fact-retention harm gate (examples/api_harm, trimwire summarizer probe) is an offline model-vetting tool, not a runtime guard. So a weak or mis-configured model can produce a lossy summary that still gets used. This is why the summarizer is off by default, why recent turns are always kept verbatim, and why you should vet a model with summarizer probe --runs 10 before trusting it (and keep weak local models at accept_ratio = 1.0).
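The size-only acceptance check can be sketched as follows. It is an interpretation of the accept_ratio description on this page, not the actual code: the function name is hypothetical, and the exact boundary handling (strictly smaller at 1.0, at most the limit above 1.0) is an assumption.

```python
GROWTH_CAP_BYTES = 16 * 1024  # the 16 KB absolute growth cap


def accept_summary(summary_bytes, model_free_bytes, accept_ratio=1.0):
    """Size check only: nothing here inspects the summary's facts."""
    if accept_ratio <= 1.0:
        # Strict mode: keep the summary only if it is smaller.
        return summary_bytes < model_free_bytes
    limit = min(
        model_free_bytes * accept_ratio,
        model_free_bytes + GROWTH_CAP_BYTES,
    )
    return summary_bytes <= limit


print(accept_summary(900, 1_000))             # smaller than model-free
print(accept_summary(1_400, 1_000, 1.5))      # within 1.5x
print(accept_summary(120_000, 100_000, 1.5))  # within 1.5x but over the 16 KB cap
```

A lossy summary that happens to be small passes this check, which is why vetting the model offline matters.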

What it preserves well: verbatim tokens it’s prompted to copy (file paths, error codes, identifiers, decisions) and the GOAL/DECIDED/NEXT thread.

What it loses: dense reasoning (the summary is terse by design), facts buried in the middle of large tool outputs (head/tail trimmed), and, on a very long session, the oldest detail as summary coverage shrinks. The accumulator builds a chain of frozen summary segments. At max_summary_segments (default 128, about 4 MB of old content on an API engine) the chain collapses: it is REPLACED by a single fresh summary of the most recent old turns that fit the budget. The collapse re-summarizes the original message bytes, so it is not a telephone-game summary-of-summaries; the old detail is read fresh, not compressed twice. The older turns that no longer fit the summary window revert to model-free pruning: deterministic stubs you can recover by re-reading the files, never fabrication. Frozen segments themselves replay verbatim. The summarizer never re-compresses its own output, so already-summarized facts don’t drift turn-to-turn.

So on a truly long, multi-day project the model can get less faithful about the earliest work. Honest guidance:

Checkpoint at task boundaries with Claude Code’s /compact (consolidates from the model’s own view), or start a fresh session with a short handoff (“continuing from X; key decisions are Y and Z”) — a fresh session also resets the accumulator chain.
Lean on files, not chat memory: files are always re-readable — ask Claude to re-read the relevant file rather than trust a summary of it. The most recent keep_recent_turns turns are always kept verbatim.
trimwire logs a one-time warning to its terminal when a chain collapse happens — that’s the cue a checkpoint is worth it. The summarizer is a second chance at runway, not a replacement for task decomposition.

Privacy

Local engine

The prunable slice goes to http://localhost:11434 (or your configured endpoint) and stays on your machine. Nothing leaves your machine. No API key is involved.

Cloud API engine

The prunable slice is sent to the provider you configure (e.g. api.anthropic.com, api.openai.com, or your chosen OpenRouter endpoint). This is the old part of your session: it may include older tool outputs, reasoning, and file content from earlier in the conversation.

Key points:

Your key, your provider, your choice. trimwire has no default cloud endpoint. You configure the provider and supply your own API key. The privacy posture is determined by your provider’s data handling policy, not trimwire’s.
The slice is the old part of context (outside the recent keep_recent_turns window), not your current prompt or system prompt.
The summary is returned and cached locally; trimwire does not forward the summary upstream to Anthropic.

If sending older session content to a cloud provider is not acceptable for your use case, use the local engine instead — it is functionally equivalent with no data leaving your machine.