Model benchmark

How well do models summarize a coding session? The rows below come from users who ran trimwire summarizer benchmark and opted in via trimwire share benchmark. Each row is anonymous and content-free: it carries only coarse buckets (model family + bucket, not the raw tag), and small groups are suppressed (k-anonymity, k=5). No summaries, prompts, paths, or identifiers are collected. See how the data is collected & protected.
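As a rough illustration of the small-group suppression described above, here is a minimal sketch. It assumes each shared run arrives as a (model family, bucket) pair and that a group becomes publishable once it has at least k rows. The function name publishable_groups and the row shape are hypothetical, not trimwire's actual code, and the real pipeline may count distinct users rather than runs.

```python
from collections import Counter

K = 5  # k-anonymity threshold stated in the text


def publishable_groups(rows, k=K):
    """Return only the groups that have at least k shared rows.

    rows: iterable of (model_family, bucket) tuples, one per shared run.
    This row shape is an assumption made for illustration.
    """
    counts = Counter(rows)
    # Groups below the threshold are dropped entirely, not shown with a count.
    return {group: n for group, n in counts.items() if n >= k}


if __name__ == "__main__":
    # Hypothetical group labels, used only to exercise the function.
    sample = [("family-a", "small")] * 5 + [("family-b", "large")] * 4
    print(publishable_groups(sample))
```

In this sketch a group with four rows stays hidden until a fifth arrives, which matches the way the leaderboard is described as filling in over time.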

Both local (ollama) and API/provider models appear here, but they are never ranked head-to-head — use the Local / API filter. API scores are a directional cross-check (different context windows, temperatures, pricing), not equivalent to local-model scores. API rows are labeled by provider route (anthropic/openai/openrouter/…; a single cell that spans routes shows mixed), and partial-corpus runs are tagged partial. (Provider/model call failures are tracked on the wire but not yet published, so there is no reliability column here yet — a run that failed isn’t uploaded.)

The leaderboard fills in over time as each model group crosses the k-anonymity threshold; until then it shows an honest empty state (try the ?demo view to preview the populated layout). Meanwhile, the curated approved-model list stays the authority.

This is a directional sanity check, not an authoritative quality ranking. FCS (faithful-compression score) = retention × compression. It rewards keeping the load-bearing facts and actually shrinking the text, so a verbatim copy and a fact-dropper both score low. (“Compression” here is how much the summary shrank the excerpt, which is distinct from the reduction trimwire does on your request bytes.) But a confident false completion (“tests passed” when none ran) is the most dangerous failure, and FCS can’t see it, so treat any non-zero false-done % as disqualifying regardless of FCS. The curated approved-model list (a blind human read) stays the authority.
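To make the scoring rule above concrete, here is a minimal sketch. It assumes retention is the fraction of load-bearing facts the summary keeps and compression is the fraction by which the summary is shorter than the excerpt. The page states only the product, so both definitions and all function names here are assumptions, not the benchmark's actual implementation.

```python
def compression(excerpt_len: int, summary_len: int) -> float:
    """Fraction by which the summary shrank the excerpt, clamped to [0, 1].

    Assumed definition: 1 - summary_len / excerpt_len.
    """
    if excerpt_len <= 0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - summary_len / excerpt_len))


def fcs(facts_kept: int, facts_total: int,
        excerpt_len: int, summary_len: int) -> float:
    """Faithful-compression score = retention x compression."""
    # Assumed definition: retention is the share of load-bearing facts kept.
    retention = facts_kept / facts_total if facts_total > 0 else 0.0
    return retention * compression(excerpt_len, summary_len)


def is_disqualified(false_done_pct: float) -> bool:
    """Any non-zero false-done rate disqualifies, regardless of FCS."""
    return false_done_pct > 0.0


if __name__ == "__main__":
    print(fcs(10, 10, 1000, 1000))  # verbatim copy: no shrinkage
    print(fcs(0, 10, 1000, 100))    # fact-dropper: nothing retained
    print(fcs(5, 10, 1000, 500))    # half the facts at half the length
```

Under these assumptions a verbatim copy scores zero because nothing was compressed, and a summary that drops every fact scores zero because nothing was retained. The false-done check is kept separate on purpose, since the product cannot detect a confident false completion.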
