Model benchmark
How well do local models summarize a coding session? Rows below come from users
who ran trimwire summarizer benchmark and opted in via trimwire share benchmark. Each
row is anonymous and content-free: it carries only coarse buckets (model family and
size tier, not the raw tag), and small groups are suppressed (k-anonymity). No summaries,
prompts, paths, or identifiers are collected. See
how the data is collected & protected.
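As a rough illustration of the small-group suppression described above, the following minimal sketch drops any (family, size tier) bucket with fewer than k rows. The threshold `K_MIN = 5`, the field names `family` and `size_tier`, and the function `suppress_small_groups` are all assumptions made for this example; they are not taken from trimwire itself.

```python
from collections import Counter

K_MIN = 5  # assumed threshold; the real value is not stated on this page


def suppress_small_groups(rows, k=K_MIN):
    """Keep only rows whose (family, size_tier) bucket has at least k members."""
    counts = Counter((r["family"], r["size_tier"]) for r in rows)
    return [r for r in rows if counts[(r["family"], r["size_tier"])] >= k]


# Hypothetical shared rows: one bucket of six, one bucket of two.
rows = (
    [{"family": "family-a", "size_tier": "small"}] * 6
    + [{"family": "family-b", "size_tier": "large"}] * 2
)

# The two-row bucket falls below k and is dropped entirely.
kept = suppress_small_groups(rows)
```

Suppressing the whole bucket, rather than trimming it, is what keeps a rare model from being singled out by its own row.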
This is a directional sanity-check, not an authoritative quality ranking.
FCS (faithful-compression score) = retention × compression. It rewards keeping
the load-bearing facts and actually shrinking the text: a verbatim copy and a
fact-dropper both score low. (“Compression” here is how much the summary shrank
the excerpt, which is distinct from the reduction trimwire does on your request bytes.)
But a confident false completion (“tests passed” when none ran) is the most
dangerous failure and FCS can’t see it, so treat any non-zero false-done % as
disqualifying regardless of FCS. The curated approved-model list (a blind human
read) stays the authority.
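To make the scoring rule concrete, here is a minimal sketch of FCS under stated assumptions. This page does not define how retention or compression are measured, so the sketch assumes retention is the fraction of a hand-picked list of key facts that appear in the summary, and compression is the fraction of characters removed. The function names (`retention_ratio`, `compression_ratio`, `fcs`, `is_disqualified`) and the sample strings are hypothetical, not trimwire's implementation.

```python
def retention_ratio(key_facts, summary):
    """Assumed measure: share of key facts found verbatim (case-insensitive)."""
    if not key_facts:
        return 0.0
    text = summary.lower()
    kept = sum(1 for fact in key_facts if fact.lower() in text)
    return kept / len(key_facts)


def compression_ratio(excerpt, summary):
    """Assumed measure: fraction of the excerpt's characters removed (0..1)."""
    if not excerpt:
        return 0.0
    return max(0.0, 1.0 - len(summary) / len(excerpt))


def fcs(retention, compression):
    """FCS = retention x compression, as stated in the text."""
    return retention * compression


def is_disqualified(false_done_pct):
    """Any non-zero false-done percentage disqualifies, regardless of FCS."""
    return false_done_pct > 0


excerpt = (
    "Renamed parse_config to load_config in settings.py, then ran pytest and "
    "2 tests failed in test_settings.py because of a missing default value."
)
facts = ["load_config", "2 tests failed"]

verbatim = excerpt                                   # keeps facts, no shrink
dropper = "Did some work."                           # shrinks, drops the facts
good = "Renamed to load_config; 2 tests failed."     # keeps facts and shrinks

scores = {
    name: fcs(retention_ratio(facts, s), compression_ratio(excerpt, s))
    for name, s in [("verbatim", verbatim), ("dropper", dropper), ("good", good)]
}
```

Under these assumptions the verbatim copy and the fact-dropper both score zero, while the short faithful summary scores above both. The separate `is_disqualified` check reflects that a false completion is invisible to the product and has to be screened on its own.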