
Model benchmark

How well do local models summarize a coding session? The rows below come from users who ran trimwire summarizer benchmark and opted in via trimwire share benchmark. Each row is anonymous and content-free: it records only coarse buckets (model family and size tier, not the raw tag), and small groups are suppressed (k-anonymity). No summaries, prompts, paths, or identifiers are collected. See how the data is collected & protected.
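As a rough illustration of the small-group suppression mentioned above, the sketch below counts submissions per coarse bucket and drops any bucket with fewer than k rows. The threshold K_MIN = 5, the field names family and size_tier, and the function name are all hypothetical; this page does not state the actual threshold or schema trimwire uses.

```python
from collections import Counter

# Hypothetical threshold: the real value is not stated on this page.
K_MIN = 5


def suppress_small_groups(rows, k=K_MIN):
    """Keep only rows whose (family, size_tier) bucket has at least k members."""
    counts = Counter((r["family"], r["size_tier"]) for r in rows)
    return [r for r in rows if counts[(r["family"], r["size_tier"])] >= k]


# Illustrative data only: six rows in one bucket, two rows in another.
rows = (
    [{"family": "family-a", "size_tier": "small"}] * 6
    + [{"family": "family-b", "size_tier": "large"}] * 2
)
published = suppress_small_groups(rows)
print(len(published))  # 6: the two-row bucket falls below k and is dropped
```

Suppressing the whole bucket, rather than publishing a small count, is what keeps a rare model family and size tier from pointing back at a single user.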

This is a directional sanity check, not an authoritative quality ranking. FCS (faithful-compression score) = retention × compression. It rewards keeping the load-bearing facts and actually shrinking the text, so a verbatim copy and a fact-dropper both score low. (“Compression” here is how much the summary shrank the excerpt, which is distinct from the reduction trimwire does on your request bytes.) However, a confident false completion (“tests passed” when none ran) is the most dangerous failure, and FCS can’t see it, so treat any non-zero false-done % as disqualifying regardless of FCS. The curated approved-model list (a blind human read) remains the authority.
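To make the product concrete, here is a minimal sketch of how a score of this shape behaves. It assumes compression is the fraction of characters removed and retention is the fraction of listed facts found by a naive substring match. Both definitions, and every function name, are assumptions for illustration and not trimwire's actual scoring code.

```python
def compression(excerpt: str, summary: str) -> float:
    """Assumed definition: fraction by which the summary shrank the excerpt, in [0, 1]."""
    if not excerpt:
        return 0.0
    return max(0.0, min(1.0, 1.0 - len(summary) / len(excerpt)))


def retention(facts: list[str], summary: str) -> float:
    """Assumed definition: share of load-bearing facts that appear in the summary."""
    if not facts:
        return 0.0
    lowered = summary.lower()
    return sum(1 for fact in facts if fact.lower() in lowered) / len(facts)


def fcs(excerpt: str, summary: str, facts: list[str]) -> float:
    """Faithful-compression score as a plain product: retention x compression."""
    return retention(facts, summary) * compression(excerpt, summary)


def is_disqualified(false_done_pct: float) -> bool:
    """Any non-zero false-done rate disqualifies a model, whatever its FCS."""
    return false_done_pct > 0


excerpt = (
    "Edited parser.py to fix the off-by-one bug. Ran pytest: 3 failed. "
    "Reverted the change to lexer.py after review."
)
facts = ["parser.py", "3 failed"]  # hypothetical load-bearing facts

verbatim_score = fcs(excerpt, excerpt, facts)            # keeps everything, shrinks nothing
dropper_score = fcs(excerpt, "Did some work.", facts)    # shrinks a lot, keeps nothing
good_score = fcs(excerpt, "Fixed parser.py off-by-one; pytest: 3 failed.", facts)
```

Because the score is a product, either factor at zero zeroes the result: the verbatim copy and the fact-dropper both land at 0.0 here, while the faithful short summary scores above zero. The false-done check is kept separate on purpose, since nothing in the product can detect a summary that confidently claims work that never happened.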
