The question
Quantised open-weights models in the 7–9B range now fit on a MacBook. But which model, at which quantisation, with which sampling parameters, gives the best quality-per-second for a given workload? The literature maps each of those axes in isolation — never all three at once, and rarely on a real consumer machine.
So I ran the evaluation myself: five model families — Llama 3.1 8B, Gemma 4 E4B, GLM-4-9B, Qwen 2.5 7B, and DeepSeek-R1-Distill-Qwen-7B — at Q4 and Q8, against a Claude Opus 4.6 baseline, across 16 context-grounded Q&A items with a full temperature × top-p sweep. Over 1,500 scored responses, each judged blind by two frontier models (Opus and Gemini 2.5).
What fell out
- The frontier–local gap is task-dependent, not uniform. Local models reach 93–96% of frontier quality on extractive Q&A and collapse to 55–75% on chain-of-thought. Aggregate "local vs frontier" numbers hide this structure.
- Qwen 2.5 7B at Q4_K_M displaces Llama 3.1 8B as the Pareto-optimal local choice: 94% of frontier quality at 6.8 seconds per response.
- Mild temperature () beats greedy decoding on factual Q&A for GLM-4-9B and Gemma 4 E4B Q8 at statistical significance. Top-p does nothing measurable.
- Q8 gives no quality benefit over Q4 within the 8B class. Gemma's edge over Llama is architectural, not precision-dependent.
- Reasoning-distilled models misfire on extractive tasks. DeepSeek-R1-Distill produces the lowest extractive accuracy of any local model — it wants to think out loud even when the question is one line.
- Self-preference bias is empirically tiny when the rubric is anchored by a reference answer: +0.04 on extractive completeness, +0.19 on chain-of-thought, validated against an independent Gemini judge.
How I worked
The research spine was my spec-driven-development skill: a software engineering methodology where a structured, often machine-readable, specification serves as the primary source of truth, from which code, tests, and documentation are derived. I paired it with a literature-review skill to surface the four research threads this sits at the intersection of, and the gaps at their junctions.
Claude Code did the execution: the evaluation harness, the parameter-sweep orchestration, the judge-prompt tooling, the figure generation, and a fair amount of draft prose. I did the design, the judgement calls, the interpretation, and the editing. That division felt about right — the interesting work is framing the question and reading the results; the mechanical work is everything in between.
Section 3.7 of the paper is a generalised seven-step recipe for running this evaluation on any consumer hardware. The harness, question set, and full results (both judges' scores) are in the GitHub repo.
Read the paper
Full methodology, results, discussion, and the model-selection guide are in the PDF. Source and raw scores: github.com/joe-southin/local-lm.
Good Enough? Quality, Latency, and Parameter Sensitivity of Quantised 7–9B LLMs on Consumer Hardware
Controlled evaluation of five open-weights model families against a Claude Opus 4.6 baseline. 1,500+ scored responses, two independent judges, full temperature × top-p sweep.
Open PDF →