Which coding harness is best?
A harness is the software that wraps a language model and turns it into a working coding agent. We run several of them on the same benchmark with the same model, so each score reflects the harness, not the model behind it.
Leaderboard
No published full benchmark rows are available yet.
| Rank | Harness | Model / effort | Score | Uncached tokens / task | Cost / task |
|---|---|---|---|---|---|
Capability vs. token cost
The ranking is by pass rate, but the harness that scores best can also be the one that spends the most. The chart plots each harness's pass rate against the tokens it spends per task, so you can see the trade-off you are actually making.
How to read it: each dot is one harness. The higher a dot sits, the more of the benchmark's tasks that harness solved; the further left it sits, the fewer tokens it spent per task. The best place to be is the top-left — strong scores without burning through tokens. The amber dot is the current top-ranked harness.
Methodology
Every harness runs the same benchmark with the same model at the same effort setting. The only thing that changes from one row to the next is the harness, so the score reflects the harness and not the model behind it. Each harness runs the benchmark several times, so one lucky or unlucky run doesn't sway its score.
A harness's score is the share of the benchmark's tasks it solved, averaged over several runs, with a ± showing how much those runs varied. Cost and token use sit next to the score as context — a cheaper harness with a similar score may be the smarter pick — but they never change the order.
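As a rough illustration of the scoring described above, here is a minimal Python sketch. The function names are hypothetical, and it assumes the ± is the sample standard deviation across runs; the text above only says the ± shows how much the runs varied, so the real spread measure may differ.

```python
from statistics import mean, stdev


def harness_score(run_pass_rates):
    """Return (average pass rate, spread) for one harness.

    run_pass_rates holds one pass rate per included run, each in [0, 1].
    Assumption: the spread is the sample standard deviation.
    """
    if not run_pass_rates:
        raise ValueError("need at least one included run")
    avg = mean(run_pass_rates)
    # A single run has no measurable variation, so report 0.0.
    spread = stdev(run_pass_rates) if len(run_pass_rates) > 1 else 0.0
    return avg, spread


def rank(scores):
    """Order harness names by average pass rate, highest first.

    scores maps a harness name to its (average, spread) pair.
    Cost and token use are deliberately not part of the sort key.
    """
    return sorted(scores, key=lambda name: scores[name][0], reverse=True)
```

Keeping cost and tokens out of the sort key mirrors the rule that they sit beside the score as context but never change the order.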
A run is included only if it finished and we could tell, task by task, whether each one passed or failed. We throw out runs where the test machine or its setup broke, runs that were stopped partway, and runs where the benchmark never reported a clear pass-or-fail — none of those reflect how good the harness is.
If a harness runs out of time on a task, the timeout stands as its result for that task: the benchmark's automatic checker grades it when there is one, and otherwise it counts as a failure. We don't quietly re-run a task until the number looks better.
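The inclusion and timeout rules can be sketched as a small filter. This is an illustration under assumptions, not our actual pipeline: the `TaskResult` and `Run` records and their field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    timed_out: bool
    # Verdict from the benchmark's automatic checker; None if it gave none.
    checker_passed: Optional[bool]


@dataclass
class Run:
    finished: bool       # False if the run was stopped partway
    infra_failure: bool  # True if the test machine or its setup broke
    tasks: List[TaskResult]


def task_verdict(task):
    """Return True or False for a graded task, or None if the result is unclear."""
    if task.checker_passed is not None:
        return task.checker_passed  # the checker's verdict wins, timeout or not
    if task.timed_out:
        return False  # timed out with no checker verdict: counted as a failure
    return None  # no clear pass-or-fail


def run_pass_rate(run):
    """Return the run's pass rate, or None if the run is excluded."""
    if not run.finished or run.infra_failure or not run.tasks:
        return None
    verdicts = [task_verdict(t) for t in run.tasks]
    if any(v is None for v in verdicts):
        return None  # at least one task never got a clear verdict
    return sum(verdicts) / len(verdicts)
```

A run that returns `None` is dropped entirely instead of being scored as zero, since a broken machine or an interrupted run says nothing about the harness.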
We rank efficiency by uncached tokens per task — the fresh input a harness sends the model plus the output it gets back, divided by the number of tasks. Cache reads are excluded: most of the input is context the model has already seen, which is cheap and reused, so counting it would mostly measure repeated context instead of real work. Each result also shows the full input, cached, and output breakdown.
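The efficiency figure is simple arithmetic, sketched below. The function and parameter names are hypothetical; the formula itself follows the definition above.

```python
def uncached_tokens_per_task(fresh_input_tokens, output_tokens,
                             cache_read_tokens, num_tasks):
    """Fresh input plus output, divided by the number of tasks.

    cache_read_tokens is accepted only to make the exclusion explicit:
    it appears in the per-result breakdown but not in this figure.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return (fresh_input_tokens + output_tokens) / num_tasks
```

With made-up numbers, a run over 10 tasks with 1,000 fresh input tokens, 500 output tokens, and 90,000 cache-read tokens comes to 150 uncached tokens per task. The cache reads, though far larger, do not move the figure.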
How fast a run went stays out of the ranking. Wall-clock speed depends on things that aren't the harness — the machine it ran on, and the model provider's capacity and latency at the time — so it doesn't reflect how good the harness is.