Model leaderboard
Which Ollama model behaves most like fable? Models are ranked by success rate, then by verify-pass rate, then by fewer steps. Ground truth is each eval task's verify exit code.
| # | Model | Runs | Success | Verify | Steps | Tool err | $/run | Latency |
|---|---|---|---|---|---|---|---|---|
| 1 | qwen3-coder-next | 1 | 100% | 100% | 3.0 | 0% | $0.0000 | 4.2s |
| 2 | ministral-3:3b | 1 | 0% | 0% | 3.0 | 67% | $0.0000 | 5.1s |
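
The ordering described at the top (success, then verify-pass, then fewer steps) can be sketched as a multi-key sort. This is a minimal illustration, not the leaderboard's actual implementation: the `ModelStats` class, its field names, and the `rank` function are hypothetical, and the rates are written as fractions rather than percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    # Hypothetical record type; field names are assumptions for illustration.
    model: str
    success: float  # fraction of runs that succeeded (1.0 == 100%)
    verify: float   # fraction of runs whose verify step passed
    steps: float    # mean number of steps per run


def rank(rows):
    """Order rows by higher success, then higher verify-pass, then fewer steps."""
    # Negating the two rates makes an ascending sort put larger values first,
    # while steps stays ascending so that fewer steps wins a tie.
    return sorted(rows, key=lambda r: (-r.success, -r.verify, r.steps))


# Values copied from the table above.
rows = [
    ModelStats("ministral-3:3b", success=0.0, verify=0.0, steps=3.0),
    ModelStats("qwen3-coder-next", success=1.0, verify=1.0, steps=3.0),
]

for position, row in enumerate(rank(rows), start=1):
    print(position, row.model)
# prints:
# 1 qwen3-coder-next
# 2 ministral-3:3b
```

Because the key is a tuple, steps only matter when two models tie on both success and verify-pass rate.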