HarnessGym¶
Run a coding agent on a hard task. Generate the reusable tooling it was missing. Replay the next fresh session with that tooling activated.
The core package has zero third-party runtime dependencies — it ships the
orchestrator, registry, qualification, activation, telemetry, and replay
machinery in pure Python stdlib. Runner backends shell out to whichever agent
CLI you choose (codex or claude); the deterministic fake runner needs no
account at all.
What is HarnessGym?¶
HarnessGym is a framework for iterative agent harness improvement. The
idea is simple: a coding agent is only as good as the tools in its workspace,
so instead of asking it to re-derive the same instrumentation every session,
HarnessGym makes it build that instrumentation once and carry it forward.
Each iteration is a controlled five-phase loop:
- Attempt. A fresh runner session works the primary task — Codex, Claude Code, or the offline fake runner.
- Reflect. The same session names the single highest-leverage tool it was missing: a verifier, an analyzer, an MCP server, a benchmark helper, a fixture, or a skill.
- Build. That one artifact is created under .harnessgym/, with its own tests.
- Qualify. The artifact is activated in a clean copied workspace and self-tested. Anything that does not activate cleanly is quarantined and hidden from future attempts.
- Replay. The next fresh session starts with the promoted registry — generated skills symlinked in, generated MCP servers wired into the runner's config — and the accumulated tooling is active from the first token.
The source of truth stays repo-local under .harnessgym/. Each attempt starts
fresh, but the harness travels from iteration to iteration.
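The bookkeeping behind that loop can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: `Artifact`, `Registry`, and `run_iterations` are hypothetical names, the attempt, reflect, and build phases are collapsed into one `build` callback, and none of this is HarnessGym's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical names throughout: this models the loop's bookkeeping only,
# not HarnessGym's real classes or functions.


@dataclass
class Artifact:
    name: str
    kind: str  # e.g. "skill" or "mcp"


@dataclass
class Registry:
    promoted: List[Artifact] = field(default_factory=list)
    quarantined: List[Artifact] = field(default_factory=list)

    def visible_names(self) -> List[str]:
        # Only promoted artifacts are shown to the next fresh session.
        return [a.name for a in self.promoted]


def run_iterations(
    iterations: int,
    build: Callable[[int, List[str]], Artifact],
    qualify: Callable[[Artifact], bool],
) -> Registry:
    registry = Registry()
    for i in range(iterations):
        # Attempt + Reflect + Build: a fresh session sees the promoted
        # tooling and produces exactly one new artifact.
        artifact = build(i, registry.visible_names())
        # Qualify: promote on a clean activation, otherwise quarantine.
        if qualify(artifact):
            registry.promoted.append(artifact)
        else:
            registry.quarantined.append(artifact)
        # Replay: the next pass of the loop starts from registry.promoted.
    return registry
```

The point of the sketch is the asymmetry: a quarantined artifact stays recorded, but it never appears in what the next session sees.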
One iteration, start to finish¶
Here is an optimization run on a CPU kernel task. HarnessGym runs Codex on the benchmark, lets it build a harness, qualifies that harness, and replays:
harnessgym run \
--task task.md \
--workspace . \
--iterations 5 \
--attempt-timeout 5m \
--runner exec \
--optimization-mode \
--score-key best_cycles \
--stop-score 1 \
--post-attempt-command "python3 benchmark.py --json --mode final" \
--post-attempt-score-key best_cycles
Every attempt writes a machine-readable result that HarnessGym reads to decide what happens next:
{
  "status": "solved | blocked | incomplete | tooling_built | failed",
  "verified": true,
  "summary": "short description of what happened",
  "reflection": {
    "selected_improvement": {
      "kind": "skill | mcp | tool | verifier | fixture | docs | script | test",
      "name": "cpu-attention-autotune",
      "reason": "why this is the highest-leverage next addition",
      "target_path": ".harnessgym/mcp/cpu_attention_autotune/"
    }
  },
  "verification": {
    "status": "passed",
    "tooling_tests": [
      { "name": "mcp self-test", "status": "passed", "command": "python3 ..." }
    ]
  },
  "metrics": { "best_cycles": 130223, "score": 130223 }
}
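A consumer of that result might look roughly like the following. It is a minimal sketch, assuming the field names shown in the result above; the function names `parse_attempt_result`, `selected_improvement`, and `score` are hypothetical and are not part of HarnessGym's API.

```python
import json
from typing import Any, Dict, Optional

# Status values copied from the result sketch above.
KNOWN_STATUSES = {"solved", "blocked", "incomplete", "tooling_built", "failed"}


def parse_attempt_result(text: str) -> Dict[str, Any]:
    """Parse one attempt result and reject unknown statuses."""
    result = json.loads(text)
    status = result.get("status")
    if status not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return result


def selected_improvement(result: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return reflection.selected_improvement, or None if it is absent."""
    return (result.get("reflection") or {}).get("selected_improvement")


def score(result: Dict[str, Any], score_key: str) -> Optional[float]:
    """Look up the metric named by score_key (for example "best_cycles")."""
    value = (result.get("metrics") or {}).get(score_key)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)
```

Treating a missing or non-numeric metric as `None` rather than zero keeps an attempt without a score from being mistaken for a very good one.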
--optimization-mode and --post-attempt-command mean HarnessGym scores the
workspace itself after every attempt — even an attempt that was killed by the
timeout before it could write its own result. The best independently-verified
checkpoint is preserved and restored at the end of the run.
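Choosing the checkpoint to restore could be sketched as below. The assumptions are loud ones: lower scores are treated as better (as with cycle counts), a timed-out attempt is represented by whatever score the post-attempt command produced, and `Checkpoint` and `best_checkpoint` are hypothetical names rather than HarnessGym internals.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Checkpoint:
    attempt: int
    score: Optional[float]  # None: the post-attempt command produced no score
    verified: bool          # True only if scored independently of the agent


def best_checkpoint(checkpoints: Iterable[Checkpoint]) -> Optional[Checkpoint]:
    """Pick the lowest-scoring verified checkpoint; earliest attempt wins ties."""
    candidates = [c for c in checkpoints if c.verified and c.score is not None]
    return min(candidates, key=lambda c: (c.score, c.attempt), default=None)
```

Unverified or unscored checkpoints are filtered out before comparison, so a self-reported number never displaces an independently measured one.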
How the loop works, phase by phase →
Run it from any runner¶
The same loop drives three backends. Pick one with --runner:
Codex: uses codex exec for attempts and codex exec resume <session_id> for the
reflect/build phases. Generated MCP servers are launched through a
telemetry proxy that preserves Content-Length framing.
Claude Code: uses claude -p --output-format json and claude -p --resume. Generated
MCP servers are wired in through a repo-local MCP config and a stdio
framing bridge.
Fake runner: deterministic, no model access, no network. The quickest way to watch the full attempt → reflect → build → replay loop end to end:
harnessgym run \
--task examples/numerical_debug_task/task.md \
--workspace examples/numerical_debug_task \
--iterations 2 --attempt-timeout 10s --build-timeout 10s \
--runner fake
The orchestrator does not care which runner produced the work. Attempt, reflect, build, qualify, activate, replay — the same machinery runs for all of them.
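For readers unfamiliar with the Content-Length framing mentioned for the Codex backend, here is a minimal sketch of header-framed JSON messages over a byte stream, in the style of LSP-like stdio protocols. It assumes that is the framing the proxy preserves; `write_message` and `read_message` are illustrative helpers, not HarnessGym's proxy code.

```python
import io
import json
from typing import Any, BinaryIO, Optional


def write_message(stream: BinaryIO, payload: Any) -> None:
    """Write one JSON message preceded by a Content-Length header."""
    body = json.dumps(payload).encode("utf-8")
    stream.write(f"Content-Length: {len(body)}\r\n\r\n".encode("ascii"))
    stream.write(body)


def read_message(stream: BinaryIO) -> Optional[Any]:
    """Read one framed message, or return None at end of stream."""
    length = None
    while True:
        line = stream.readline()
        if not line:
            return None  # EOF before a complete header block
        if line in (b"\r\n", b"\n"):
            break  # a blank line ends the headers
        name, _, value = line.decode("ascii").partition(":")
        if name.strip().lower() == "content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    body = stream.read(length)
    if len(body) != length:
        raise ValueError("truncated message body")
    return json.loads(body.decode("utf-8"))
```

A proxy that logs calls has to re-emit exactly `length` bytes per message; counting characters instead of encoded bytes is the usual way such a bridge breaks on non-ASCII payloads.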
Results¶
These are real validation runs preserved in the repository. They are useful engineering evidence — end-to-end proof that a generated, qualified, activated harness changes what a fresh session reaches — not statistically powered benchmark claims.
| Task | Harness artifact | Observed result |
|---|---|---|
| Tensor layout pipeline | Skill + focused-search MCP | final score 33,975,173 → 1,495,982 cycles (95.6% lower) |
| CPU attention autotune | Config validation, scoring, search, rollback tools | harnessed replay 87.09% lower held-out score, ~5× less attempt time |
| C flash attention | 7-tool MCP with assembly + sweep + ranking | harnessed replay 169,005 vs plain 189,498 cycles, equal budget |
| C++ stencil | MCP grown 10 → 15 active tools | harnessed replay 34,934 vs plain 43,200 cycles |
| H100 Triton RMSNorm | Remote health checks, source sweeps, ranking | 150.0 µs → 103.3 µs; follow-up expanded MCP to 17 active tools |
| Paged attention decode | Skill + MCP search tools | harnessed best_ms 1.3588 vs plain 1.6759 |
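
The "% lower" figures are relative reductions against the baseline. A small helper makes the arithmetic explicit; `percent_lower` is a hypothetical name used only for illustration, checked here against the tensor-layout row.

```python
def percent_lower(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` against `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Tensor layout pipeline row: 33,975,173 -> 1,495,982 cycles.
tensor_layout = round(percent_lower(33_975_173, 1_495_982), 1)  # 95.6
```
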
Full numbers and methodology →
Also: qualification and telemetry, not vibes¶
A generated tool that looks helpful but silently fails to activate is worse than no tool. HarnessGym refuses to count one:
- Fresh-workspace qualification. After each build, the pre-run task workspace is copied, only the .harnessgym/ bundle is copied in, generated skills/MCPs are activated there, and MCP self-tests run. Failures go back to the same session for up to --artifact-repair-attempts repair builds, then are quarantined — kept on disk for evidence, hidden from future attempts.
- Real tool-call telemetry. Every generated MCP tools/call is logged to .harnessgym/mcp_calls.jsonl with server name, tool name, argument-key summary, duration, status, and result size. Compare reports surface call counts and called tool names, so a harnessed win can require actual tool use, not mere activation.
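A report over that call log could be built along these lines. The sketch assumes one JSON object per line and uses hypothetical key names `server` and `tool`; the real field names in mcp_calls.jsonl may differ, and `summarize_calls` and `harness_was_used` are illustrative helpers only.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_calls(lines: Iterable[str]) -> Counter:
    """Count calls per (server, tool) pair from JSONL telemetry lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        # "server" and "tool" are assumed key names, not confirmed ones.
        counts[(record["server"], record["tool"])] += 1
    return counts


def harness_was_used(counts: Counter, min_calls: int = 1) -> bool:
    """True if the session made at least min_calls generated-tool calls."""
    return sum(counts.values()) >= min_calls
```

To try it on a real run, pass an open file handle: `summarize_calls(open(".harnessgym/mcp_calls.jsonl"))`, adjusting the key names to whatever the log actually contains.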
What HarnessGym is — and isn't¶
Is: a controlled loop for generating, qualifying, and replaying reusable agent infrastructure — skills and stdio MCP servers — and measuring whether they actually help.
Is: runner-agnostic. Codex and Claude Code are first-class; a deterministic fake runner covers tests and demos with no account.
Isn't: an agent or a model. HarnessGym never writes the kernel for you — it orchestrates the agent you bring and keeps the tools it generates.
Isn't: a benchmark leaderboard. The bundled examples are evidence that the mechanism works, run on one machine; repeat runs vary.
The philosophy behind the loop →
Start here¶
- Getting started — mental model + the offline demo
- How it works — the five phases in detail
- CLI reference — every run and compare flag
- Qualification — why a broken harness can't win
- Results — H100, CPU kernel, and tensor-layout runs
- Experiments — preserved validation notes