Getting Started¶
The fastest way to understand HarnessGym is to:
- run the offline demo and watch the loop,
- read the result it produces,
- then point it at a real task with a real runner.
Mental model¶
HarnessGym runs an iterative loop. Each iteration is five phases:
- The attempt is a fresh agent session working your task.
- Reflect and build resume that same session to generate one reusable
tool under
.harnessgym/. - Qualify copies that tool into a clean workspace and self-tests it; broken tools are quarantined.
- Replay activates the surviving tools for the next fresh session.
The source of truth is .harnessgym/ in your workspace. Attempts come and go;
the harness accumulates. See How It Works for the full
walkthrough.
The offline demo¶
This needs no agent account — the deterministic fake runner exercises the
whole loop:
harnessgym run \
--task examples/numerical_debug_task/task.md \
--workspace examples/numerical_debug_task \
--iterations 2 \
--attempt-timeout 10s \
--build-timeout 10s \
--runner fake
The fake runner intentionally blocks the first attempt, creates
.harnessgym/tools/harnessgym_fake_probe.py, updates the registry, starts a
fresh second attempt with that registry context, applies the known fix to
kernel_ops.py, runs python verifier.py, and records the verified result.
When it finishes, look at what it wrote:
cat examples/numerical_debug_task/.harnessgym/registry.json
ls examples/numerical_debug_task/.harnessgym/runs/*/iterations/*/
Reading a result¶
Every attempt writes result.json. The shape HarnessGym expects is:
{
"status": "solved | blocked | incomplete | tooling_built | failed",
"verified": true,
"summary": "what happened",
"reflection": {
"selected_improvement": {
"kind": "skill | mcp | tool | verifier | fixture | docs | script | test",
"name": "short name",
"reason": "why this is the highest-leverage next addition",
"target_path": ".harnessgym/<area>/..."
}
},
"verification": { "status": "passed | failed | not_run", "tooling_tests": [] },
"metrics": { "score": 0 }
}
HarnessGym reads status / verified to decide whether the task is solved,
reflection.selected_improvement to know what to build, and metrics to track
optimization progress.
A real run¶
Point HarnessGym at your own task and a real runner. The minimum is a task file and a workspace:
harnessgym run \
--task task.md \
--workspace . \
--iterations 3 \
--attempt-timeout 45m \
--build-timeout 20m \
--runner exec
For an open-ended optimization task, add objective scoring so HarnessGym can checkpoint the best workspace itself — even if an attempt is killed by the timeout mid-edit:
harnessgym run \
--task task.md \
--workspace . \
--iterations 5 \
--attempt-timeout 5m \
--runner exec \
--optimization-mode \
--score-key best_cycles \
--stop-score 1 \
--post-attempt-command "python3 benchmark.py --json --mode final" \
--post-attempt-score-key best_cycles
--stop-score 1 is deliberately unreachable here, so all five iterations run
and the harness keeps maturing.
Writing a good task file¶
A task file is just markdown. The examples that work best with HarnessGym share a few traits:
- An objective command. "Run
python3 benchmark.py --jsonand reducebest_cycles" gives the agent — and HarnessGym's--post-attempt-command— something concrete to score. - Clear edit boundaries. "Edit
kernel.conly; don't changebenchmark.pyor the tolerances" keeps comparisons honest. - Fast vs final modes. A quick dev mode for iteration and a held-out final mode for claiming a robust improvement.
- A nudge toward harness-building. "If building harness improvements, prefer a skill plus an MCP server exposing validation, analysis, search, and rollback tools."
The bundled examples/tensor_layout_pipeline_task/task.md is a good template.
Choosing task state¶
# Compound task edits across iterations (default for `run`).
harnessgym run --task task.md --workspace . --task-state continue
# Restore task files before each new iteration, keeping .harnessgym artifacts.
harnessgym run --task task.md --workspace . --task-state reset
Use continue to keep improving the same working tree. Use reset when you
want each fresh session to face the original task with only the generated
harness carried forward — which is what makes a clean before/after comparison.
What to read next¶
- How It Works — the five phases in detail.
- CLI Reference — every flag for
runandcompare. - Runners — Codex, Claude Code, fake.
- Replay & Compare — measure plain vs harnessed.
- Examples — real tasks you can run today.