Skip to content

CLI Reference

HarnessGym exposes two subcommands:

harnessgym run      # run the iterative attempt → reflect → build → qualify → replay loop
harnessgym compare  # replay plain vs harnessed attempts and write a comparison report

A second entry point, harnessgym-mcp-call, is a thin client for invoking a generated MCP tool directly (the same path as .harnessgym/runtime/mcp_call.py).

Timeouts accept human strings like 45m, 5m, 90s, 2h, or a raw integer number of seconds.


harnessgym run

Run the loop on a task in a workspace.

Task & workspace

Flag Default Meaning
--task PATH task markdown/text file (mutually exclusive with --task-text)
--task-text TEXT inline task text
--workspace DIR . workspace directory for the task
--iterations N 3 maximum iteration count
--run-id ID auto deterministic run id (otherwise timestamp + random suffix)

A relative --task is resolved against the current directory first, then against --workspace.

Timeouts

Flag Default Meaning
--attempt-timeout 45m attempt-phase timeout
--attempt-timeouts comma-separated per-iteration budgets, e.g. 10s,2m; overrides --attempt-timeout by iteration
--build-timeout 20m build-phase timeout
--reflection-timeout = build reflection-phase timeout

Runner

Flag Default Meaning
--runner exec exec, claude, tui-goal, or fake
--codex-bin codex Codex executable (for exec / tui-goal)
--claude-bin claude Claude Code executable (for claude)
--claude-model Claude model alias or full name
--claude-permission-mode bypassPermissions Claude Code permission mode
--claude-max-budget-usd Claude print-mode per-phase spend cap
--claude-extra-arg extra Claude CLI arg (repeatable)

See Runners for details.

Solving & scoring

Flag Default Meaning
--stop-score X stop early when the metric reaches this score
--score-key KEY score metric key read from result.json metrics/objective
--higher-is-better off treat larger scores as better (default: lower is better)
--build-after-solve off after a verified solve, still reflect/build a reusable harness

Optimization & post-attempt scoring

Flag Default Meaning
--optimization-mode off treat post-attempt scores as an open-ended objective; report improvement even without reaching --stop-score
--post-attempt-command CMD JSON-emitting command run after every attempt, even timed-out ones, for independent scoring
--post-attempt-score-key KEY = --score-key metric key read from the post-attempt command output
--post-attempt-timeout 2m timeout for the post-attempt command
--restore-best / --no-restore-best restore on in optimization mode, restore the best independently-scored workspace after the run

In optimization mode, summary.json records the baseline score, best score, relative improvement, and per-iteration harness usage — so a real improvement isn't lost when an attempt times out after mutating the workspace but before writing its result. See How It Works.

Task state

Flag Default Meaning
--task-state continue continue compounds task edits across iterations; reset restores task files before each new iteration while preserving .harnessgym artifacts

Harness depth

Flag Default Meaning
--harness-depth deep deep steers reflection/build toward executable instrumentation and richer multi-tool MCP servers; standard builds smaller focused artifacts

Qualification

Flag Default Meaning
--qualify-artifacts / --no-qualify-artifacts on validate generated artifacts in a fresh copied workspace before promotion
--artifact-repair-attempts N 1 same-session repair builds to try after a failed qualification

See Qualification.

Example

harnessgym run \
  --task task.md --workspace . \
  --iterations 5 --attempt-timeout 5m \
  --runner exec \
  --optimization-mode \
  --score-key best_cycles --stop-score 1 \
  --post-attempt-command "python3 benchmark.py --json --mode final" \
  --post-attempt-score-key best_cycles

The command prints the run id, the run-artifacts directory, the final status, whether it was solved and verified, and — in optimization mode — whether it improved and the best score.


harnessgym compare

Replay plain vs harnessed attempts across copied workspaces and write compare_report.json. Compare runs are attempt-only (no reflect/build).

Required

Flag Meaning
--workspace-template DIR clean workspace template copied for each trial
--task PATH / --task-text TEXT the task (one is required)
--output-dir DIR where trial workspaces and the report are written

Trials & budget

Flag Default Meaning
--artifact-source PATH workspace or .harnessgym directory holding the generated artifacts
--trials N 1 number of plain and harnessed trials
--iterations N 1 iterations per replay trial
--attempt-timeout 5m attempt timeout per replay iteration
--attempt-timeouts comma-separated per-iteration budgets
--task-state reset task-state mode inside each trial (default keeps attempts comparable)
--overwrite off replace existing trial directories

Scoring

Flag Default Meaning
--stop-score X stop a replay early at this score
--score-key KEY score metric key from result.json
--higher-is-better off larger is better
--post-command CMD objective command run in each trial workspace after the attempt
--post-score-key KEY score metric key from the post-command output
--post-timeout 2m timeout for the post-command

Validity guards

Flag Default Meaning
--require-active-harness / --no-require-active-harness on mark harnessed trials invalid unless at least one generated MCP tool activates
--require-harness-tool-use / --no-require-harness-tool-use off mark harnessed trials invalid unless at least one generated MCP tool call is recorded

See Replay & Compare for what makes a comparison trustworthy.

Example

harnessgym compare \
  --workspace-template examples/cpu_attention_autotune_task \
  --task task.md \
  --artifact-source tmp/<run>/.harnessgym \
  --output-dir tmp/cpu_attention_autotune_compare \
  --trials 1 --iterations 5 --attempt-timeout 2m \
  --runner exec \
  --score-key best_cycles --stop-score 1 \
  --post-command "python3 benchmark.py --json --mode final" \
  --post-score-key best_cycles

The command prints the report path and, per group, the trial count, median attempt and cumulative-attempt durations, best post score, and valid/invalid post-score counts.