Tensor Layout Qualification Experiment - 2026-05-19

This note records the real Codex experiment used to validate the HarnessGym artifact qualification work from commit 59ebaa1.

Objective

Prove that HarnessGym can:

  • run a hard optimization task with real codex exec sessions;
  • build reusable repo-local harness artifacts, including an MCP server;
  • qualify those artifacts in a fresh copied workspace before reuse;
  • replay the task with and without the generated harness under the same attempt budget; and
  • show a meaningful improvement when the generated harness is available.

The task was examples/tensor_layout_pipeline_task, a synthetic tensor-layout pipeline optimization problem scored by benchmark.py --json --mode final. Lower best_cycles is better.
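As a rough illustration of how a score might be read from the benchmark output, here is a minimal sketch. It assumes the --json output is a single JSON object with a top-level best_cycles field; that shape is inferred from the --score-key flag and is not confirmed by this note. The helper names extract_score and is_improvement are hypothetical.

```python
import json


def extract_score(stdout: str, key: str = "best_cycles") -> int:
    """Parse benchmark stdout and return the score stored under `key`.

    Assumes the benchmark prints one JSON object with a top-level
    integer field named by `key` (an assumption, not a documented format).
    """
    payload = json.loads(stdout)
    return int(payload[key])


def is_improvement(candidate: int, incumbent: int) -> bool:
    """Lower best_cycles is better, so an improvement is a strictly smaller value."""
    return candidate < incumbent


# Hypothetical output shape, using the final score reported in this note.
sample_stdout = '{"best_cycles": 1495982}'
score = extract_score(sample_stdout)
print(score, is_improvement(score, 33_975_173))
```

Keeping the "lower is better" comparison in one place avoids accidentally maximizing the score elsewhere in a loop that tracks the best attempt.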

Generation Run

Workspace:

tmp/tensor_layout_qual_exp_20260518232349

Run artifacts:

tmp/tensor_layout_qual_exp_20260518232349/.harnessgym/runs/20260519T032349Z-c1ffd293

Command:

PYTHONPATH=src python3 -m harnessgym.cli run \
  --task task.md \
  --workspace tmp/tensor_layout_qual_exp_20260518232349 \
  --iterations 3 \
  --attempt-timeout 5m \
  --reflection-timeout 3m \
  --build-timeout 5m \
  --runner exec \
  --optimization-mode \
  --score-key best_cycles \
  --stop-score 1 \
  --task-state continue \
  --harness-depth deep \
  --post-attempt-command "python3 benchmark.py --json --mode final" \
  --post-attempt-score-key best_cycles \
  --post-attempt-timeout 2m \
  --artifact-repair-attempts 1

Results:

| Metric | Value |
| Baseline final score | 33,975,173 |
| Iteration 2 score with active MCP tools | 1,526,080 |
| Final best score | 1,495,982 |
| Final relative cycle reduction | 95.5968% |
| Final cycle ratio | 22.71x lower |
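The two derived metrics follow directly from the baseline and final scores. The short check below recomputes them from the values reported above; the variable names are illustrative only.

```python
# Scores reported in this note (best_cycles, lower is better).
baseline = 33_975_173
final = 1_495_982

# Relative reduction: fraction of baseline cycles eliminated, as a percentage.
reduction_pct = (baseline - final) / baseline * 100

# Cycle ratio: how many times fewer cycles the final plan needs.
ratio = baseline / final

print(f"{reduction_pct:.4f}%")  # 95.5968%
print(f"{ratio:.2f}x")          # 22.71x
```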

Generated harness artifacts included:

  • .harnessgym/skills/tensor-plan-optimizer/SKILL.md
  • .harnessgym/mcp/tensor-plan-server/tensor_plan_server.py
  • .harnessgym/mcp/tensor-plan-server/harnessgym-mcp.json
  • .harnessgym/fixtures/tensor_plan_attempt1_best.json
  • .harnessgym/tests/test_tensor_plan_mcp.py

The first generated MCP server qualified in a fresh copied workspace with 1 active MCP and 9 active tools. Later iterations extended the same MCP to 15 active tools.
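The idea behind fresh-copy qualification can be sketched as follows. This is not HarnessGym's implementation; it is a minimal illustration that assumes qualification means copying a clean workspace template and the generated .harnessgym directory into a throwaway location, then running some check command there. The function name qualify_in_fresh_copy and the check command are hypothetical.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def qualify_in_fresh_copy(template: Path, artifacts: Path, check_cmd: Sequence[str]) -> bool:
    """Run `check_cmd` inside a disposable copy of `template` plus `artifacts`.

    The source workspace is never touched, so mutable task state left over
    from earlier attempts cannot influence the result.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        shutil.copytree(template, workspace)
        # Overlay the generated harness artifacts onto the clean copy.
        shutil.copytree(artifacts, workspace / ".harnessgym", dirs_exist_ok=True)
        result = subprocess.run(
            list(check_cmd), cwd=workspace, capture_output=True, text=True
        )
        return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical invocation; the paths and self-test command are assumptions.
    ok = qualify_in_fresh_copy(
        Path("examples/tensor_layout_pipeline_task"),
        Path("tmp/tensor_layout_qual_exp_20260518232349/.harnessgym"),
        [sys.executable, ".harnessgym/tests/test_tensor_plan_mcp.py"],
    )
    print("qualified" if ok else "rejected")
```

Running the check against a copy rather than the source workspace is what lets qualification catch artifacts that only work because of leftover state.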

A/B Compare Run

Compare output:

tmp/tensor_layout_qual_compare_20260519002004

Report:

tmp/tensor_layout_qual_compare_20260519002004/compare_report.json

Command:

PYTHONPATH=src python3 -m harnessgym.cli compare \
  --workspace-template examples/tensor_layout_pipeline_task \
  --task task.md \
  --artifact-source tmp/tensor_layout_qual_exp_20260518232349/.harnessgym \
  --output-dir tmp/tensor_layout_qual_compare_20260519002004 \
  --trials 1 \
  --iterations 1 \
  --attempt-timeout 5m \
  --runner exec \
  --score-key best_cycles \
  --stop-score 1 \
  --task-state continue \
  --post-command "python3 benchmark.py --json --mode final" \
  --post-score-key best_cycles \
  --post-timeout 2m \
  --overwrite

Both groups received one 5-minute attempt. The harnessed group was required to have the generated harness tooling active, and it passed that validation.

| Group | Attempt Budget | Active MCPs | Active Tools | Final Score |
| Plain Codex | 5m | 0 | 0 | 33,975,173 |
| Harnessed Codex | 5m | 1 | 15 | 1,495,982 |

Harnessed replay result:

  • 95.60% lower best_cycles than plain Codex.
  • 22.71x lower best_cycles.
  • Objective verifier completed successfully for both groups.

Active MCP tools in the harnessed replay:

  • apply_best_verified
  • apply_candidate
  • benchmark_plan
  • bounded_exhaustive_search
  • candidate_diff
  • compare_history
  • export_candidate_fixture
  • local_neighborhood_search
  • numerical_check
  • rank_next_experiments
  • resume_search_history
  • run_objective
  • search_plans
  • trace_summary
  • validate_plan

Notes

  • This was a single A/B trial with one run per group, not a statistical benchmark.
  • Iteration 2 exposed a real edge case: source-workspace activation can diverge from fresh qualification when generated self-tests depend on mutable task state. Final fresh qualification and the A/B harness replay were clean.
  • The compare gate now rejects harnessed trials when required generated MCP tools are unavailable, which protects against accidental "harnessed" runs that did not actually activate the harness.
  • The run and compare artifacts are under tmp/ and are intentionally not committed.
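The compare gate mentioned in the notes above can be pictured as a set difference between required and active tools. The sketch below is an assumption about the shape of such a check, not the actual HarnessGym code; the function names and the error type are hypothetical.

```python
from typing import Iterable, List


def missing_required_tools(required: Iterable[str], active: Iterable[str]) -> List[str]:
    """Return required tool names that are not active, sorted for stable reporting."""
    return sorted(set(required) - set(active))


def check_harnessed_trial(required: Iterable[str], active: Iterable[str]) -> None:
    """Raise if a trial labeled "harnessed" lacks any required generated tool."""
    missing = missing_required_tools(required, active)
    if missing:
        raise RuntimeError(
            "harnessed trial rejected; missing MCP tools: " + ", ".join(missing)
        )


# Tool names taken from the harnessed replay list in this note.
required = ["run_objective", "validate_plan", "search_plans"]
active = ["run_objective", "validate_plan"]

print(missing_required_tools(required, active))  # ['search_plans']
```

Failing loudly here, rather than silently scoring the run, keeps a trial without working tools from being counted in the harnessed group.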