Skip to content

Results

These are real validation runs preserved in the repository. Each one ran the full HarnessGym path — generate an artifact, qualify it in a fresh workspace, activate it, let the agent actually call it through MCP, and score the result against a no-harness baseline.

What these numbers are

Engineering evidence, not a statistical claim. Each result is one (or a few) real Codex/Claude runs on one machine. They prove the mechanism works end to end: a generated, qualified, activated harness changes what a fresh session reaches in the same model-time budget. Repeat runs vary.

Every comparison below uses equal attempt budgets for the plain and harnessed arms. Where both arms time out, the result proves a better reached score in the same budget, not lower wall-clock completion.


Tensor layout pipeline

The hardest bundled CPU-style task. The agent optimizes kernel_plan.json, which controls tiling, tensor layouts, vector width, softmax strategy, accumulation, DMA staging, prefetch distance, descriptor burst size, split-K, scratchpad, scheduling, swizzling, and epilogue fusion. The benchmark exposes fast dev cases, held-out final cases, and detailed trace JSON.

A real five-iteration run generated a tensor-layout skill plus an MCP server with focused search, off-spine search, candidate proof, deadline-aware apply, fast focused apply, history, final gating, and self-tests.

Mode Before After Reduction
Dev 8,348,345 262,005
Final (held-out) 33,975,173 1,495,982 95.60%

This task is the recommended performance-proof when evaluating whether the harness helps: progress depends on inspecting trace/layout/descriptor interactions and automating sweeps, not on one obvious manual rewrite.


CPU attention autotune

A deterministic CPU custom-kernel autotuning proxy: the agent edits only kernel_config.json, verifies correctness, and optimizes best_cycles across fast dev and held-out final modes.

A five-iteration run generated a skill plus an MCP exposing objective inspection, config validation, dev/final evaluation, score-component analysis, deterministic hybrid search, one-shot autotune/apply/record, neighbor ranking, rollback-safe candidate application, history comparison, and result recording.

Arm Attempts Cumulative attempt time Final held-out score
Plain 5 (all timed out) 600.08 s 1,008,322
Harnessed 1 (timed out) 120.02 s 130,223

87.09% lower held-out best_cycles, reached with about 5× less cumulative attempt time. The generated harness was activated in a fresh replay workspace, used by real Codex through MCP, updated kernel_config.json, passed the final verifier, and beat the no-harness baseline.


C flash attention (CPU)

Optimizes kernel.c for the lowest best_cycles. The benchmark is hardened against benchmark-only tricks: it rejects source that patches Python timing, validates every timed repeat against varied inputs, and includes held-out guard cases.

A two-iteration deep-harness run built a skill plus an MCP exposing run_benchmark, numerical_check, assembly_summary, benchmark_variant, score_delta_profile, sweep_kernel_experiments, and rank_experiments, with a self-test covering framing, every tool, numerical cases, dev/final entrypoints, rollback, and error paths.

Arm Final held-out score (cycles)
Plain 189,498
Harnessed 169,005

Both arms completed two timed-out 300 s attempts; the harnessed arm reached a better verified score in the same budget.


C++ stencil

Optimizes a five-point stencil in kernel.cpp. The benchmark compiles with the system C++ compiler, checks correctness, reports cycles, and can emit assembly diagnostics. It is hardened against patching time.perf_counter_ns from the shared library.

A five-iteration run grew the generated MCP from 10 to 15 active tools — numerical tests, shape-alias tests, benchmark repeats, assembly diagnostics, rollback-safe source trials, NEON row sweeps, dispatch sweeps, microvariant sweeps, history, and ranking.

Arm Final score (cycles)
Plain 43,200
Harnessed 34,934

19.13% lower held-out best_cycles in the same attempt budget.


H100 Triton RMSNorm

The only bundled GPU task. Codex runs locally with the exec runner while all objective scoring runs on a real NVIDIA H100 80GB HBM3 host over SSH.

A two-iteration run improved the independently-verified H100 final score from 150.016 µs to a best checkpoint of 103.328 µs, generated a skill plus MCP server, repaired a qualification failure, and confirmed 10 generated MCP calls in the next fresh Codex session.

A longer four-iteration follow-up started from that generated bundle and improved the verified score from 142.848 µs to 99.744 µs, expanding the committed H100 MCP to 17 active tools: remote health checks, exact and approximate SiLU source sweeps, joint source-plus-launch search, repeated scoring, and next-experiment ranking.

See H100 Triton RMSNorm and the long run.


Paged attention decode

The original end-to-end proof. Starting from a correct but slow paged-attention decode kernel, a real Codex run generated a skill plus MCP server, then replayed the original task with and without those artifacts:

Arm best_ms Attempt duration
Plain 1.6759 144.6 s
Harnessed 1.3588 112.7 s

Generated artifacts: .harnessgym/skills/paged_attention_opt/SKILL.md and .harnessgym/mcp/paged_attention_harness/. The harnessed replay activated the skill into .agents/skills and the MCP server into .codex/config.toml.


Reproducing

Every result above is a harnessgym run followed by a harnessgym compare, with the exact commands recorded in the experiment notes and the project README. The general shape is:

# 1. Generate a harness on a training workspace.
harnessgym run --task task.md --workspace . --iterations 5 \
  --runner exec --optimization-mode --score-key best_cycles --stop-score 1 \
  --post-attempt-command "python3 benchmark.py --json --mode final" \
  --post-attempt-score-key best_cycles

# 2. Replay plain vs harnessed from a clean template, equal budgets.
harnessgym compare \
  --workspace-template examples/<task> --task task.md \
  --artifact-source <run-workspace-or-.harnessgym> \
  --output-dir tmp/<task>_compare --trials 1 --iterations 5 \
  --attempt-timeout 5m --runner exec \
  --score-key best_cycles --stop-score 1 \
  --post-command "python3 benchmark.py --json --mode final" \
  --post-score-key best_cycles

Replay & Compare → Examples →