CPU MoE HarnessGym Real Smoke - 2026-05-24
Workspace copy:
Command:
PYTHONPATH=/Users/ptoulme/harnessgym/src python3 -m harnessgym.cli run \
--task tmp/cpu_moe_harnessgym_real_20260524T155549Z/task.md \
--workspace tmp/cpu_moe_harnessgym_real_20260524T155549Z \
--iterations 2 \
--attempt-timeout 5m \
--reflection-timeout 2m \
--build-timeout 5m \
--runner exec \
--post-attempt-command "python3 verifier.py --mode final --json" \
--post-attempt-score-key best_cycles \
--post-attempt-timeout 2m \
--optimization-mode \
--stop-score 850000 \
--score-key best_cycles \
--task-state continue \
--harness-depth deep \
--run-id real-smoke \
--no-qualify-artifacts
Result:
- Run artifacts: `tmp/cpu_moe_harnessgym_real_20260524T155549Z/.harnessgym/runs/real-smoke`
- Baseline final `best_cycles`: `7273200`
- Iteration 1 final `best_cycles`: `5362400` (`26.27%` reduction)
- Iteration 2 final `best_cycles`: `1026800` (`85.88%` reduction)
- Post-restore final verifier rerun: passed, `best_cycles=1067068`, `max_abs=1.2085805193434718e-08`
- Final status: `tooling_built`; target `850000` was not reached in this short smoke run.
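The reduction percentages above are both measured against the baseline, not against the previous iteration. A minimal sketch of that arithmetic follows; the helper name `percent_reduction` is hypothetical, and treating "target reached" as `cycles <= target` is an assumption about how the stop score is compared, chosen because lower cycle counts are better here.

```python
def percent_reduction(baseline: int, score: int) -> float:
    """Percent reduction of `score` relative to `baseline` (lower is better)."""
    return (baseline - score) / baseline * 100.0


# Values copied from the result list above.
baseline = 7273200
finals = {"iteration 1": 5362400, "iteration 2": 1026800}
target = 850000  # value passed as --stop-score

for name, cycles in finals.items():
    # Assumption: the target counts as reached when cycles <= target.
    reached = cycles <= target
    print(f"{name}: {percent_reduction(baseline, cycles):.2f}% reduction, "
          f"target reached: {reached}")
```

Under these assumptions the two lines report 26.27% and 85.88%, and neither iteration reaches the target.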
Harness behavior observed:
- Iteration 1 built `.harnessgym/skills/cpu-moe-optimizer/SKILL.md`.
- Iteration 1 built `.harnessgym/mcp/cpu_moe_harness/` with a Content-Length stdio MCP server.
- Iteration 1 built `.harnessgym/tests/test_cpu_moe_harness.py`.
- Iteration 2 started as a fresh Codex session with the generated skill and MCP active.
- Iteration 2 recorded `16` generated MCP calls across: `assembly_summary`, `batch_plan`, `config_sweep`, `history_compare`, `numerical_check`, `rank_next_experiments`, `repeat_benchmark`, `run_benchmark`, and `trace_routes`.
- Iteration 2 extended the MCP with `batch_plan` and `repeat_benchmark`.
- Generated tooling tests passed: `python3 .harnessgym/tests/test_cpu_moe_harness.py`, `python3 -m py_compile .harnessgym/mcp/cpu_moe_harness/server.py .harnessgym/tests/test_cpu_moe_harness.py`, `batch_plan_mcp_smoke`, and `repeat_benchmark_mcp_smoke`.
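For readers unfamiliar with Content-Length framing over stdio, here is a rough sketch of the idea: each JSON message is preceded by a `Content-Length: N` header and a blank line, the same header style the Language Server Protocol uses. This is not the generated `server.py`; the helpers `write_message` and `read_message` are hypothetical, and the empty `arguments` object in the sample request is an assumption, since the report does not show what the tools accept.

```python
import io
import json


def write_message(stream, payload: dict) -> None:
    """Write one JSON message preceded by a Content-Length header."""
    body = json.dumps(payload).encode("utf-8")
    stream.write(f"Content-Length: {len(body)}\r\n\r\n".encode("ascii"))
    stream.write(body)


def read_message(stream):
    """Read one framed message; return None if the stream ends first."""
    length = None
    while True:
        line = stream.readline()
        if not line:
            return None  # EOF before the header block was complete
        if line in (b"\r\n", b"\n"):
            break  # blank line terminates the header block
        name, _, value = line.decode("ascii").partition(":")
        if name.strip().lower() == "content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(stream.read(length).decode("utf-8"))


# Round trip through an in-memory buffer standing in for stdin/stdout.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    # Tool name taken from the list above; the arguments are assumed empty.
    "params": {"name": "run_benchmark", "arguments": {}},
}
buf = io.BytesIO()
write_message(buf, request)
buf.seek(0)
assert read_message(buf) == request
```

Against a real server the same helpers would wrap the process's binary stdin and stdout instead of a `BytesIO` buffer.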
Implementation found by the run:
- Switched `kernel_config.json` from `route_mode=token` to `route_mode=bucketed`.
- Added NEON-assisted batched expert paths in `moe_kernel.c`.
- The remaining worst final case is `final_224x80x128_e16_adversarial`, around `1.03M` cycles on the validation rerun.
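The report does not spell out what the two `route_mode` values do inside `moe_kernel.c`. The sketch below shows one plausible reading, purely as an assumption: token mode dispatches each token to its expert one at a time, while bucketed mode first groups token indices by expert so each expert can process its tokens as one batch. The function names, the sample routing, and the top-1 routing are all hypothetical.

```python
from collections import defaultdict


def route_per_token(expert_ids):
    """One dispatch per token: (expert, [token_index]) in token order."""
    return [(expert, [token]) for token, expert in enumerate(expert_ids)]


def route_bucketed(expert_ids):
    """Group token indices by expert so each expert is dispatched once."""
    buckets = defaultdict(list)
    for token, expert in enumerate(expert_ids):
        buckets[expert].append(token)
    return sorted(buckets.items())


# Hypothetical top-1 routing: token i is assigned to expert expert_ids[i].
expert_ids = [2, 0, 2, 1, 0, 2]

print(len(route_per_token(expert_ids)))  # 6 dispatches, one per token
print(route_bucketed(expert_ids))        # [(0, [1, 4]), (1, [3]), (2, [0, 2, 5])]
```

If the kernel works this way, bucketing is also what would make batched expert paths possible, since each expert then sees a contiguous list of tokens rather than one token at a time.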
Notes:
- The score is timing-based and will vary slightly by host load; the HarnessGym run summary records `1026800`, while immediate verifier reruns observed `1033865` and post-restore observed `1067068`.
- This was a two-iteration smoke, not a full long comparison against equal-budget vanilla Codex.
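To put the timing variation in the first note into numbers, the following sketch expresses each rerun as a percentage drift from the run-summary score. The helper name `drift_percent` is hypothetical; the cycle counts are the ones quoted in the note.

```python
def drift_percent(reference: int, observed: int) -> float:
    """Signed percentage difference of `observed` relative to `reference`."""
    return (observed - reference) / reference * 100.0


summary = 1026800  # score recorded in the HarnessGym run summary
reruns = {"immediate rerun": 1033865, "post-restore rerun": 1067068}

for name, cycles in reruns.items():
    print(f"{name}: {cycles} cycles, "
          f"{drift_percent(summary, cycles):+.2f}% vs. run summary")
```

With these figures the immediate rerun is about 0.69% slower than the summary score and the post-restore rerun about 3.92% slower.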