Examples¶
The repository ships a set of self-contained task workspaces under
examples/. Each has a task.md, a benchmark/verifier, and (for the kernel
tasks) fast dev and held-out final modes. They range from a pure-Python warm-up
to a real GPU task, and two of them ship a committed, pre-generated harness
bundle you can replay against immediately.
| Example | Kind | Objective |
|---|---|---|
numerical_debug_task |
pure Python | fix normalized_dot to return cosine similarity |
paged_attention_optimization_task |
Python kernel | reduce best_ms of a paged-attention decode kernel |
cpu_attention_autotune_task |
config autotune | reduce best_cycles editing kernel_config.json only |
c_flash_attention_optimization_task |
C kernel | reduce best_cycles in kernel.c |
cpp_stencil_kernel_task |
C++ kernel | reduce best_cycles in a five-point stencil |
cpu_moe_kernel_task |
C kernel | reduce best_cycles of a top-2 MoE kernel |
tensor_layout_pipeline_task |
layout/DMA | reduce best_cycles editing kernel_plan.json |
triton_rmsnorm_h100_task |
GPU (H100) | reduce best_us of a Triton RMSNorm, scored over SSH |
Two committed harness bundles let you skip generation and replay directly:
examples/tensor_layout_harness_artifacts/ and
examples/triton_rmsnorm_h100_harness_artifacts/.
Start here: the offline demo¶
numerical_debug_task is pure Python and runs with the fake runner — no agent
account, no network:
harnessgym run \
--task examples/numerical_debug_task/task.md \
--workspace examples/numerical_debug_task \
--iterations 2 --attempt-timeout 10s --build-timeout 10s \
--runner fake
Tensor layout pipeline¶
The recommended performance-proof task. Optimizes kernel_plan.json — tiling,
layouts, vector width, DMA staging, prefetch, split-K, scheduling, swizzling,
epilogue fusion — with trace JSON exposing per-case component breakdowns.
harnessgym run \
--task examples/tensor_layout_pipeline_task/task.md \
--workspace examples/tensor_layout_pipeline_task \
--iterations 5 --attempt-timeout 5m --reflection-timeout 3m --build-timeout 6m \
--runner exec --stop-score 1 --score-key best_cycles \
--task-state continue --harness-depth deep
A committed bundle is kept outside the workspace template so plain trials don't see it, ready for a replay comparison:
harnessgym compare \
--workspace-template examples/tensor_layout_pipeline_task --task task.md \
--artifact-source examples/tensor_layout_harness_artifacts/.harnessgym \
--output-dir tmp/tensor_layout_compare \
--trials 1 --iterations 1 --attempt-timeout 5m --runner claude \
--score-key best_cycles --stop-score 1 --task-state continue \
--post-command "python3 benchmark.py --json --mode final" \
--post-score-key best_cycles --post-timeout 2m \
--require-harness-tool-use --overwrite
The bundle also ships run_claude_compare.sh, a preflight + audit wrapper that
defaults to REQUIRE_HARNESS_TOOL_USE=1.
CPU kernel tasks¶
These optimize real compiled kernels and have fast dev / held-out final modes plus optional assembly diagnostics:
harnessgym run \
--task examples/cpu_attention_autotune_task/task.md \
--workspace examples/cpu_attention_autotune_task \
--iterations 5 --attempt-timeout 2m --reflection-timeout 2m --build-timeout 4m \
--runner exec --stop-score 1 --score-key best_cycles \
--task-state continue --harness-depth deep
harnessgym run \
--task examples/cpp_stencil_kernel_task/task.md \
--workspace examples/cpp_stencil_kernel_task \
--iterations 5 --attempt-timeout 5m --reflection-timeout 3m --build-timeout 6m \
--runner exec --optimization-mode --score-key best_cycles --stop-score 1 \
--post-attempt-command "python3 benchmark.py --json --mode final" \
--post-attempt-score-key best_cycles
harnessgym run \
--task examples/cpu_moe_kernel_task/task.md \
--workspace examples/cpu_moe_kernel_task \
--iterations 5 --attempt-timeout 8m --reflection-timeout 3m --build-timeout 8m \
--runner exec --optimization-mode \
--post-attempt-command "python3 verifier.py --mode final --json" \
--post-attempt-score-key best_cycles --post-attempt-timeout 2m \
--stop-score 850000 --score-key best_cycles \
--task-state continue --harness-depth deep
The kernel benchmarks are hardened against benchmark-only tricks — they reject source that patches Python timing or dynamic symbol lookup, validate every timed repeat against varied inputs, and include held-out guard cases.
H100 Triton RMSNorm (real GPU)¶
The only GPU task. Codex runs locally; all objective scoring runs on a real
NVIDIA H100 80GB host over SSH via remote_h100.py.
rm -rf tmp/h100_triton_real && mkdir -p tmp/h100_triton_real
cp -R examples/triton_rmsnorm_h100_task/. tmp/h100_triton_real/
HARNESSGYM_GPU_HOST=<user@h100-host> \
HARNESSGYM_GPU_PORT=<ssh-port> \
HARNESSGYM_GPU_KEY=~/.ssh/id_ed25519 \
PYTHONPATH=src \
python3 -m harnessgym.cli run \
--task tmp/h100_triton_real/task.md \
--workspace tmp/h100_triton_real \
--iterations 2 --attempt-timeout 5m --reflection-timeout 3m --build-timeout 5m \
--post-attempt-command 'python3 remote_h100.py --workspace h100_triton_real_post -- python3 verifier.py --json --mode final --warmup 10 --repeats 20' \
--post-attempt-score-key best_us --post-attempt-timeout 3m \
--score-key best_us --stop-score 90 --optimization-mode --runner exec
The recorded outcomes for these examples are on the Results page; the full run notes are in Experiments.