Skip to content

Examples

The repository ships a set of self-contained task workspaces under examples/. Each has a task.md, a benchmark/verifier, and (for the kernel tasks) fast dev and held-out final modes. They range from a pure-Python warm-up to a real GPU task, and two of them ship a committed, pre-generated harness bundle you can replay against immediately.

Example Kind Objective
numerical_debug_task pure Python fix normalized_dot to return cosine similarity
paged_attention_optimization_task Python kernel reduce best_ms of a paged-attention decode kernel
cpu_attention_autotune_task config autotune reduce best_cycles editing kernel_config.json only
c_flash_attention_optimization_task C kernel reduce best_cycles in kernel.c
cpp_stencil_kernel_task C++ kernel reduce best_cycles in a five-point stencil
cpu_moe_kernel_task C kernel reduce best_cycles of a top-2 MoE kernel
tensor_layout_pipeline_task layout/DMA reduce best_cycles editing kernel_plan.json
triton_rmsnorm_h100_task GPU (H100) reduce best_us of a Triton RMSNorm, scored over SSH

Two committed harness bundles let you skip generation and replay directly: examples/tensor_layout_harness_artifacts/ and examples/triton_rmsnorm_h100_harness_artifacts/.


Start here: the offline demo

numerical_debug_task is pure Python and runs with the fake runner — no agent account, no network:

harnessgym run \
  --task examples/numerical_debug_task/task.md \
  --workspace examples/numerical_debug_task \
  --iterations 2 --attempt-timeout 10s --build-timeout 10s \
  --runner fake

Tensor layout pipeline

The recommended performance-proof task. Optimizes kernel_plan.json — tiling, layouts, vector width, DMA staging, prefetch, split-K, scheduling, swizzling, epilogue fusion — with trace JSON exposing per-case component breakdowns.

harnessgym run \
  --task examples/tensor_layout_pipeline_task/task.md \
  --workspace examples/tensor_layout_pipeline_task \
  --iterations 5 --attempt-timeout 5m --reflection-timeout 3m --build-timeout 6m \
  --runner exec --stop-score 1 --score-key best_cycles \
  --task-state continue --harness-depth deep

A committed bundle is kept outside the workspace template so plain trials don't see it, ready for a replay comparison:

harnessgym compare \
  --workspace-template examples/tensor_layout_pipeline_task --task task.md \
  --artifact-source examples/tensor_layout_harness_artifacts/.harnessgym \
  --output-dir tmp/tensor_layout_compare \
  --trials 1 --iterations 1 --attempt-timeout 5m --runner claude \
  --score-key best_cycles --stop-score 1 --task-state continue \
  --post-command "python3 benchmark.py --json --mode final" \
  --post-score-key best_cycles --post-timeout 2m \
  --require-harness-tool-use --overwrite

The bundle also ships run_claude_compare.sh, a preflight + audit wrapper that defaults to REQUIRE_HARNESS_TOOL_USE=1.


CPU kernel tasks

These optimize real compiled kernels and have fast dev / held-out final modes plus optional assembly diagnostics:

harnessgym run \
  --task examples/cpu_attention_autotune_task/task.md \
  --workspace examples/cpu_attention_autotune_task \
  --iterations 5 --attempt-timeout 2m --reflection-timeout 2m --build-timeout 4m \
  --runner exec --stop-score 1 --score-key best_cycles \
  --task-state continue --harness-depth deep
harnessgym run \
  --task examples/c_flash_attention_optimization_task/task.md \
  --workspace examples/c_flash_attention_optimization_task \
  --iterations 5 --attempt-timeout 3m --build-timeout 3m \
  --runner exec --stop-score 1 --score-key best_cycles --task-state continue
harnessgym run \
  --task examples/cpp_stencil_kernel_task/task.md \
  --workspace examples/cpp_stencil_kernel_task \
  --iterations 5 --attempt-timeout 5m --reflection-timeout 3m --build-timeout 6m \
  --runner exec --optimization-mode --score-key best_cycles --stop-score 1 \
  --post-attempt-command "python3 benchmark.py --json --mode final" \
  --post-attempt-score-key best_cycles
harnessgym run \
  --task examples/cpu_moe_kernel_task/task.md \
  --workspace examples/cpu_moe_kernel_task \
  --iterations 5 --attempt-timeout 8m --reflection-timeout 3m --build-timeout 8m \
  --runner exec --optimization-mode \
  --post-attempt-command "python3 verifier.py --mode final --json" \
  --post-attempt-score-key best_cycles --post-attempt-timeout 2m \
  --stop-score 850000 --score-key best_cycles \
  --task-state continue --harness-depth deep

The kernel benchmarks are hardened against benchmark-only tricks — they reject source that patches Python timing or dynamic symbol lookup, validate every timed repeat against varied inputs, and include held-out guard cases.


H100 Triton RMSNorm (real GPU)

The only GPU task. Codex runs locally; all objective scoring runs on a real NVIDIA H100 80GB host over SSH via remote_h100.py.

rm -rf tmp/h100_triton_real && mkdir -p tmp/h100_triton_real
cp -R examples/triton_rmsnorm_h100_task/. tmp/h100_triton_real/

HARNESSGYM_GPU_HOST=<user@h100-host> \
HARNESSGYM_GPU_PORT=<ssh-port> \
HARNESSGYM_GPU_KEY=~/.ssh/id_ed25519 \
PYTHONPATH=src \
python3 -m harnessgym.cli run \
  --task tmp/h100_triton_real/task.md \
  --workspace tmp/h100_triton_real \
  --iterations 2 --attempt-timeout 5m --reflection-timeout 3m --build-timeout 5m \
  --post-attempt-command 'python3 remote_h100.py --workspace h100_triton_real_post -- python3 verifier.py --json --mode final --warmup 10 --repeats 20' \
  --post-attempt-score-key best_us --post-attempt-timeout 3m \
  --score-key best_us --stop-score 90 --optimization-mode --runner exec

The recorded outcomes for these examples are on the Results page; the full run notes are in Experiments.