H100 Triton RMSNorm Experiment¶

Date: 2026-05-26 local / 2026-05-27 UTC

This experiment ran HarnessGym on a real NVIDIA H100 80GB HBM3 host over SSH. Codex ran locally with the exec runner, while all objective GPU scoring ran on the H100 through examples/triton_rmsnorm_h100_task/remote_h100.py.

Task¶

Template:

examples/triton_rmsnorm_h100_task/

Objective:

python3 verifier.py --json --mode final --warmup 10 --repeats 20

Score key: best_us; lower is better.
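
The `--warmup 10 --repeats 20` flags and the `best_us` key suggest a warmup-then-repeat timing loop that reports the fastest sample. The sketch below illustrates that pattern only; it is not the actual `verifier.py`. The function names are hypothetical, and taking the minimum over repeats is an assumption based on the key name. It uses CPU wall-clock time, whereas a real GPU verifier would need device synchronization around each timed call.

```python
import time


def best_us(fn, warmup=10, repeats=20):
    """Return the fastest of `repeats` timed calls to `fn`, in microseconds.

    Assumption: the score is the minimum over repeats (hence "best").
    """
    for _ in range(warmup):
        fn()  # untimed calls so one-time costs (compilation, caches) are excluded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return min(samples)


def is_better(candidate_us, incumbent_us):
    # Lower is better for this score key.
    return candidate_us < incumbent_us
```

With the scores reported later in this document, `is_better(103.328, 150.016)` is true and `is_better(107.552, 103.328)` is false.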

The task optimizes a Triton fused RMSNorm + SiLU gate kernel across held-out H100 shapes, including an 8192-wide row case that punishes dev-only tuning.
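
For readers unfamiliar with the operation, here is a minimal pure-Python reference for one row of a fused RMSNorm + SiLU gate. The exact formulation used by the task is not given in this document, so the argument names, the `eps` default, and the choice to apply SiLU to a separate `gate` vector are all assumptions made for illustration.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def rmsnorm_silu_gate_row(x, weight, gate, eps=1e-6):
    """Normalize one row by its root mean square, scale it, then gate it.

    Assumed formula: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i] * silu(gate[i])
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [xv * inv_rms * w * silu(g) for xv, w, g in zip(x, weight, gate)]
```

A Triton kernel for this operation would typically compute the row reduction and the elementwise gate in a single pass over the row, which is why row width (such as the 8192-wide case) matters for tuning.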

Command Run¶

rm -rf tmp/h100_triton_real_20260526
mkdir -p tmp/h100_triton_real_20260526
cp -R examples/triton_rmsnorm_h100_task/. tmp/h100_triton_real_20260526/

HARNESSGYM_GPU_HOST=<user@h100-host> \
HARNESSGYM_GPU_PORT=<ssh-port> \
HARNESSGYM_GPU_KEY=~/.ssh/id_ed25519 \
PYTHONPATH=src \
python3 -m harnessgym.cli run \
  --task tmp/h100_triton_real_20260526/task.md \
  --workspace tmp/h100_triton_real_20260526 \
  --iterations 2 \
  --attempt-timeout 5m \
  --reflection-timeout 3m \
  --build-timeout 5m \
  --post-attempt-command 'python3 remote_h100.py --workspace h100_triton_real_post -- python3 verifier.py --json --mode final --warmup 10 --repeats 20' \
  --post-attempt-score-key best_us \
  --post-attempt-timeout 3m \
  --score-key best_us \
  --stop-score 90 \
  --optimization-mode \
  --runner exec
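
The three `HARNESSGYM_GPU_*` variables carry the SSH target for remote scoring. As a rough illustration of how a wrapper could turn them into an `ssh` invocation, consider the sketch below. It is hypothetical and does not reproduce `remote_h100.py`: the function name, the default values, and the absence of any workspace syncing are assumptions.

```python
import os
import shlex


def build_ssh_argv(remote_cmd, env=None):
    """Build an ssh argv from the HARNESSGYM_GPU_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    host = env["HARNESSGYM_GPU_HOST"]  # e.g. user@h100-host
    port = env.get("HARNESSGYM_GPU_PORT", "22")  # assumed default
    key = os.path.expanduser(env.get("HARNESSGYM_GPU_KEY", "~/.ssh/id_ed25519"))
    # shlex.join quotes each argument so the remote shell sees the same words.
    return ["ssh", "-i", key, "-p", str(port), host, shlex.join(remote_cmd)]
```

The resulting list could be passed to `subprocess.run`, with the command's JSON output parsed locally to read `best_us`.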

Run id:

20260527T003911Z-32309bdb

Run artifacts were captured under:

tmp/h100_triton_real_20260526/.harnessgym/runs/20260527T003911Z-32309bdb/

Results¶

Baseline final score: 150.016 us
Best post-attempt score: 103.328 us
Best-score reduction: 31.1%
Stop score 90 us: not reached
Final status: tooling_built
Best checkpoint restored: yes
Post-restore verification: passed, 104.576 us on a repeated H100 verifier run
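
The headline figures above can be rechecked with a few lines of arithmetic. The variable names are illustrative, and treating the stop condition as "score at or below 90 us" is an assumption about how `--stop-score` is compared.

```python
def percent_reduction(baseline, candidate):
    """Relative improvement of `candidate` over `baseline`, in percent."""
    return (baseline - candidate) / baseline * 100.0


baseline_us = 150.016
best_us = 103.328
post_restore_us = 104.576

reduction = percent_reduction(baseline_us, best_us)
# Difference between the best score and the repeated run after restore.
drift = (post_restore_us - best_us) / best_us * 100.0
reached_stop = best_us <= 90.0  # assumed comparison direction

print(f"{reduction:.1f}% reduction, {drift:.1f}% rerun drift, stop reached: {reached_stop}")
# prints "31.1% reduction, 1.2% rerun drift, stop reached: False"
```

The roughly 1.2% gap between the best score and the post-restore rerun gives a sense of run-to-run variation on this host.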

Iteration details:

Iteration	Attempt	Harness State	Post-attempt final score	Notes
1	timed out at 5m	no generated harness yet	`103.328 us`	Codex improved the kernel and HarnessGym captured the independent H100 score despite timeout.
1 build	timed out at 5m, then repaired	generated skill + MCP + tests	n/a	Qualification caught a bad MCP self-test path; repair fixed it and qualification passed.
2	timed out at 5m	1 skill, 1 MCP, 7 active tools at attempt start	`107.552 us`	Fresh Codex session used generated MCP tools 10 times; candidate regressed vs iteration 1, so best checkpoint remained iteration 1.
2 build	completed	MCP expanded to 10 active tools	n/a	Qualification passed with `active_mcp_count=1`, `active_tool_count=10`.
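
The checkpoint decision in the table reduces to keeping the iteration with the lowest post-attempt score. The sketch below shows that selection under the assumption that scores are tracked per iteration; HarnessGym's actual checkpoint bookkeeping is not shown in this document, and `best_iteration` is a hypothetical name.

```python
def best_iteration(scores_us):
    """Return the iteration whose post-attempt score is lowest (lower is better)."""
    return min(scores_us, key=scores_us.get)


# Post-attempt final scores from the table, keyed by iteration number.
post_attempt_scores_us = {1: 103.328, 2: 107.552}
keep = best_iteration(post_attempt_scores_us)
print(f"restore checkpoint from iteration {keep}")  # prints "restore checkpoint from iteration 1"
```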

Generated Harness¶

Committed reusable artifact bundle:

examples/triton_rmsnorm_h100_harness_artifacts/.harnessgym/

Generated artifacts:

.harnessgym/skills/h100-triton-rmsnorm/SKILL.md
.harnessgym/mcp/h100_triton_rmsnorm/server.py
.harnessgym/mcp/h100_triton_rmsnorm/harnessgym-mcp.json
.harnessgym/tests/test_h100_triton_mcp.py

The final MCP exposes:

inspect_context
run_objective
sweep_kernel_config
restore_best_checkpoint
guarded_final_verify
sweep_launch_overrides
rank_history
diagnose_source
numerical_probe
update_result_json

Telemetry recorded 10 successful calls to the generated MCP tools in iteration 2, including inspect_context, diagnose_source, numerical_probe, run_objective, and rank_history.

Validation¶

Direct H100 baseline verifier before the run:

status=passed, best_us=151.040, device=NVIDIA H100 80GB HBM3

HarnessGym baseline post-attempt verifier:

status=passed, best_us=150.016, device=NVIDIA H100 80GB HBM3

Generated MCP self-tests after the run:

python3 tmp/h100_triton_real_20260526/.harnessgym/mcp/h100_triton_rmsnorm/server.py \
  --workspace tmp/h100_triton_real_20260526 \
  --self-test

python3 tmp/h100_triton_real_20260526/.harnessgym/tests/test_h100_triton_mcp.py

Result:

self-test passed
Ran 10 tests in 0.291s
OK

Notes¶

This run demonstrates that the end-to-end loop works on a real GPU task:

objective H100 scoring was captured before, during, and after HarnessGym iterations;
a same-session reflection/build produced a real MCP harness;
HarnessGym qualification caught and repaired a broken MCP self-test;
the next iteration started a fresh Codex session with the generated skill and MCP active;
generated MCP telemetry confirmed actual tool use, not just artifact activation;
optimization checkpointing restored the best known task workspace after a later regression.

The run did not show that the generated harness beats the no-harness iteration-1 candidate: within this short two-iteration budget, the harness-assisted attempt scored 107.552 us against 103.328 us for iteration 1. It did show that the real H100 loop works and that HarnessGym generated qualified, reusable tooling.