Skip to content

Claude MCP Tool-Use Telemetry - 2026-05-23

This note records the follow-up hardening after the first Claude Code tensor-layout validation showed activated MCP tools but no machine-readable evidence that Claude actually called them.

Implementation

  • harnessgym.claude_mcp_bridge now logs every generated MCP tools/call to .harnessgym/mcp_calls.jsonl.
  • Each event records server name, tool name, argument-key summary, duration, status, error message, and result size.
  • The Claude runner passes the active MCP server name and manifest tool timeout into the bridge.
  • The Codex exec runner now launches active generated MCP servers through harnessgym.mcp_telemetry_proxy, so Codex worker runs record the same JSONL tool-call telemetry without changing the generated MCP server.
  • harnessgym compare reads the JSONL telemetry and writes mcp_telemetry, mcp_call_count, and mcp_called_tools into compare_report.json.
  • harnessgym compare --require-harness-tool-use marks harnessed trials invalid when generated MCP tools activate but no tool call is recorded.
  • Attempt prompts now explicitly tell the agent that MCP activation alone does not count as harness use.

Validation

Unit suite:

.venv/bin/python -m pytest -q

Result:

63 passed, 6 subtests passed in 8.07s

Syntax checks:

bash -n examples/tensor_layout_harness_artifacts/run_claude_compare.sh
PYTHONPATH=src .venv/bin/python examples/tensor_layout_harness_artifacts/check_claude_compare_report.py --help >/dev/null

Real Claude Attempt

Started a stricter real replay:

ATTEMPT_TIMEOUT=10m TRIALS=2 ITERATIONS=1 POST_TIMEOUT=2m \
REQUIRE_HARNESS_TOOL_USE=1 \
OUTPUT_DIR=tmp/tensor_layout_claude_tooluse_10m_2trial_20260523T184552Z \
examples/tensor_layout_harness_artifacts/run_claude_compare.sh

The first two plain trials were valid:

  • Trial 1: post_score=1,495,982, attempt duration 397.284s.
  • Trial 2: post_score=1,495,982, attempt duration 600.222s.

Harnessed trial 1 immediately proved the new telemetry path by recording real generated MCP tool calls:

resume_search_history
apply_best_verified
trace_summary
rank_next_experiments
local_neighborhood_search
bounded_exhaustive_search
search_plans

The partial run was stopped because telemetry exposed a HarnessGym bug: the Claude bridge used a fixed 30 second MCP response timeout even though the generated tensor-plan manifest declares longer tool timeouts. Several bounded_exhaustive_search calls were clipped by the bridge. The bridge now honors each active server's manifest tool timeout.

Rerun Status

After the timeout fix, a fresh 2-trial rerun was started:

ATTEMPT_TIMEOUT=10m TRIALS=2 ITERATIONS=1 POST_TIMEOUT=2m \
REQUIRE_HARNESS_TOOL_USE=1 \
OUTPUT_DIR=tmp/tensor_layout_claude_tooluse_timeoutfix_10m_2trial_20260523T190630Z \
examples/tensor_layout_harness_artifacts/run_claude_compare.sh

Claude Code returned:

You're out of extra usage ยท resets 7:40pm (America/New_York)

The resulting report correctly marked all trials invalid because the runner failed before useful attempt work. That rerun is not a performance result. The next valid replay should use the same command after the Claude quota reset.