CRM-Ops Bench

Raw tools vs governed writes, six models

CuP = task fully completed, zero safety violations, no run errors. Ordered by raw-arm CuP.

0 20 40 60 80 100 97.1 98.5 GPT-5.5 85.3 100.0 Opus 4.8 77.9 100.0 Kimi K2.6 72.1 100.0 Sonnet 4.6 60.3 92.6 Haiku 4.5 44.1 73.5 GPT-5.4-mini
Raw — direct CRM tools onlyGated — writes via audit→approve→apply

1,088 runs · 17 scenarios × tool-surface arms × repeated trials · six models, three vendors · graded from final CRM state + mutation log · open-source harness

Full matrix

Every framework-equipped arm (gated or informed) beats the raw arm, for all six models. Gated is the safest configuration in every case and the top-scoring arm for five of six — GPT-5.5's informed arm edged its gated arm by a single run. Three models tie at 100% CuP under the gate.

#ModelArmCuPAccuracypass^2pass^4Violations
1 Opus 4.8 Gated 100.0% 100.0% 1.00 0
1 Sonnet 4.6 Gated 100.0% 100.0% 1.00 1.00 0
1 Kimi K2.6 Gated 100.0% 100.0% 1.00 1.00 0
1 GPT-5.5 Informed 100.0% 100.0% 1.00 1.00 0
5 GPT-5.5 Gated 98.5% 99.8% 0.97 0.94 0
6 GPT-5.5 Raw 97.1% 99.3% 0.94 0.88 1
7 Haiku 4.5 Gated 92.6% 96.5% 0.88 0.82 10
8 Kimi K2.6 Informed 89.7% 97.0% 0.82 0.71 5
9 Opus 4.8 Raw 85.3% 97.2% 0.82 4
10 Sonnet 4.6 Informed 80.9% 95.4% 0.75 0.71 21
11 Kimi K2.6 Raw 77.9% 92.1% 0.68 0.59 8
12 GPT-5.4-mini Gated 73.5% 77.5% 0.64 0.53 33
13 Sonnet 4.6 Raw 72.1% 94.4% 0.71 0.71 32
14 GPT-5.4-mini Informed 70.6% 80.3% 0.60 0.53 47
15 Haiku 4.5 Informed 67.6% 92.1% 0.60 0.47 33
16 Haiku 4.5 Raw 60.3% 87.0% 0.56 0.53 381
17 GPT-5.4-mini Raw 44.1% 67.1% 0.29 0.18 62

Opus 4.8 ran a reduced raw + gated × 2-trial protocol (pass^2 shown, no pass^4, no informed arm). The informed (raw + fullstackgtm) arm for Sonnet, Haiku, and Kimi predates the framework's latest release — a conservative understatement of that arm; the raw and gated arms are version-consistent across all six models.

Per-run records: runs.jsonl. Reproduce: npm run eval -- --scenarios all --arms raw,raw+fsgtm,fsgtm --trials 4 (smoke test needs no API keys). Methodology: how we evaluate agents on CRM work.

What's being tested

fullstackgtm is an open-source plan/apply engine for CRM data (Apache-2.0, CLI + MCP server). Agents read everything; every proposed write becomes a typed patch operation — object, field, before, after, reason, risk — applied only after explicit approval, with preconditions re-verified at apply time.

The arms run the same model on the same tasks and differ only in tool surface:

  • Raw — direct CRM read/write tools.
  • Informed — the same raw tools, with the fullstackgtm CLI also available.
  • Gated — reads stay raw; every write must go through audit→approve→apply.
The evaluation set

17 CRM-operations scenarios — 14 synthetic, 3 seeded from an anonymized real HubSpot portal: duplicate merges with survivor choice, ownership reassignment while another writer drifts records mid-task, stale-pipeline cleanup, quarter-end reconciliation, junk-contact cleanup, amount backfills, territory handoffs with exception conditions.

The mock CRM reproduces the API hazards that break agents in production: paginated responses, search-index lag on freshly created records, and concurrent writes. Grading is deterministic — final CRM state plus the server's mutation log, against a fixed violation taxonomy (unauthorized updates, wrong merge survivor, duplicate creates, placeholder writes, lost updates). No LLM judging.

What is CuP (completion under policy)?

A run counts as a success only when the task is fully completed AND zero safety violations occurred AND the run finished without errors. Plain accuracy misses damage done along the way; CuP does not.

What is pass^k?

τ-bench's unbiased estimator of the probability that all k independent trials of the same task succeed, with CuP as the success event. pass^4 answers: would this work four times in a row?

How are runs graded?

Deterministically, from the final CRM state and the server-side mutation log — never from the agent transcript, never by an LLM judge. Violations use a fixed taxonomy: unauthorized updates, wrong merge survivor, duplicate creates, placeholder writes, lost updates.

Does a stronger model remove the need for the framework?

No. The raw arm improves with model strength — 44% CuP for GPT-5.4-mini up to 97% for GPT-5.5 — but every raw arm still logs violations, including Opus 4.8's four drift-class lost updates. The rails take all of them to zero. Capability narrows the gap; it does not close it.

Ready to build your GTM data foundation?

Book a 30-minute call. We'll map your current stack, identify the gaps, and outline what Stage 3+ looks like for your team.