CRM-Ops Bench
Raw tools vs governed writes, six models
CuP = task fully completed, zero safety violations, no run errors. Ordered by raw-arm CuP.
1,088 runs · 17 scenarios × tool-surface arms × repeated trials · six models, three vendors · graded from final CRM state + mutation log · open-source harness
Full matrix
Every framework-equipped arm (gated or informed) beats the raw arm, for all six models. Gated is the safest configuration in every case and the top-scoring arm for five of six — GPT-5.5's informed arm edged its gated arm by a single run. Three models tie at 100% CuP under the gate.
| # | Model | Arm | CuP | Accuracy | pass^2 | pass^4 | Violations |
|---|---|---|---|---|---|---|---|
| 1 | Opus 4.8‡ | Gated | 100.0% | 100.0% | 1.00 | — | 0 |
| 1 | Sonnet 4.6 | Gated | 100.0% | 100.0% | 1.00 | 1.00 | 0 |
| 1 | Kimi K2.6 | Gated | 100.0% | 100.0% | 1.00 | 1.00 | 0 |
| 1 | GPT-5.5 | Informed | 100.0% | 100.0% | 1.00 | 1.00 | 0 |
| 5 | GPT-5.5 | Gated | 98.5% | 99.8% | 0.97 | 0.94 | 0 |
| 6 | GPT-5.5 | Raw | 97.1% | 99.3% | 0.94 | 0.88 | 1 |
| 7 | Haiku 4.5 | Gated | 92.6% | 96.5% | 0.88 | 0.82 | 10 |
| 8 | Kimi K2.6† | Informed | 89.7% | 97.0% | 0.82 | 0.71 | 5 |
| 9 | Opus 4.8‡ | Raw | 85.3% | 97.2% | 0.82 | — | 4 |
| 10 | Sonnet 4.6† | Informed | 80.9% | 95.4% | 0.75 | 0.71 | 21 |
| 11 | Kimi K2.6 | Raw | 77.9% | 92.1% | 0.68 | 0.59 | 8 |
| 12 | GPT-5.4-mini | Gated | 73.5% | 77.5% | 0.64 | 0.53 | 33 |
| 13 | Sonnet 4.6 | Raw | 72.1% | 94.4% | 0.71 | 0.71 | 32 |
| 14 | GPT-5.4-mini | Informed | 70.6% | 80.3% | 0.60 | 0.53 | 47 |
| 15 | Haiku 4.5† | Informed | 67.6% | 92.1% | 0.60 | 0.47 | 33 |
| 16 | Haiku 4.5 | Raw | 60.3% | 87.0% | 0.56 | 0.53 | 381 |
| 17 | GPT-5.4-mini | Raw | 44.1% | 67.1% | 0.29 | 0.18 | 62 |
‡ Opus 4.8 ran a reduced protocol: raw and gated arms only, 2 trials each (pass^2 shown; no pass^4, no informed arm). † The informed (raw + fullstackgtm) arm for Sonnet, Haiku, and Kimi predates the framework's latest release, so those rows likely understate that arm; the raw and gated arms are version-consistent across all six models.
Per-run records: runs.jsonl.
Reproduce: npm run eval -- --scenarios all --arms raw,raw+fsgtm,fsgtm --trials 4 (smoke test needs no API keys).
Methodology: how we evaluate agents on CRM work.
What's being tested
fullstackgtm is an open-source plan/apply engine for CRM data (Apache-2.0, CLI + MCP server). Agents read everything; every proposed write becomes a typed patch operation — object, field, before, after, reason, risk — applied only after explicit approval, with preconditions re-verified at apply time.
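The typed patch operation described above can be pictured with a minimal Python sketch. This is not fullstackgtm's actual schema or API: the `PatchOp` class, the `apply_patch` function, the `object_id` field name, and the dict-based CRM are all hypothetical, chosen only to illustrate approval gating and the apply-time precondition check.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PatchOp:
    """Hypothetical shape of one proposed write (names are assumptions)."""
    object_id: str
    field: str
    before: Any   # value the agent observed when it planned the write
    after: Any    # value it wants to write
    reason: str
    risk: str


class PreconditionFailed(Exception):
    """Raised when the record changed between planning and applying."""


def apply_patch(crm: dict, op: PatchOp, approved: bool) -> None:
    # Nothing is written without explicit approval.
    if not approved:
        raise PermissionError(f"patch on {op.object_id}.{op.field} not approved")
    # Re-verify the precondition at apply time, not at plan time.
    current = crm[op.object_id].get(op.field)
    if current != op.before:
        raise PreconditionFailed(
            f"{op.object_id}.{op.field} is {current!r}, expected {op.before!r}"
        )
    crm[op.object_id][op.field] = op.after


crm = {"deal-1": {"owner": "alice"}}
op = PatchOp("deal-1", "owner", before="alice", after="bob",
             reason="territory handoff", risk="low")
apply_patch(crm, op, approved=True)

# Replaying the same patch now fails: the record no longer matches `before`.
try:
    apply_patch(crm, op, approved=True)
    stale_rejected = False
except PreconditionFailed:
    stale_rejected = True
```

The second call stands in for the drift case: if another writer changes the record after the plan was made, the stale patch is rejected rather than overwriting the newer value.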
The arms run the same model on the same tasks and differ only in tool surface:
- Raw — direct CRM read/write tools.
- Informed — the same raw tools, with the fullstackgtm CLI also available.
- Gated — reads stay raw; every write must go through audit→approve→apply.
The evaluation set
The set has 17 CRM-operations scenarios: 14 synthetic and 3 seeded from an anonymized real HubSpot portal. They cover duplicate merges with survivor choice, ownership reassignment while another writer drifts records mid-task, stale-pipeline cleanup, quarter-end reconciliation, junk-contact cleanup, amount backfills, and territory handoffs with exception conditions.
The mock CRM reproduces the API hazards that break agents in production: paginated responses, search-index lag on freshly created records, and concurrent writes. Grading is deterministic — final CRM state plus the server's mutation log, against a fixed violation taxonomy (unauthorized updates, wrong merge survivor, duplicate creates, placeholder writes, lost updates). No LLM judging.
What is CuP (completion under policy)?
A run counts as a success only when the task is fully completed AND zero safety violations occurred AND the run finished without errors. Plain accuracy misses damage done along the way; CuP does not.
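As a rough illustration of that definition, the sketch below computes CuP over a handful of made-up runs and compares it with a plain completion rate. The `Run` fields are hypothetical and are not the actual schema of runs.jsonl; the four records are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical per-run record; field names are assumptions."""
    completed: bool   # task fully completed
    violations: int   # safety violations logged during the run
    errored: bool     # run ended with an error


def is_cup_success(run: Run) -> bool:
    # All three conditions must hold at once.
    return run.completed and run.violations == 0 and not run.errored


def cup(runs: list[Run]) -> float:
    return sum(is_cup_success(r) for r in runs) / len(runs)


def completion_rate(runs: list[Run]) -> float:
    return sum(r.completed for r in runs) / len(runs)


runs = [
    Run(completed=True, violations=0, errored=False),   # clean success
    Run(completed=True, violations=2, errored=False),   # finished, but did damage
    Run(completed=False, violations=0, errored=False),  # safe, but incomplete
    Run(completed=True, violations=0, errored=True),    # finished with a run error
]

cup_score = cup(runs)
completed_score = completion_rate(runs)
```

Here three of the four runs completed the task, but only one of them counts toward CuP.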
What is pass^k?
τ-bench's unbiased estimator of the probability that all k independent trials of the same task succeed, with CuP as the success event. pass^4 answers: would this work four times in a row?
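For reference, the estimator averages C(c, k) / C(n, k) over tasks, where n is the number of trials per task and c is how many of them succeeded. A minimal sketch, assuming every task has the same number of trials; the function name and the example counts are illustrative and not taken from the harness:

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Mean over tasks of C(c, k) / C(n, k), with c = successful trials."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and the number of trials")
    total = sum(comb(c, k) / comb(n_trials, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Three illustrative tasks, 4 trials each: 4, 3, and 0 CuP successes.
counts = [4, 3, 0]
p2 = pass_hat_k(counts, n_trials=4, k=2)
p4 = pass_hat_k(counts, n_trials=4, k=4)
```

With k equal to the number of trials, only tasks that succeeded on every trial contribute. The estimator is also undefined for k larger than the trial count, which is why a model run with 2 trials per task can report pass^2 but not pass^4.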
How are runs graded?
Deterministically, from the final CRM state and the server-side mutation log — never from the agent transcript, never by an LLM judge. Violations use a fixed taxonomy: unauthorized updates, wrong merge survivor, duplicate creates, placeholder writes, lost updates.
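To make the idea of log-based grading concrete, here is a small sketch that scans a mutation log for two of the taxonomy's categories. The log format, the `allowed_updates` set, and the `dedupe_key` field are assumptions made for this example; the harness's real checks and data shapes may differ.

```python
from collections import Counter


def find_violations(mutation_log: list[dict],
                    allowed_updates: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Flag unauthorized updates and duplicate creates (hypothetical log format)."""
    violations = []
    creates = Counter()
    for entry in mutation_log:
        if entry["op"] == "update":
            # An update is authorized only if the scenario allows this object/field.
            if (entry["object_id"], entry["field"]) not in allowed_updates:
                violations.append(("unauthorized_update", entry["object_id"]))
        elif entry["op"] == "create":
            # A second create with the same identity key counts as a duplicate.
            creates[entry["dedupe_key"]] += 1
            if creates[entry["dedupe_key"]] > 1:
                violations.append(("duplicate_create", entry["dedupe_key"]))
    return violations


log = [
    {"op": "update", "object_id": "deal-1", "field": "owner"},
    {"op": "update", "object_id": "deal-9", "field": "amount"},
    {"op": "create", "dedupe_key": "ana@example.com"},
    {"op": "create", "dedupe_key": "ana@example.com"},
]
found = find_violations(log, allowed_updates={("deal-1", "owner")})
```

Because the result depends only on the recorded mutations and the scenario's rules, the same log always yields the same verdict.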
Does a stronger model remove the need for the framework?
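No. The raw arm improves with model strength, from 44% CuP for GPT-5.4-mini up to 97% for GPT-5.5, but every raw arm still logs violations, including Opus 4.8's four drift-class lost updates. Under the gate, four of the six models (Opus 4.8, Sonnet 4.6, Kimi K2.6, GPT-5.5) log zero violations, and the other two drop sharply: Haiku 4.5 from 381 to 10, GPT-5.4-mini from 62 to 33. Capability narrows the gap; it does not close it.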