CRM-Ops Bench

Raw tools vs governed writes, six models

CuP = task fully completed, zero safety violations, no run errors. Ordered by raw-arm CuP.

Raw — direct CRM tools onlyGated — writes via audit→approve→apply

1,088 runs · 17 scenarios × tool-surface arms × repeated trials · six models, three vendors · graded from final CRM state + mutation log · open-source harness

Full matrix

Every framework-equipped arm (gated or informed) beats the raw arm, for all six models. Gated is the safest configuration in every case and the top-scoring arm for five of six — GPT-5.5's informed arm edged its gated arm by a single run. Three models tie at 100% CuP under the gate.

#	Model	Arm	CuP	Accuracy	pass^2	pass^4	Violations
1	Opus 4.8^‡	Gated	100.0%	100.0%	1.00	—	0
1	Sonnet 4.6	Gated	100.0%	100.0%	1.00	1.00	0
1	Kimi K2.6	Gated	100.0%	100.0%	1.00	1.00	0
1	GPT-5.5	Informed	100.0%	100.0%	1.00	1.00	0
5	GPT-5.5	Gated	98.5%	99.8%	0.97	0.94	0
6	GPT-5.5	Raw	97.1%	99.3%	0.94	0.88	1
7	Haiku 4.5	Gated	92.6%	96.5%	0.88	0.82	10
8	Kimi K2.6^†	Informed	89.7%	97.0%	0.82	0.71	5
9	Opus 4.8^‡	Raw	85.3%	97.2%	0.82	—	4
10	Sonnet 4.6^†	Informed	80.9%	95.4%	0.75	0.71	21
11	Kimi K2.6	Raw	77.9%	92.1%	0.68	0.59	8
12	GPT-5.4-mini	Gated	73.5%	77.5%	0.64	0.53	33
13	Sonnet 4.6	Raw	72.1%	94.4%	0.71	0.71	32
14	GPT-5.4-mini	Informed	70.6%	80.3%	0.60	0.53	47
15	Haiku 4.5^†	Informed	67.6%	92.1%	0.60	0.47	33
16	Haiku 4.5	Raw	60.3%	87.0%	0.56	0.53	381
17	GPT-5.4-mini	Raw	44.1%	67.1%	0.29	0.18	62

‡ Opus 4.8 ran a reduced raw + gated × 2-trial protocol (pass^2 shown, no pass^4, no informed arm). † The informed (raw + fullstackgtm) arm for Sonnet, Haiku, and Kimi predates the framework's latest release — a conservative understatement of that arm; the raw and gated arms are version-consistent across all six models.

Per-run records: runs.jsonl. Reproduce: npm run eval -- --scenarios all --arms raw,raw+fsgtm,fsgtm --trials 4 (smoke test needs no API keys). Methodology: how we evaluate agents on CRM work.

What's being tested

fullstackgtm is an open-source plan/apply engine for CRM data (Apache-2.0, CLI + MCP server). Agents read everything; every proposed write becomes a typed patch operation — object, field, before, after, reason, risk — applied only after explicit approval, with preconditions re-verified at apply time.

The arms run the same model on the same tasks and differ only in tool surface:

Raw — direct CRM read/write tools.
Informed — the same raw tools, with the fullstackgtm CLI also available.
Gated — reads stay raw; every write must go through audit→approve→apply.

The evaluation set

17 CRM-operations scenarios — 14 synthetic, 3 seeded from an anonymized real HubSpot portal: duplicate merges with survivor choice, ownership reassignment while another writer drifts records mid-task, stale-pipeline cleanup, quarter-end reconciliation, junk-contact cleanup, amount backfills, territory handoffs with exception conditions.

The mock CRM reproduces the API hazards that break agents in production: paginated responses, search-index lag on freshly created records, and concurrent writes. Grading is deterministic — final CRM state plus the server's mutation log, against a fixed violation taxonomy (unauthorized updates, wrong merge survivor, duplicate creates, placeholder writes, lost updates). No LLM judging.

What is CuP (completion under policy)?

A run counts as a success only when the task is fully completed AND zero safety violations occurred AND the run finished without errors. Plain accuracy misses damage done along the way; CuP does not.

What is pass^k?

τ-bench's unbiased estimator of the probability that all k independent trials of the same task succeed, with CuP as the success event. pass^4 answers: would this work four times in a row?

How are runs graded?

Deterministically, from the final CRM state and the server-side mutation log — never from the agent transcript, never by an LLM judge. Violations use a fixed taxonomy: unauthorized updates, wrong merge survivor, duplicate creates, placeholder writes, lost updates.

Does a stronger model remove the need for the framework?

No. The raw arm improves with model strength — 44% CuP for GPT-5.4-mini up to 97% for GPT-5.5 — but every raw arm still logs violations, including Opus 4.8's four drift-class lost updates. The rails take all of them to zero. Capability narrows the gap; it does not close it.

Ready to build your GTM data foundation?

Book a 30-minute call. We'll map your current stack, identify the gaps, and outline what Stage 3+ looks like for your team.