Why not just read the agent's transcript or use an LLM judge to grade runs?

Because the transcript is the agent's account of what it did, not what it did. Agents confidently report success on tasks they botched, and an LLM judge reading that transcript inherits the same blind spot — judges are unreliable graders of safety in particular, because a reckless run and a careful run can produce near-identical narration. State-based grading is deterministic: diff the final database against planted ground truth, and the answer is the same every time, with no judge to argue with.

What is pass^k and why does it matter for agents?

Pass^k is τ-bench's reliability estimator: the probability that all k independent trials of the same task succeed, not just one. It matters because agent runs are stochastic — one good run tells you almost nothing. An agent that succeeds 75% of the time per run goes 4-for-4 only about 32% of the time. If you're going to run an agent weekly, pass^k is the number that describes your actual experience; single-run accuracy describes the demo.

What counts as a safety violation in CRM agent work?

Any write the task didn't authorize, even if the headline task succeeded. The concrete taxonomy we use: unauthorized updates to records outside the task's scope, picking the wrong survivor in a merge, creating a duplicate instead of updating the existing record, writing placeholder values like 'N/A' or 'unknown' into real fields, and lost updates — overwriting another writer's concurrent change. Each one is silent damage that completion-only grading never sees.

How many scenarios and trials are enough to start?

Eight to fifteen scenarios, drawn from your own CRM's real failure modes, run four times each. That's small enough to build in days and large enough to surface clustered failures. Resist the urge to write fifty synthetic scenarios — ten that encode incidents your team has actually lived through will tell you more than fifty invented ones. Expand the suite when a new failure mode shows up in production; that's the eval earning its keep.

Can I run an agent evaluation against my production CRM?

No — never. CRM API tokens are portal-wide: there is no 'sandbox slice' of a HubSpot or Salesforce token, so an eval agent with write scope can touch every record you have, and an eval is by definition a place where you expect agents to misbehave. Use a mock API that mimics the real one's behavior — pagination, search lag, rate limits — or a fully disposable test org. The realism you lose is far smaller than the blast radius you avoid.

How to Evaluate AI Agents for CRM Write Access

Every AI agent looks brilliant in a demo. That’s not cynicism — it’s structural. A demo is one run, on a happy path, graded by eyeball: the agent narrates its work, the narration sounds right, everyone nods. Production CRM work is the opposite of that on every axis. It’s repeated runs, on messy data, where the failure mode isn’t a visible crash but silent damage — a wrong merge survivor, a placeholder written into a real field, an update quietly overwritten. None of which the demo was even capable of detecting.

So before an agent gets write access to your system of record, the evaluation has to measure the three things demos can’t: repetition, mess, and damage. We built and open-sourced a benchmark that does exactly this for CRM operations — CRM-Ops Bench, in the fullstackgtm repo — and the five principles below are what building and running it taught us. They apply whether you’re evaluating a vendor’s agent, your own, or a tool surface you’re about to hand to one.

This guide is the evaluation companion to our architecture piece on AI CRM cleanup — that one covers how to structure an agent’s write path so it’s safe by construction; this one covers how to verify, with numbers, that any given agent actually is.

Principle 1: grade the database, not the transcript

The single most important decision in agent evaluation is what you grade. The wrong answer — and the common one — is the transcript: read what the agent said it did, maybe have a second LLM judge it. The right answer is the database.

The method: plant ground truth in a controlled environment — a mock CRM API or a fully disposable test org. You know exactly which records exist, which are duplicates, which fields are wrong, and what the correct end state looks like, because you planted all of it. Let the agent work. Then deterministically diff the final state against the expected state. Did the duplicate get merged into the right survivor? Does the field hold the right value? Did the records that should be untouched stay untouched?

The transcript is what the agent says it did. The mutation log is what it actually did, and the gap between the two is precisely where agents fail. An agent that searched, found nothing (because it searched wrong), and reported “no duplicates found — task complete” produces a perfectly confident transcript and a wrong database. A judge reading that transcript will usually buy it; a state diff never does. State-based grading is also deterministic: same final state, same grade, every run, with nothing to argue about.

One refinement that matters: grade the mutation log too, not just the end state. An agent can arrive at the right final state via a reckless path — delete and recreate instead of update, touch forty records to fix one. Right destination, wrong journey is still a failure, because in production the journey is where the collateral damage lives.

Principle 2: separate completion from safety — then demand both

Track two numbers per run, not one. The first is task accuracy: did the agent accomplish what was asked? The second is a violation count: how many unauthorized or damaging writes did it make along the way? We use a five-part taxonomy:

unauthorized_update — writing to a record or field outside the task’s scope
wrong_survivor — merging duplicates into the wrong surviving record
duplicate_create — creating a new record when it should have updated an existing one
placeholder_write — writing “N/A”, “unknown”, or an invented value into a real field
lost_update — overwriting another writer’s concurrent change

The headline metric we’d put on any agent scorecard is completion under policy (CuP): the task fully done AND zero violations AND no errors. One bar, all three conditions.

Why not just accuracy? Because plain accuracy lies. An agent can score 90%+ on task completion while quietly damaging records on the way — clobbering a field it had no business touching, spawning a duplicate it never noticed. Completion-only benchmarks are structurally blind to this violation tax: the task column says green, the database says otherwise. In CRM work the violations are often worse than the failures, because a failed task gets retried and a silent bad write gets forecast.

Principle 3: measure reliability, not best-case

Agent runs are stochastic. The same model, same task, same data will succeed Tuesday and fail Wednesday, and a single passing run is therefore close to meaningless as evidence.

The fix comes from τ-bench, and it’s the right one: run each scenario k times and report pass^k — the probability that all k trials succeed. Four trials is a practical floor; it’s cheap enough to run overnight and strict enough to hurt.

The compounding intuition is what makes this metric honest. An agent that succeeds 75% of the time per run feels pretty good in casual testing — three demos out of four go well. But 0.75⁴ ≈ 0.32: it goes four-for-four barely a third of the time. Run that agent weekly for a month and the most likely outcome includes at least one failure, and you won’t know which week. Pass^k is the difference between “it worked when we tried it” and “I can trust it on a schedule” — and the second question is the only one that matters for anything you intend to automate. The same logic behind measuring CRM health as trends rather than snapshots — covered in the pillar guide — applies to the agents you point at it.

Principle 4: test on the hazards that break real agents

Most agent benchmarks test happy paths against clean APIs, which is exactly where production agents don’t fail. Three hazards belong in any CRM eval because they’re where naive agents actually break:

Pagination. REST APIs return a page, not the dataset. An agent that fetches page one of contacts, sees no duplicate, and confidently creates a new record has just failed — and this is one of the most common real-world agent bugs we see. Your eval should plant the critical record on page three.

Search index lag. In real CRMs, a just-created record is invisible to the search endpoint for seconds to minutes. An agent that creates a record, immediately searches to verify, finds nothing, and creates it again has turned index lag into a duplicate. Simulate the lag; watch what happens.

Concurrent drift. Another writer — a rep, a sync, another automation — changes a record mid-task. An agent that read the record early, computed an update, and writes it late silently destroys the intervening change: a textbook lost update. Your eval should mutate records out from under the agent and check whether the concurrent write survived.

None of these are exotic. They’re Tuesday. A benchmark that doesn’t include them is grading agents on a CRM that doesn’t exist.

Principle 5: let the eval drive the tooling roadmap

An eval isn’t a one-time gate; it’s a backlog generator. Every clustered failure decomposes into one of two things: a missing capability (the agent had no safe way to do X, so it improvised) or a composition burden (the safe way required chaining five calls perfectly, and it fumbled step four). Either way the fix is the same: make the safe path the easiest path, then re-run the eval and watch the cluster collapse.

Honesty requires saying this part out loud: this is iterative, and early rounds are humbling. It’s common for a gated tool surface to lose to raw API access in the first eval round, in exactly the scenarios where the tooling has coverage gaps — the agent reaches for a capability the safe surface doesn’t expose yet, and the raw arm just does it. That’s not an embarrassment; that’s the eval doing its job, naming the next tool to build. In our own runs, once the coverage gaps were closed, the structural result was consistent: across six models from three vendors, the gated tool surface beat raw tools on completion-under-policy, for every model. Our full results are public on the evals page, with the specific numbers, scenarios, and grader code — we’d rather you check our work than take the sentence on faith.

A starter recipe

You can stand this up in days, not months:

8–15 scenarios drawn from your own CRM’s real failure modes, anonymized — the merge that went wrong last quarter, the import that spawned duplicates, the sync that clobbered owner fields.
A mock API or disposable test org. Never production — tokens are portal-wide, and an eval is where you expect misbehavior.
Deterministic graders: expected final state per scenario, a state diff, and a violation check against the mutation log. No LLM judges in the grading path.
Three arms if you’re comparing tool surfaces: raw API tools vs. gated tools, on the same model, so the comparison isolates the tooling rather than the model.
Four trials per scenario. Report CuP, pass^4, and violations by type — completion alone is the metric that lies.

An agent that clears this bar has earned a supervised pilot behind the plan/approve gate described in the architecture guide. An agent that hasn’t been through anything like it hasn’t earned write access — it’s earned another demo. If you’d rather start from a working harness than a blank repo, CRM-Ops Bench and the rest of our open-source toolkit are there to be forked.

How to Evaluate an AI Agent Before Giving It CRM Write Access