Key Takeaway: If your CFO doesn't know about it, it didn't happen.
There's a version of AI transformation where the metrics are great and the business is the same. Adoption rates, sessions per user, prompts processed, hours saved in survey responses. The deck looks good. The steering committee is pleased. The vendor sends a case study request.
And then you look at the operating statement and nothing moved.
I've been on both sides of this: engagements where the metrics were real and the business changed, and engagements where the activity was real and the metrics measured only the activity. The difference isn't the technology. It's whether anyone asked the P&L question before the work started.
The P&L question is simple: what number in the operating statement changes, and by how much? Not "how much time will employees save" — what does that translate to in headcount, in margin, in revenue per unit? If you can't answer that question at the start, you're building toward a metric that feels like progress and isn't.
Six months ago I walked into a $120M specialty finance operation that was unprofitable. The mandate wasn't "deploy AI." It was "fix the business." We deployed 50 agents and replaced 20 people with 10 plus the agents. Operating margins improved by 1,000 basis points, or 10 percentage points. The business went from unprofitable to profitable. That's the P&L test passing.
The agents weren't the cause. The organizational restructuring was. The agents made the restructuring possible — you can't replace 20 people with 10 unless the 10 have leverage the 20 didn't. But if I'd measured the engagement by "agents deployed" or "processes automated," I could have reported success with zero financial impact. The temptation is real. The activity numbers are always impressive.
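For readers less used to basis points, here is a minimal sketch of the arithmetic behind a 1,000 basis point margin improvement. The revenue and operating income figures are hypothetical, chosen only to make the calculation visible. They are not the numbers from the engagement described above.

```python
def operating_margin(revenue: float, operating_income: float) -> float:
    """Operating margin as a fraction of revenue (0.06 means 6%)."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return operating_income / revenue


def margin_change_bps(before: float, after: float) -> float:
    """Change between two margins, in basis points (1 bp = 0.01 percentage points)."""
    return (after - before) * 10_000


# Hypothetical figures in $M, for illustration only.
before = operating_margin(revenue=100.0, operating_income=-4.0)  # unprofitable
after = operating_margin(revenue=100.0, operating_income=6.0)    # profitable
change = margin_change_bps(before, after)

print(f"{before:.1%} -> {after:.1%}: {change:+.0f} bps")
```

With these illustrative inputs the margin moves from negative 4% to positive 6%, which is 10 percentage points, or 1,000 basis points. The point of writing it down is that the calculation needs only two lines from the operating statement, and nothing from an adoption dashboard.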
What the wrong metrics look like
Productivity proxies: hours saved, time-to-complete, error rate reduction. These are inputs to the P&L, not outcomes. An hour saved by an employee who still works full-time isn't an hour saved — it's an hour redirected. Redirected to what? If the answer is "other productive work," that productive work needs to be identifiable and measurable. If it isn't, you've improved a feeling, not a business.
Adoption metrics: daily active users, session length, features used. These measure whether the tool is being used, not whether using it made any difference. Nearly every enterprise software deployment posts impressive adoption metrics in its first quarter. They're the metric you optimize when you can't show anything else.
Satisfaction scores: employees report they're more productive, that the tool saves them time, that they'd recommend it to a colleague. Self-reported productivity data has a structural problem — nobody reports that a tool made no difference after their company paid for it and their manager asked them to use it.
None of these are useless. They're useful diagnostic signals. They become the problem when they're treated as outcomes.
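One way to keep a productivity proxy honest is to separate the nominal value of hours saved from the part that actually reaches the operating statement. The sketch below is an illustration under stated assumptions, not a standard accounting method. The class, its field names, and every figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoursSavedClaim:
    """A hypothetical 'hours saved' claim and where those hours went."""
    hours_saved_per_year: float
    loaded_cost_per_hour: float         # salary plus benefits and overhead
    share_removed_from_cost: float      # fraction that became lower headcount or spend
    share_redirected_to_revenue: float  # fraction redirected to measurable revenue work
    revenue_per_redirected_hour: float


def nominal_value(claim: HoursSavedClaim) -> float:
    """The number that usually appears in the deck: hours times cost."""
    return claim.hours_saved_per_year * claim.loaded_cost_per_hour


def pnl_impact(claim: HoursSavedClaim) -> float:
    """Only hours that left the cost base or produced revenue count."""
    shares = (claim.share_removed_from_cost, claim.share_redirected_to_revenue)
    if min(shares) < 0 or sum(shares) > 1:
        raise ValueError("shares must be non-negative and sum to at most 1")
    cost_reduction = (claim.hours_saved_per_year
                      * claim.share_removed_from_cost
                      * claim.loaded_cost_per_hour)
    added_revenue = (claim.hours_saved_per_year
                     * claim.share_redirected_to_revenue
                     * claim.revenue_per_redirected_hour)
    return cost_reduction + added_revenue


# Hypothetical survey result: 20,000 hours saved, nobody can say where they went.
survey_claim = HoursSavedClaim(20_000, 60.0, 0.0, 0.0, 0.0)
print(nominal_value(survey_claim), pnl_impact(survey_claim))
```

In this made-up case the nominal value is $1.2M and the P&L impact is zero, because none of the saved hours were removed from cost or tied to identifiable revenue. The two share parameters are the uncomfortable part: someone has to be willing to put a number on them.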
What the right metrics look like
Revenue per employee. Margin expansion. Cost per unit processed. Customer acquisition cost. Time to close. These show up in the financials without anyone's interpretation applied. If AI worked, one of these numbers moved. If none of them moved, you have a productivity theater problem, and the question is whether you want to know that now or in eighteen months when the renewal conversation happens.
The hard version of this is that the P&L test exposes engagements that looked successful and weren't. That's uncomfortable for everyone involved. The vendor has a case study. The internal champion got promoted for spearheading the initiative. The steering committee approved the budget. None of that changes whether the operating statement moved.
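The test described above can be sketched as a before-and-after comparison of a few metrics computed directly from the financials. This is a rough illustration, not a prescribed method. The field names, the three metrics chosen, the 5% threshold, and all figures are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodFinancials:
    """Hypothetical inputs pulled from one period's operating statement."""
    revenue: float
    operating_income: float
    operating_cost: float
    headcount: int
    units_processed: int


def pnl_metrics(p: PeriodFinancials) -> dict[str, float]:
    """Metrics that need no survey or interpretation to compute."""
    return {
        "revenue_per_employee": p.revenue / p.headcount,
        "operating_margin": p.operating_income / p.revenue,
        "cost_per_unit": p.operating_cost / p.units_processed,
    }


def metrics_that_moved(before: PeriodFinancials,
                       after: PeriodFinancials,
                       min_relative_change: float = 0.05) -> dict[str, float]:
    """Return the relative change of each metric that moved by at least the threshold.

    Assumes every 'before' metric is non-zero. Direction is not judged here:
    a metric can move the wrong way and still appear in the result.
    """
    b, a = pnl_metrics(before), pnl_metrics(after)
    moved = {}
    for name, old in b.items():
        relative_change = (a[name] - old) / abs(old)
        if abs(relative_change) >= min_relative_change:
            moved[name] = relative_change
    return moved


# Hypothetical figures in $M: same revenue and cost, fewer people.
before = PeriodFinancials(50.0, 5.0, 45.0, headcount=100, units_processed=10_000)
after = PeriodFinancials(50.0, 5.0, 45.0, headcount=80, units_processed=10_000)
print(metrics_that_moved(before, after))
```

An empty result is the productivity theater case: plenty of activity, no movement. In the example only revenue per employee moves, which also shows a limit of the sketch. A smaller headcount with unchanged operating cost should prompt the question of where the savings went.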
Why it matters for how you structure the work
If you're measuring the right thing from the start, the work looks different. You're not asking "what workflows can we automate?" You're asking "what does this business spend money on that it shouldn't, and can AI change that equation?" The answer is almost always in the cost structure and the headcount model, which means the real work is organizational, not technical. The AI is the lever. The organizational change is the lift.
That's a harder conversation to start. It's also the only one worth having.