What's the ROI of our AI? Here's a Template
How to Revolutionize Your Credibility
It’s not that hard to do a solid ROI analysis of the AI you’ve deployed.
All you need are lots of data on AI usage and the suite of key performance indicators (KPIs) you’re interested in, combined with the analytic tools of the Credibility Revolution.
What? You haven’t heard of the Credibility Revolution? That’s a shame, because it’s one of the most important achievements of 21st century economics, and one that’s perfectly suited to help business leaders understand just what they’re getting from their AI deployments. I wrote about the Credibility Revolution here, and discussed how it helps us understand the consequences of the AI revolution.
ChatGPT agrees with me that the CredRev is a big deal, “one of the biggest methodological shifts in the discipline,” but also states that “Outside of big tech and specialized areas, most businesses are not yet making heavy use of credibility revolution methods.”
Why not? Part of the answer is simple lack of familiarity. Over the past year I’ve asked a lot of business leaders if they’ve heard of this revolution. The only one who answered “yes” had a PhD in economics. Another part is the difficulty of getting the data needed to do solid analyses. Enterprises generate a ton of granular data these days. But that is very different from saying that all that data is readily available in a consistent format, widely accessible, and free of security and privacy concerns.
Power Tools with a Long Apprenticeship
The final reason for the slow diffusion of the Credibility Revolution into the business world is probably the most important: it takes a lot of training and skill to use its toolkit properly. Undergraduate and master's programs in economics now teach some of its methods, but becoming proficient in all of them requires a long apprenticeship. The pipeline of Credibility Revolution specialists is not only long, but also thin; for example, there are only about a thousand economics PhDs minted every year in the United States.
So I was wrong at the top of this post. It’s actually hard to do a solid, CredRev-level analysis of what AI did for an organization. But as a recent paper shows, it's well worth it to do such an analysis because of how much it reveals, and the many critical questions it answers.
The paper I'm talking about is "Generative AI at Work,” written by Erik Brynjolfsson, Danielle Li, and Lindsey Raymond (BLR, from now on), and recently published in the super-fancy Quarterly Journal of Economics, one of the top two journals in the discipline.
I admit to being biased about this paper. For one thing, it's about the business use of AI, a topic I'm spending a lot of time on these days. For another, the lead author is my colleague, coauthor, cofounder, and dear friend Erik B. Erik is a ridiculously talented and prolific author. And of all his papers this is one of my favorites, for reasons soon to be revealed.
The final reason I'm biased is that Erik and I, along with Daniel Rock and James Milin, co-founded Workhelix to bring this level and type of analysis from economics journals to enterprises. We believe that leadership teams making big decisions about AI efforts need the kind — and caliber — of analysis conducted in this paper. “Generative AI at Work” is rigorous, but it's not dry, abstract, or theoretical. Instead, it’s just what decision makers are looking for as they allocate capital and attention to AI, which is pretty clearly the most important innovation of their careers.
AI ROI Q&A
Why is BLR such an important paper for these folks? Because it provides a template for how to measure, understand, and amplify the results of an AI deployment. In this case, it was the deployment of an AI chatbot into a customer service environment. This bot didn't try to substitute for human agents; it instead coached them. It teed up suggested responses to queries that customers submitted via online chat; agents were free to use these suggestions or to ignore them.
I'll present BLR's results in the form of questions and answers. I'm imagining the questions coming during a board meeting (boards these days are really curious about how AI efforts are going) and the answers coming from a CEO or CTO. When providing answers, I'll rely heavily on the excellent graphs included in BLR; they tell a clear and compelling story¹, the TL;DR of which is that this AI project was a home-run success.
BLR don’t specify why the company adopted this AI, but it's a safe bet they did so at least in large part to improve customer service agent productivity, which in this organization was primarily measured in resolutions per hour of customers’ issues. So let's start there.
Did the new AI increase productivity?
Yes. We can say with high confidence that it did. Using the tools of the Credibility Revolution, we (that is, BLR) estimated the effect — the causal impact — of AI use on productivity (as measured in resolutions per hour) for several months both before and after the new technology went live. Here are the results:
Each dot in the graph represents the estimated productivity change brought on by AI in a given month. The vertical lines on each dot represent the confidence interval for that estimate. Loosely speaking, the confidence interval is the range in which we can be 95% sure that the true value actually falls. I like to think of confidence intervals as saying “If we spun up a hundred exactly identical universes and deployed the same AI within the same organization in all of them, we would expect that in about 95 of those universes the interval we computed would contain the actual AI-driven productivity boost at the organization.”
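That “hundred universes” intuition can be checked with a quick simulation. The sketch below is purely illustrative (my toy numbers, not BLR's code or data): it assumes a known true effect, re-runs the same study many times, builds a 95% confidence interval from each run, and counts how often the interval contains the truth.

```python
import math
import random
import statistics

random.seed(42)

TRUE_EFFECT = 0.15   # assumed true productivity lift (hypothetical)
N_AGENTS = 200       # agents observed per "universe"
NOISE_SD = 0.5       # spread of individual agent outcomes
N_UNIVERSES = 1000   # repetitions of the same study

covered = 0
for _ in range(N_UNIVERSES):
    # Each universe: noisy per-agent estimates of the same true effect.
    sample = [random.gauss(TRUE_EFFECT, NOISE_SD) for _ in range(N_AGENTS)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(N_AGENTS)
    lo, hi = mean - 1.96 * se, mean + 1.96 * se  # 95% confidence interval
    if lo <= TRUE_EFFECT <= hi:
        covered += 1

coverage = covered / N_UNIVERSES
print(f"CIs containing the true effect: {coverage:.1%}")  # close to 95%
```

Run it and the intervals cover the true effect in roughly 95% of the simulated universes, which is exactly what the 95% label promises.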
The gray vertical line represents the go-live date for the new AI. Note that for the months prior to AI adoption, the blue dots are all close to zero, and the confidence intervals all include zero. This is exactly what we want to see! After all, we're estimating the effect of AI on performance. So it would be a little bit troubling if we estimated a big positive performance impact from AI before the AI even went live. The fact that we don't see that adds to our confidence in these estimates.
Immediately after the AI goes live, the situation changes. BLR estimate large productivity gains due to the new AI, starting right away and continuing strong.
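The before/after logic of that event-study graph can be sketched on synthetic data (all numbers below are my assumptions for illustration, not BLR's): compare monthly productivity of agents who got the AI to a never-treated group, month by month, relative to the go-live date. Pre-period gaps should hover near zero; post-period gaps are the estimated effect.

```python
import random
import statistics

random.seed(0)

GO_LIVE = 6       # AI goes live at month 6 (hypothetical timeline)
TRUE_LIFT = 2.0   # resolutions/hour gained post-adoption (made up)
BASE = 10.0       # baseline resolutions/hour (made up)

def simulate(treated, months=12, n=150):
    """Monthly mean productivity for a group of n agents."""
    out = []
    for m in range(months):
        lift = TRUE_LIFT if (treated and m >= GO_LIVE) else 0.0
        out.append(statistics.fmean(
            random.gauss(BASE + lift, 1.5) for _ in range(n)))
    return out

treated = simulate(treated=True)
control = simulate(treated=False)

# Event-study style monthly gaps: treated minus control, by month.
gaps = [t - c for t, c in zip(treated, control)]
pre = statistics.fmean(gaps[:GO_LIVE])
post = statistics.fmean(gaps[GO_LIVE:])
print(f"pre-period gap: {pre:+.2f}   post-period gap: {post:+.2f}")
```

The pre-period gaps land near zero (the “dots near zero before go-live” check from the graph), while the post-period gaps recover the built-in lift.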
Are you sure?
Yes, we are quite confident in this finding for a few reasons.
First, we also measured other aspects of productivity, and other KPIs we would expect to see change if productivity really did increase:
Second, we took advantage of the fact that some agents never got the new AI during their first ten months on the job (these folks are represented by the blue line in the graph below), some got it after five to six months on the job (green line), and some had it from their very first day at the company (red line).² As the graph shows, it's pretty clear that productivity increased once agents got the new technology:
Third, we found that the more agents followed the AI’s suggestions, the higher their productivity gains:
Fourth, we ran the numbers lots of different ways. There's not one best way to conduct a Credibility Revolution analysis, and different techniques can yield different results.³ But in this case, the different techniques yield very similar results:
Were there tradeoffs? Did any important performance measures suffer?
As far as we can tell, no. In addition to all of the above, we also looked at how customer and agent sentiment (as assessed by an AI analysis of chat logs) changed once the new AI was in place, and how often calls got escalated from agents to managers. Nothing got worse, and customer sentiment and manager escalation both improved a lot:
Did the AI help agents do their job better?
Yes. It allowed them to be more easily understood and to sound more fluent (about 80% of the agents were located outside the US):
The AI also enabled agents to take care of customers more quickly, especially on unfamiliar topics:
OK, but did the AI upskill or downskill agents?
It could be that all of the above good things happened because AI was spoon-feeding answers to agents, who just mindlessly pasted them into their conversations with customers. If that were the case, the agents probably wouldn't be learning to do their jobs better. If, on the other hand, agents actually read and thought about what the AI was recommending, they might well become more skillful at solving customer problems.
So which is it? We were able to address this question by taking advantage of AI outages: periods when the new system was down. If the AI were actually helping agents get better at their jobs, then agents would perform better during an AI outage than they did back before there was any AI at all. If, on the other hand, the AI weren't helping agents become more knowledgeable and skillful, then we'd expect their performance to deteriorate back to pre-AI levels during the AI outages.
So, again, which is it? There's pretty clear evidence that agents performed better during outages than they did before the AI was available:
The data here are a bit noisy, but Panel (B) shows that when there was an AI outage, agents still performed better than they did before the AI showed up.
What happened to agent turnover?
It went down a lot once the AI was in place. We found a 10 percentage point decline, from ~25% to ~15%, in attrition among agents with six months or less on the job (see second graph below).
So What’s the ROI?
BLR didn't include classic ROI calculations in their paper, but it's straightforward to use their findings to put together a solid analysis of the financial impact of this AI. It feels to me like there are three tiers of concreteness to such an ROI analysis:
Tier 1: Solid as a rock. Lower agent turnover rates directly translate into reduced onboarding costs. Using some U.S. and Filipino customer service agent onboarding costs that ChatGPT served me, I calculated something on the order of $2M in onboarding savings per year in this organization due to the new AI.
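As a sanity check on that kind of Tier 1 number, here's the back-of-the-envelope arithmetic. The ~10 percentage point attrition drop comes from BLR; the workforce size and per-hire onboarding cost below are placeholder assumptions of mine, not figures from the paper.

```python
# All inputs except the attrition drop are illustrative assumptions.
n_agents = 2000            # size of the agent workforce (assumed)
attrition_before = 0.25    # annual attrition among newer agents, pre-AI
attrition_after = 0.15     # post-AI: the ~10 pp drop BLR found
onboarding_cost = 10_000   # cost to recruit and train one replacement, $ (assumed)

hires_avoided = n_agents * (attrition_before - attrition_after)
annual_savings = hires_avoided * onboarding_cost
print(f"replacement hires avoided per year: {hires_avoided:.0f}")
print(f"annual onboarding savings: ${annual_savings:,.0f}")  # → $2,000,000
```

Swap in your own headcount and onboarding cost and the same two lines of arithmetic give you your organization's Tier 1 number.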
Tier 2: Real but flexible. The new AI clearly gave time back to both agents and their managers. It would be straightforward for the company to attach a dollar value to this time since they know what they're paying these people.
It's less straightforward to know what to do with this newly available pile of time/money. BLR found that the average productivity gains for agents were around 15%. Does that mean that the company should fire 15% of its agents now? That feels to me like a hasty and naive conclusion. Should they instead alter their agent and manager hiring plans? That feels more prudent to me, but what do I know? I don't work at the company and don't know what their priorities and strategies are.
But I do know that time is money, so the right way to think about time saved is that it's money back on the table. Should you add this money to your bank account by letting people go, or reinvest it by redeploying them? Again, I don't know; I don’t know your circumstances. But I rarely hear management teams these days talking about how easy it is to find qualified people.
Tier 3: Speculative but evidence-based. BLR found that customer sentiment improved with AI. Does this mean that customer retention will also improve? That the lifetime value of a customer will go up and customer acquisition costs go down thanks to the AI? BLR don’t assess this, and it's not an easy assessment to make; it would rely on a whole lot of data, a long time, and/or some big assumptions. But BLR's findings on customer sentiment are a solid base for a discussion about how the AI could be impacting customer service so far, and how much that improvement is worth.
The point here is that it’s often straightforward to convert KPI changes brought on by AI into financial impact estimates. What’s difficult is the “changes brought on by AI” part — getting the causal inference right.
What Should We Do Now?
BLR demonstrate that this AI project was a big win for the adopting company. Their research also provides guidance on what to do next: scale this technology up. If it's not being used by all customer service agents at the company, it sure feels like it should be.
Beyond that obvious conclusion, BLR also tee up some intriguing, potentially profit-increasing hypotheses to test:
Should managers have a greater span of control?
Could agent staffing ratios be decreased without sacrificing quality?
Can the company decrease its hiring requirements for experience and fluency (again, without sacrificing quality)?
Do agents now need less formal training, given how good the AI appears to be at upskilling them?
Should agents and their managers be measured on adherence to AI suggestions? Should adherence to AI become an OKR?
The best way to answer these questions — “best” in the sense of “yielding the most precise estimates” — is via randomized controlled trials (RCTs). Alongside the toolkit of observational causal inference methods, RCTs are the other major pillar of the Credibility Revolution. Most of BLR is observational causal inference: the researchers didn’t design and run an experiment at the company; they instead took advantage of the “natural experiment” of AI deployment there (and the fact that the deployment was staggered). RCTs, on the other hand, are experimental causal inference.
An RCT on agent training, for example, would randomly take half of a group of agents, reduce the amount of training they get, and see whether their KPIs hold steady. If they do, it's a pretty safe bet — as safe as you'll find in the world of social science — that training can be reduced without harming outcomes of interest.
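A minimal version of that training RCT looks like this on synthetic data (group sizes, baseline productivity, and noise are all my assumptions; here the simulation builds in a world where reduced training truly has no effect, so a well-run trial should find a difference near zero):

```python
import random
import statistics

random.seed(1)

agents = list(range(200))
random.shuffle(agents)                 # randomize assignment: the heart of an RCT
reduced_training = agents[:100]        # treatment arm: less formal training
full_training = agents[100:]           # control arm: status quo training

def productivity(agent_id):
    """Resolutions/hour; by construction, training level has no effect here."""
    return random.gauss(10.0, 1.5)

treat_mean = statistics.fmean(productivity(a) for a in reduced_training)
ctrl_mean = statistics.fmean(productivity(a) for a in full_training)
diff = treat_mean - ctrl_mean
print(f"reduced-training mean: {treat_mean:.2f}")
print(f"full-training mean:    {ctrl_mean:.2f}")
print(f"difference:            {diff:+.2f}")
```

Because assignment was random, any large gap between the two means could be attributed to the training change itself; here the gap comes out near zero, which is the “KPIs hold steady” verdict.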
As these examples show, BLR’s findings suggest a number of further interventions, beyond just spreading the existing AI, that could have a positive impact. The smart thing for the company to do is to set up a program to test those interventions via RCTs and/or observational causal inference methods (à la BLR).
And Now a Word from Our Sponsor
If any of the above sounds interesting to you, let us know. “Us” here is Workhelix. We do BLR-style analyses, RCTs, and the rest of the CredRev tools, and we do them well. I say this with some confidence because B is one of our co-founders, and because we have a team of causal inference specialists, data scientists, machine learning operations geeks, forward-deployed engineers and strategists, and the other talent we need to bring the Credibility Revolution from economics to enterprises. If you want clean, clear, confident answers to your questions about the results of your AI investments, please get in touch. We'd love to talk to you.
¹ Some of these graphs are in the main body of the paper; others are in the online appendix.
² Even though Erik has left the Boston area for Stanford, I’m pleased to see that graphs in his papers still use the color scheme of lines on the T.
³ Learning all these different approaches and building up judgment about when to use each is part of the “fun” of being an economics doctoral student these days.