
MIT says 95% of enterprise GenAI pilots show no ROI; Google says most production users already see value. Both are right — because they’re measuring different worlds. The gap between pilots and production explains the divide.
If you followed the AI news cycle this summer, you probably felt whiplash. On one side, MIT’s Project NANDA says a stunning 95% of enterprise GenAI efforts are yielding no measurable P&L return, despite tens of billions in spend. On the other, Google’s latest global ROI of AI survey reports most production adopters already see ROI, and that agentic AI is accelerating value. Both can’t be true, unless they’re looking at different worlds. They are. And the gap between those worlds explains why so many pilots stall while a minority sprint ahead.
MIT’s world is the messy middle of enterprise AI: experiments, proofs of concept, and “pilot islands” that never touch real systems. The key finding is stark: $30–40B in enterprise GenAI spend; 95% of initiatives with zero P&L lift; only 5% of integrated pilots extracting meaningful value. The report labels this the GenAI Divide: a growing split between firms that can wire AI into work and those that can’t. It isn’t anti-AI; it’s anti-PowerPoint. The difference is production, not potential.
Google, by contrast, sampled the cohort already using GenAI in production, not just dabbling. Within that population, 74% say they’re seeing ROI on at least one use case; among agentic early adopters, that share jumps to 88%. This is not a contradiction of MIT — it’s the flip side: what happens after you cross the chasm from chatty demos to systems that plan, call tools, and close loops.
Agentic AI — the boring definition, not the hype — means software that can decide and do things: reason over goals, orchestrate steps, call enterprise APIs, and hand off or escalate under human guardrails. Google reports over half (52%) of organizations using GenAI now also leverage agents, and 39% have more than 10 agents in production. That tells you where the ROI lives: not in prompts, but in process.
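To make that boring definition concrete, here is a minimal sketch of the decide-and-do loop in Python. Everything in it is a hypothetical placeholder, the Step shape, the TOOLS registry, and the escalate_to_human guardrail included; in a real deployment the plan would come from a model reasoning over the goal, and the tools would be governed enterprise APIs.

```python
# Minimal sketch of an agentic loop: plan, call tools, escalate under guardrails.
# All names here (Step, TOOLS, escalate_to_human) are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    tool: str             # which enterprise API to call
    args: dict            # arguments for that call
    needs_approval: bool  # guardrail: must a human sign off first?

# Registry of governed enterprise actions the agent may invoke (stubs here).
TOOLS: dict[str, Callable[..., str]] = {
    "create_ticket": lambda **a: f"ticket created: {a}",
    "post_journal_entry": lambda **a: f"journal entry posted: {a}",
}

def escalate_to_human(step: Step) -> bool:
    """Placeholder guardrail: route the step to a reviewer and await approval."""
    print(f"escalating for approval: {step.tool} {step.args}")
    return True  # pretend the reviewer approved

def run_agent(goal: str, plan: list[Step]) -> None:
    """Execute a plan step by step: act where allowed, escalate where required."""
    print(f"goal: {goal}")
    for step in plan:
        if step.needs_approval and not escalate_to_human(step):
            print(f"skipped (rejected by reviewer): {step.tool}")
            continue
        print(TOOLS[step.tool](**step.args))

# In a real system the plan would be produced by the model; here it is hard-coded.
run_agent(
    goal="reconcile a flagged supplier invoice",
    plan=[
        Step("create_ticket", {"summary": "invoice mismatch #A-102"}, needs_approval=False),
        Step("post_journal_entry", {"amount": 420.00, "account": "accruals"}, needs_approval=True),
    ],
)
```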
So the debate we should be having is not “Is AI overhyped?” but “What separates the 5% from the 95%?” Three patterns show up across both reports and in real deployments:
First, pilots that never touch source-of-truth systems (ERP, ticketing, CRM, policy engines) can’t move numbers that finance will recognize. Google’s methodology makes this plain: its dataset covers production users, which naturally skews toward measurable return; MIT’s lens includes the swamp of experiments that never had a chance to earn. If you want CFO-grade proof, you must plug agents into the stack finance already audits.
Second, where ROI shows up, agents have access to tools and data under governance: they open cases, post journal entries, create tickets, update records, and follow policy. Google’s own “what works” sections emphasize secure access to internal systems and governance first; performance follows. MIT’s 5% are, in essence, those who did exactly this (a minimal sketch of such a policy gate follows after these three patterns).
Third, the companies reporting returns tend to have strong C-suite sponsorship and a clear definition of value, whether speed, accuracy, cost, or revenue, instrumented in advance. Unsurprisingly, organizations with comprehensive executive alignment are far likelier to see ROI. That’s not a platitude; it’s the difference between a science fair and a factory.
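As promised above, here is what “tools and data under governance” can look like at the smallest scale: a policy gate that sits between the agent and the system of record, and writes every decision to an audit line. The POLICY table, rule fields, and function names are illustrative assumptions, not any real policy engine’s API.

```python
# Illustrative sketch: every agent action passes a policy gate before it reaches
# the system of record. The rules and names here are hypothetical.

POLICY = {
    "post_journal_entry": {"max_amount": 1_000.00, "allowed_accounts": {"accruals", "expenses"}},
}

def policy_allows(action: str, args: dict) -> tuple[bool, str]:
    """Check an action against policy; return (allowed, reason) for the audit trail."""
    rule = POLICY.get(action)
    if rule is None:
        return False, f"no policy defined for {action}"
    if args.get("amount", 0) > rule["max_amount"]:
        return False, "amount exceeds agent limit; escalate to a human approver"
    if args.get("account") not in rule["allowed_accounts"]:
        return False, f"account {args.get('account')!r} not permitted for agents"
    return True, "within policy"

def gated_call(action: str, args: dict) -> None:
    allowed, reason = policy_allows(action, args)
    print(f"AUDIT {action} {args} -> {'ALLOW' if allowed else 'DENY'} ({reason})")
    # Only on ALLOW would the real ERP/ticketing API be invoked here.

gated_call("post_journal_entry", {"amount": 420.00, "account": "accruals"})    # allowed
gated_call("post_journal_entry", {"amount": 5_000.00, "account": "accruals"})  # denied: over limit
```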
This is why the “AI bubble” framing misses the point. The data don’t say “AI doesn’t work.” They say AI that isn’t wired into work doesn’t work. If you evaluate language models as knowledge toys, you’ll get toy results. If you treat agents as transaction participants—with identity, policy, and commitments — you get operational leverage.
Finance operations are instructive. They’re structured, policy-heavy, and already instrumented for controls — perfect terrain for agentic systems to prove themselves without inviting existential risk. In practice, the early patterns are emerging in four everyday workflows: purchasing, travel, expenses, and payments.
These aren’t moonshots; they’re narrow, measurable, and auditable — precisely why they move the P&L needle.
Meanwhile, budgets are consolidating around what works. As AI infra costs fall, overall spend is still rising, often via reallocation from non-AI budgets, with a mean 26% of total IT spend now pointed at AI. That capital will keep chasing use cases that clear the ROI bar, i.e., agentic automations tied to governed systems.
So what’s the contrarian take leaders should champion?
The wrong metric is “number of proofs of concept.” The right metric is “rate of closed-loop automations per quarter that meet policy and pass audit,” plus the dollars attached. That’s how you collapse the perceived gap between MIT’s reality check and Google’s optimism.
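One way to keep that metric honest is to make it computable from your own automation log. A toy sketch, with an invented record shape:

```python
# Toy computation of the proposed KPI: closed-loop automations per quarter that
# meet policy and pass audit, plus the dollars attached. The fields are invented.
automations = [
    {"name": "invoice matching", "quarter": "2025-Q3", "closed_loop": True,  "passed_audit": True,  "annual_savings": 120_000},
    {"name": "travel approvals", "quarter": "2025-Q3", "closed_loop": True,  "passed_audit": False, "annual_savings": 45_000},
    {"name": "chat summaries",   "quarter": "2025-Q3", "closed_loop": False, "passed_audit": True,  "annual_savings": 0},
]

qualifying = [a for a in automations
              if a["quarter"] == "2025-Q3" and a["closed_loop"] and a["passed_audit"]]
print(f"closed-loop, audited automations this quarter: {len(qualifying)}")
print(f"dollars attached: ${sum(a['annual_savings'] for a in qualifying):,}")
```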
You need a runbook: intent hardening (translate human asks into unambiguous, policy-aware plans), idempotent actions (safe retries without double-spend), rollback semantics (compensations that unwind bad sequences), and observability (trace every tool call and decision). Don’t worry about grand “AI strategies” until you can ship, roll back, and measure an agent the way you do a microservice. (Google’s guidance is blunt here: give agents governed access to enterprise systems and write the rulebook early.)
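For the mechanics, here is a compressed sketch of three of those four runbook items, idempotent actions, rollback semantics, and observability, in plain Python. It is a toy under stated assumptions (the trace list, the idempotency ledger, and the simulated gateway failure are all invented), not Google’s guidance or any product’s API.

```python
# Toy sketch of three runbook mechanics: idempotent actions, rollback semantics,
# and observability. All names and structures here are illustrative assumptions.
import uuid

TRACE: list[dict] = []           # observability: every tool call and decision lands here
_COMPLETED: dict[str, str] = {}  # idempotency ledger: key -> result, so retries are safe

def record(event: str, **details) -> None:
    TRACE.append({"event": event, **details})

def idempotent_action(key: str, action: str) -> str:
    """Run an action at most once per key; a retry returns the prior result."""
    if key in _COMPLETED:
        record("retry_deduplicated", key=key, action=action)
        return _COMPLETED[key]
    result = f"{action} done ({key})"  # stand-in for the real API call
    _COMPLETED[key] = result
    record("action_executed", key=key, action=action)
    return result

def run_with_rollback(steps: list[tuple[str, str]]) -> None:
    """Execute (action, compensation) pairs; on failure, unwind in reverse order."""
    done: list[str] = []
    try:
        for action, compensation in steps:
            idempotent_action(str(uuid.uuid4()), action)
            done.append(compensation)
            if action == "capture payment":  # simulate a mid-sequence failure
                raise RuntimeError("payment gateway timeout")
    except RuntimeError as err:
        record("failure", error=str(err))
        for compensation in reversed(done):  # rollback: apply compensations
            record("compensation_applied", action=compensation)

key = "invoice-A-102"  # in production, derive the key from the business object
print(idempotent_action(key, "reserve budget"))
print(idempotent_action(key, "reserve budget"))  # retry: deduplicated, same result back
run_with_rollback([
    ("reserve budget", "release budget"),
    ("capture payment", "refund payment"),
])
for entry in TRACE:
    print(entry)
```

Intent hardening is the one piece that resists a ten-line sketch; it lives in how you translate a human request into the unambiguous (action, compensation) pairs above.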
Anchor ROI in operations, not imagination.
Google’s data show ROI clusters around five areas — productivity, customer experience, business growth, marketing, and security — with rapid time-to-production when use cases are repeatable and data are reachable. Security, notably, is emerging as a first-class agentic domain because the work is event-driven and tool-heavy. That’s not sexy—but it’s bankable.
If your board quotes MIT’s 95%, ask: “How many of those efforts were truly in production?” If your vendor quotes 74% ROI, ask: “Were non-production users included?” These aren’t quibbles; they’re entirely different universes. Google’s own methodology limits claims to organizations using GenAI in production — hence the sunnier numbers. Both truths can coexist; your job is to move from one sample to the other.
In short, the paradox dissolves once you see the dividing line. Most pilots fail; many production deployments pay off. The path from the first to the second is not model magic but systems engineering plus governance. That’s what the 5% already know, and what the 95% must learn fast.
The media will keep chasing the binary — “AI boom!” vs. “AI bust!”—because binaries make good headlines. The better headline for operators is this: AI returns are a function of agency and integration. If your agents can’t call the systems that move money, manage risk, or serve customers, they can’t move your numbers. If they can, they will.
Leaders don’t need another debate about hype. They need a factory for turning intents into actions — and a P&L that notices. Build that, and you won’t have to argue with either report. You’ll be living in the dataset that wins.
Georgi Ivanov is a former CFO turned marketing and communications strategist who now leads brand strategy and AI thought leadership at Payhawk, blending deep financial expertise with forward-looking storytelling.