The evals gap pro-services firms are learning the hard way
Two of the world's largest professional-services firms had to pull AI-assisted reports this month after some claims didn't hold up. The interesting question is not who slipped; it's what the rest of us learn from a problem the whole industry is still figuring out.
- #ai
- #governance
- #foundry
- #observability
Two professional-services firms had a tough month. KPMG pulled a report titled “Redefining excellence in the age of agentic AI” after several named organizations told the Financial Times that the report’s descriptions of their AI usage were inaccurate. GPTZero, an AI-detection company, attributed the issues to AI hallucinations. A KPMG spokesperson said the firm was investigating and reaffirmed its guidelines on human oversight for validating AI-assisted content. A few weeks earlier, EY withdrew a separate report on loyalty rewards programs for similar reasons.
It is easy to read those headlines as a gotcha. I don’t think that is the useful read. Both firms publish a huge volume of research, are doing serious work to bring AI into their own delivery, and are well aware of the responsible-AI principles their reports are meant to embody. The honest read is that they ran into the same problem every team shipping AI-assisted content is running into, including teams much smaller and less scrutinized. The reports that get pulled are the ones that happen to land in front of the right reader at the right time. Plenty of others with the same kind of flaw have presumably never been read that closely.
The pattern matters more than any one example. Three things are worth taking seriously.
- The hardest hallucinations are the credible ones. “Organization X uses agent Y to do Z” is exactly the kind of statement a reviewer skims past, because it sounds plausible. The eval that catches this is not “is this output well-written”; it is “is this output independently verifiable, and by whom.” That is a different kind of check, and most teams do not have it wired in yet.
- Policy is not the same thing as pipeline. Almost every organization shipping AI now has a published “responsible AI” guideline. Far fewer have a production workflow that actually fails closed when those guidelines are skipped. The gap between intent and enforcement is where the retractions are being born.
- Volume amplifies the gap. A team publishing one essay a quarter can get away with informal review. A firm publishing dozens of reports a quarter cannot. The same is true of agent deployments: one agent in pilot can be hand-checked; ten agents in production cannot. The verification system has to scale before the output volume does.
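To make the first point concrete, here is a minimal sketch of what an “is this verifiable, and by whom” check could look like. Everything in it is an assumption for illustration: the `Claim` schema, the field names, and the sample claims are hypothetical, and it presumes the factual claims have already been extracted from the draft, which is the hard part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """One factual statement pulled from a draft (hypothetical schema)."""
    text: str
    source_url: Optional[str] = None   # where a reader could check the claim
    verified_by: Optional[str] = None  # the person who actually checked it


def unverified_claims(claims: list[Claim]) -> list[Claim]:
    """Return every claim missing a checkable source or a named verifier."""
    return [c for c in claims if not (c.source_url and c.verified_by)]


# Made-up sample data: one plausible-sounding claim nobody has checked,
# and one that carries both a source and a named verifier.
draft = [
    Claim("Organization X uses agent Y to do Z"),
    Claim("Organization A does B",
          source_url="https://example.com/source",
          verified_by="j.doe"),
]

blocked = unverified_claims(draft)
for claim in blocked:
    print(f"UNVERIFIED: {claim.text}")
```

The design choice worth noticing is that fluency never enters the check. A claim passes only when someone can be named as having verified it against something a reader could also look up.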
This is the gap that the broader Microsoft Foundry observability and evaluation push from Build is trying to close: trace every step an agent takes, evaluate against criteria you define, monitor for drift, and treat the eval set as a production artifact rather than a slide-deck noun. The point is not that any one platform solves it. The point is that “human oversight” is much easier to write than it is to enforce, and the firms that wire enforcement into the workflow (their own, and their clients’) will quietly get fewer of these calls from the FT.
For anyone running an AI program right now, this is the kindest version of a forcing function. Take half a day to write down the eval criteria your published content actually needs to meet, attach them to the workflow that publishes the content, and make the pipeline refuse to ship when they’re not satisfied. The firms in these headlines are sophisticated, well-resourced, and good at what they do. If this can happen to them, it can happen to any of us. It is worth building the guardrail before the guardrail builds itself out of a phone call from a customer.
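As a rough sketch of what “refuse to ship” can mean in practice, the gate below fails closed: a draft goes through only when every attached criterion passes, and a missing or crashing check counts as a failure. The names `gate` and `PublishBlocked` and the two sample criteria are hypothetical placeholders, not any platform’s API; real criteria would be whatever you wrote down in that half-day.

```python
from typing import Callable

# A criterion takes the draft text and returns True only when satisfied.
Criterion = Callable[[str], bool]


class PublishBlocked(Exception):
    """Raised when a draft must not ship."""


def gate(draft: str, criteria: dict[str, Criterion]) -> str:
    """Return the draft only if every criterion passes; otherwise raise."""
    if not criteria:
        # Fail closed: no criteria attached means nothing was checked.
        raise PublishBlocked("no eval criteria attached to this workflow")
    failures = []
    for name, check in criteria.items():
        try:
            passed = check(draft) is True
        except Exception:
            # Fail closed: a check that crashes counts as a failed check.
            passed = False
        if not passed:
            failures.append(name)
    if failures:
        raise PublishBlocked("failed: " + ", ".join(failures))
    return draft


# Hypothetical criteria, for illustration only.
criteria = {
    "has_reviewer_signoff": lambda d: "Reviewed-by:" in d,
    "no_placeholder_text": lambda d: "TODO" not in d,
}
```

Calling `gate(draft, criteria)` as the last step before publishing means a skipped review surfaces as a blocked release rather than as a retraction.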