Why 95% of AI projects fail — and why building in-house makes it worse
MIT found 95% of enterprise AI pilots show no P&L impact; RAND puts AI failure above 80%, twice the non-AI IT rate. The problem isn't the model. It's the build, and building in-house makes it worse.
A COO we talked to this spring had watched a flawless AI demo in March. An internal prototype read their customers’ contracts, pulled the key terms, answered questions about them. The board saw it. Everyone agreed on the obvious move: build it in-house, own the capability. By November the prototype was still a prototype. Two AI-engineer reqs had been open for five months. The pilot that was supposed to save the operations team ten hours a week had quietly been shelved, and roughly six figures of budget had gone with it.
Nothing about that story is unusual. It is close to the median outcome.
Most company AI projects fail. Not in the demo — in production. And the instinct that follows the demo, “this is core, let’s build it ourselves,” is the instinct that makes the failure more likely, not less. The numbers on that are now clear enough to plan around, so let’s go through them, and then through what actually ships.
1. The speed you saw in the demo is not the speed to production
The speed is real in the demo and misleading in production. That gap is the whole problem, and it is measurable.
GitHub’s own controlled study is the number everyone remembers. Developers using Copilot finished a task 55% faster than developers without it — one hour eleven minutes against two hours forty-one. The result was statistically significant. It is a real effect, and it is also a clean, greenfield task — build an HTTP server from scratch — done in a lab with nothing else going on. That is the demo, and the demo is genuinely fast.
The upside underneath the hype is real too, which is what makes the demo so seductive. PwC’s 2025 analysis of close to a billion job ads found productivity growth has nearly quadrupled in the industries most exposed to AI, and revenue per employee in those sectors grew 27%, about three times the growth in the least AI-exposed ones. The instinct after the demo is rational: there is real money here, move now, own it. The mistake is not the urgency. It is assuming the demo’s speed carries into production.
Now the production number. In 2025, METR ran a randomized controlled trial with sixteen experienced open-source developers working on mature repositories they had contributed to for years — projects averaging over a million lines of code and more than twenty thousand GitHub stars. With AI tools allowed, the developers were 19% slower. Not faster. Slower, on exactly the kind of large, real, owned codebase where production work actually happens.
The part that should hold a budget-writer’s attention is what the same developers believed. Before the study, they forecast AI would speed them up by 24%. After being measured going slower, they still estimated AI had sped them up by 20%. The felt sense of velocity and the measured fact of velocity pointed in opposite directions: the forecast missed the measured result by 43 points (+24% against -19%), and the after-the-fact estimate still missed by 39 (+20% against -19%). People are bad judges of whether AI is making them faster, and the error runs optimistic, which is precisely the bias that turns a great demo into a greenlit in-house build.
METR is careful about scope: the result holds for experienced developers on codebases they know well, not for every task everywhere. But that scope is exactly your scope when you build in-house — senior people, your own systems, real stakes. It is the demo’s clean-room scope that is the outlier, not METR’s.
This is not an argument against AI tooling. It is an argument about where the speed lives. Google’s 2024 DORA report, drawn from tens of thousands of technology professionals, found that a 25% increase in AI adoption was associated with an estimated 1.5% drop in delivery throughput and a 7.2% drop in delivery stability — even though about three-quarters of developers said AI made them more productive. The report’s read was blunt: AI lifts individual output, but it does not fix, and can quietly erode, the fundamentals that get software shipped reliably — small batch sizes, strong testing, real review.
So the thing that looked fast in March is the thing that stalls by November. The demo compresses the easy 80%. The hard 20% — the part that decides whether anything reaches production — is untouched by the speed you saw.
2. Adoption is easy. Production is the wall.
Most AI projects die in the stretch between a working demo and a system that touches the P&L. Adoption is no longer the hard part; production is.
Start with the number that traveled the furthest. MIT’s NANDA initiative studied enterprise generative AI in 2025 and found that 95% of pilots delivered no measurable impact on the P&L. Companies had poured an estimated thirty to forty billion dollars into generative AI, and roughly 5% of it produced real operational or financial return. One honest caveat, because the receipt test cuts both ways: that 95% figure drew methodology pushback — the sample, the definitions, what “failure” even meant — so it should not be carried alone.
It does not have to be. RAND interviewed sixty-five data scientists and engineers with five-plus years building models in industry and academia, and concluded that more than 80% of AI projects fail — about twice the failure rate of IT projects that don’t involve AI. Gartner predicted that at least 30% of generative AI projects would be abandoned after the proof of concept by the end of 2025, citing poor data quality, weak risk controls, escalating costs, and unclear business value. And McKinsey’s 2025 State of AI found that only 21% of organizations using generative AI had redesigned any workflow around it; the rest layered AI on top of how work already ran. Thirty-nine percent reported any EBIT impact at all, and most of those put it under 5%. Four different research shops, four different methods, one consistent shape.
The McKinsey split is the sharpest of the four. Only about 6% of organizations qualify as high performers capturing real value from AI, and what sets them apart is not a better model. It is that they redesigned the work around AI instead of bolting it onto the old process, which is what roughly four-fifths of organizations did. Pilot purgatory is expensive in a way that never lands on a single line: the budget, yes, but also the quarters lost, the team’s confidence spent, and the board’s patience, which does not refill on demand.
The common thread is not model quality. RAND’s named root causes were leadership misreading what AI can do, data foundations that were not ready, and a shortage of the engineers who build this work. MIT’s read was that the tools failed not because the models were weak but because they did not integrate, did not learn, and did not fit the actual workflow. Both point at the same place: the failure is everything around the model.
That is the wall, and it is exactly the part a demo is built to hide. A demo proves the model can do the thing once, on a clean input, with a human steering. Production is the same thing done ten thousand times, on bad inputs, inside your real systems, wired to the rest of your stack, without a person babysitting each call. Crossing from the first to the second is not a model problem you solve by waiting for a smarter model. It is an engineering problem, and it is the one nearly everyone underestimates — which is why the pilot stalls and the budget evaporates.
3. Your developers are good. They’re just new to shipping AI.
Writing AI code and shipping reliable AI are different skills, and almost no one has years of the second one yet, because the second one has only existed at this level for a couple of years.
Look at how the people doing the work actually feel about it. In Stack Overflow’s 2025 survey, 84% of developers use or plan to use AI tools — adoption is settled. But only 3.1% say they “highly trust” the accuracy of what those tools produce, and more developers actively distrust AI accuracy (45.7%) than trust it (32.7%). The most experienced developers are the most skeptical of all — the ones with production accountability trust it least. The single biggest frustration, named by 66%, is AI output that is “almost right, but not quite.” The second, named by 45%, is that debugging AI-generated code takes more time than expected.
“Almost right, but not quite” is the production gap in one phrase. A generalist or a junior ships the almost-right answer because it looks right, and looking right is what AI is extraordinary at. Catching almost-right — knowing the model will be confidently wrong and building the machinery that flags it before a customer sees it — is a senior, production-AI skill, and it is rare.
The survey’s most telling answer is downstream of this: asked when they would still turn to a human in a future full of capable AI, 75% of developers said “when I don’t trust the AI’s answer.” Trust, not raw capability, is the binding constraint, and trust is earned by the system wrapped around the model. The almost-right answer carries a hidden tax — every output has to be re-verified by someone who can spot the small fraction that is subtly wrong — and if no one on the team has the reps to spot it, the tax goes unpaid until it surfaces in production as a number that was confidently, plausibly false.
The maintainability data points the same way. GitClear analyzed 211 million changed lines of code from 2020 to 2024 and found copy-pasted (cloned) code rose from 8.3% of changes in 2021 to 12.3% in 2024, while refactored “moved” code fell from 25% to under 10%. GitClear summarizes the trend as roughly fourfold growth in code clones; that headline figure is not the same measure as the line share above, which rose by about half. More code, shipped faster, with the long-term health of the codebase quietly bending the wrong way. Cloned code is where defects breed, and a team moving fast on AI output without the reps to refactor as they go is accruing a debt that comes due in production.
This is the same trust gap we have mapped vertical by vertical — real-estate agents who adopted AI everywhere except a price, mortgage shops whose AI stalls at the underwriting decision. In both, the fix was never a bigger model. It was engineered trust: anti-fabrication rules so the system never invents a value, confidence labeling so it reports how sure it is, abstention so it hands off when it is not sure, and a review pass that assumes the output is wrong until proven otherwise. None of that lives in the model. All of it lives in the reps — and the reps are scarce precisely because production AI is young.
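The three guards named above (anti-fabrication, confidence labeling, abstention) can be sketched in a few lines. This is a minimal illustration, not the implementation from any of the engagements described here: the `guard_field` function, the `FieldResult` shape, and both thresholds are hypothetical, and the confidence score is assumed to come from whatever extraction step sits upstream.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; in practice they would be tuned against an eval set.
ABSTAIN_BELOW = 0.80
HIGH_AT_OR_ABOVE = 0.95


@dataclass
class FieldResult:
    name: str
    value: Optional[str]  # None means the system abstained
    confidence: float     # 0.0 to 1.0, as reported by the extraction step
    label: str            # "high", "medium", or "abstained"
    reason: str


def guard_field(name: str, candidate: Optional[str], confidence: float,
                source_text: str) -> FieldResult:
    """Apply anti-fabrication, abstention, and confidence labeling to one field."""
    # Anti-fabrication: a value that does not appear in the source is never returned.
    if candidate is None or candidate not in source_text:
        return FieldResult(name, None, 0.0, "abstained", "value not found in source")
    # Abstention: below the threshold, hand off to a human instead of guessing.
    if confidence < ABSTAIN_BELOW:
        return FieldResult(name, None, confidence, "abstained",
                           "confidence below threshold")
    # Confidence labeling: the caller always sees how sure the system is.
    label = "high" if confidence >= HIGH_AT_OR_ABOVE else "medium"
    return FieldResult(name, candidate, confidence, label, "found in source")
```

The exact-substring check is the crudest possible anti-fabrication rule; a real system would normalize formats first. The point of the sketch is only that every branch lives outside the model.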
4. You can’t hire your way across the gap fast enough
The engineers who have those reps are the scarcest, most expensive, and slowest-to-hire people in the market right now. The build-it-in-house plan runs straight into a labor market designed to defeat it.
PwC analyzed close to a billion job ads for its 2025 Global AI Jobs Barometer. Workers with AI skills command a 56% wage premium — more than double the 25% premium a year earlier. The skills employers want in AI-exposed roles are changing 66% faster than elsewhere, and postings for AI-skilled roles grew 7.5% even as total job postings fell 11.3%. Demand is climbing while the definition of the job moves under everyone’s feet.
The price follows. The median total compensation for a machine-learning or AI software engineer in the US is $244,500 (Levels.fyi) — and that is the median, across everyone with the title. For the people who have actually shipped reliable AI into production, the number that pries them off a current team is higher, and you are bidding against every company that read the same headlines you did. Then there is time: months to fill the seat, more months to ramp, and you do not need one hire but a small cluster — someone to build, someone who owns the data, someone who reviews — because a single engineer is a single point of failure on work this unforgiving. RAND named the talent shortage as a cause of AI project failure; the in-house plan asks you to hire your way out of the exact gap that sinks projects, in the tightest part of the market, on the clock.
Add it up honestly. A senior in-house AI hire is a quarter-million dollars a year and up, all-in, before they ship a line, on a four-to-six-month hire-and-ramp, with key-person risk baked in and a real chance the first hire is not the right one. That is the true cost on the “build” side of the ledger — not the model, which is cheap, but the people who can make it trustworthy, who are not.
And one hire is rarely the unit of work. Building, owning the data, and reviewing are three competencies that occasionally live in one rare person and usually do not. So the honest twelve-month math on the in-house path covers a senior hire plus a supporting engineer, the same four-to-six months before the first thing ships, and the opportunity cost of every quarter that passes while the work that was supposed to start this quarter still has not. The model is the cheapest line on that budget. Everything expensive is human, scarce, and slow.
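The twelve-month math can be run as a rough sketch. Only the senior figure comes from the text ("a quarter-million dollars a year and up"); the supporting engineer's cost, the split of the hire-and-ramp window into three months hiring and three months ramping, and the assumption that both seats fill at the same time are placeholders to swap for your own numbers.

```python
from dataclasses import dataclass


@dataclass
class InHousePlan:
    annual_costs: list[float]  # all-in annual cost per seat, in dollars
    hire_months: int           # months with the seat open and no salary paid
    ramp_months: int           # months on payroll before the first thing ships

    def twelve_month_view(self) -> dict[str, float]:
        annual = sum(self.annual_costs)
        paid_months = max(0, 12 - self.hire_months)
        ramp_paid = min(self.ramp_months, paid_months)
        return {
            "payroll_in_year_one": annual * paid_months / 12,
            "payroll_before_first_ship": annual * ramp_paid / 12,
            "productive_months": max(0, 12 - self.hire_months - self.ramp_months),
        }


# 250_000 is from the text; 150_000 and the 3 + 3 month split are assumptions.
plan = InHousePlan(annual_costs=[250_000, 150_000], hire_months=3, ramp_months=3)
view = plan.twelve_month_view()
for key, value in view.items():
    print(f"{key}: {value:,.0f}")
```

Under those assumed inputs, $100,000 of payroll is spent before the first thing ships, and only six of the twelve months are productive. The sketch leaves out recruiting fees and the cost of a wrong first hire, both of which push the number up.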
5. The data favors partnering — and here’s what good looks like
The same MIT study that found 95% of pilots fail also found the way out, and it is the opposite of the demo-day instinct.
In MIT’s data, AI tools bought from specialized vendors or built through partnerships succeeded about 67% of the time. Internal builds succeeded roughly a third as often, which works out to about 22%. Call it three to one in favor of not building it yourself. That is not a marketing line from an agency; it is the finding sitting next to the famous failure number in the same report, and it lines up with everything above. Buying or partnering works more often because it imports the reps (the evals, the abstention logic, the anti-fabrication discipline, the review process) on day one instead of month six, from people who have already failed and fixed these systems somewhere else, on someone else’s budget.
There is no magic in it. A team that has shipped reliable AI across a dozen systems has already met the failure modes a first-time in-house build meets for the first time in production — the source data that disagrees with itself, the model that fabricates under load, the prompt change that silently degrades accuracy, the edge case that was not in the demo. They built the evals, the abstention thresholds, and the review discipline once, somewhere else, and they carry it in. Buying or partnering is not buying a model. It is buying the scar tissue.
What does “good” look like on the partner side? Receipts, not adjectives. A few from production work:
A deterministic eligibility engine for a multi-program home equity platform (HEI, HEA, and sale-leaseback) — five production calculators, the decision logic in versioned code a human can read and audit, and zero language models in the decision path. The model preps and extracts; a versioned rules table makes the call; every decision writes one audit row. AI sits exactly where it is reliable and stays out of where it is not.
A HubSpot-replacement CRM for a 100k-user real-estate platform — 101 data models, a 3,728-line schema, role-based access, real-time push — so every AI surface reads one clean source of truth instead of six systems that quietly disagree. Most AI gives a confident answer built on conflicting inputs; the fix is upstream, in the data, not in the prompt.
A real-estate voice AI dialer running on Retell, Claude, and Deepgram, with script-version drift detection and automated transcript scoring, holding the false-positive rate on production enrichment — the fields it marked high-confidence and got wrong — at around 0.4% over the trailing window. That number is the whole game. It is the line between an operator who trusts the output and one who re-checks every field by hand.
And, for the question of whether fast and rigorous can coexist: an SMS and email automation platform shipped with 1,096 migrations, 298 edge functions, and 92% test coverage. Velocity and discipline are not a trade-off when the people doing the work have done it before.
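The first item in that list, a versioned rules table that makes the call and writes one audit row per decision, has a shape worth seeing. What follows is a minimal sketch of the pattern, not the production engine: the thresholds, field names, `RULES_VERSION` string, and the `decide` function are all invented for illustration, and the audit log is a plain list standing in for a database table.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

RULES_VERSION = "hypothetical-v1"

# Hypothetical rules table. Every threshold here is made up for the example.
RULES = {
    "HEI": {"min_credit_score": 620, "max_cltv": 0.75},
    "HEA": {"min_credit_score": 600, "max_cltv": 0.80},
}


@dataclass(frozen=True)
class Applicant:
    credit_score: int
    home_value: float
    existing_liens: float
    requested_amount: float


def decide(program: str, applicant: Applicant, audit_log: list[str]) -> bool:
    """Make a deterministic eligibility call and append exactly one audit row."""
    rule = RULES[program]
    cltv = (applicant.existing_liens + applicant.requested_amount) / applicant.home_value
    failures = []
    if applicant.credit_score < rule["min_credit_score"]:
        failures.append("credit_score_below_minimum")
    if cltv > rule["max_cltv"]:
        failures.append("cltv_above_maximum")
    eligible = not failures
    # The audit row records inputs, rule version, and outcome, so a human can replay it.
    audit_log.append(json.dumps({
        "at": datetime.now(timezone.utc).isoformat(),
        "rules_version": RULES_VERSION,
        "program": program,
        "inputs": asdict(applicant),
        "eligible": eligible,
        "failures": failures,
    }))
    return eligible
```

No model is called anywhere in `decide`. In this pattern a model may fill in the `Applicant` fields from documents upstream, but the decision itself is code a reviewer can read line by line.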
Underneath all of it is the same reliability spine, and it is the actual “why us.” Anti-fabrication, so the system never invents a value it cannot find. Confidence labeling with abstention, so it says when it does not know instead of guessing with a straight face. And a structured review pass that runs every change past seven specialist lenses — correctness, security, performance, and four more — before it ships. That review is what catches the “almost right” before a customer ever sees it. The point of the whole apparatus is one thing: AI you can put in front of a customer, not a demo you can put in front of a board.
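Confidence labels only earn trust if someone checks them, which is what the roughly 0.4% figure for the voice dialer measures: of the fields marked high-confidence, how many were wrong. A minimal sketch of that calculation follows, assuming each reviewed field is a dict with a `label` and a reviewer-supplied `correct` flag. Both key names are hypothetical.

```python
from typing import Optional


def high_confidence_false_positive_rate(records: list[dict]) -> Optional[float]:
    """Share of fields labeled high-confidence that a reviewer found to be wrong."""
    high = [r for r in records if r["label"] == "high"]
    if not high:
        return None  # nothing was marked high-confidence, so the rate is undefined
    wrong = sum(1 for r in high if not r["correct"])
    return wrong / len(high)
```

Fields the system abstained on, or labeled lower, are deliberately excluded. The metric asks one question only: when the system said it was sure, how often was it wrong?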
The shape of the engagement matters as much as the skill. A fractional team that has shipped before can start narrow — one workflow, scoped to a few weeks, with a number it has to move — and either earn the next scope or get cut cheaply. That is the opposite of the in-house bet, which commits to salaries and a roadmap before the first line ships and is painful to unwind when it stalls. Start small, measure against the P&L, expand only what works: the cadence is itself a way to dodge the 95%.
6. When you should build in-house — and how to partner so you’re not dependent
Partnering is the right default, not a permanent law. Here is the honest line, because pretending otherwise would fail the receipt test.
Build in-house when AI is your core product and your durable advantage — when the model and the system around it are the thing customers pay for, you should own that outright over time. Build when you already employ a senior engineer who has shipped reliable production AI and can hire and lead a team around that experience. Build when your horizon is measured in years and you are willing to fund the ramp and eat the early failures as tuition. Those are real cases, and for them the in-house path is correct even though it is slower and riskier at the start.
For most teams reading this, none of those hold yet, and the data says partner. You need it in production this quarter, not next year. The capability is internal plumbing, not your moat. You cannot staff the reps in time. When that is the situation, here is the order of operations that keeps a partnership from becoming a dependency:
First, pick one high-friction, measurable workflow — not “add AI,” but a specific bottleneck with a number attached, like a queue that eats ten hours a week or a cycle time you can quote. Second, demand production criteria from day one: evals, confidence labeling, abstention thresholds, anti-fabrication rules, and a review pass. If a partner leads with a slick demo and has no answer on those, that is the tell, and it is the same gap that sank the 95%. Third, measure to the P&L, not to vibes — hours recovered, files cleared, cycle time cut — so the thing either earns its keep or gets killed early. Fourth, structure the work to transfer: documentation, runbooks, your engineers in the loop, the system handed over in a state your team can run. The right engagement ships and leaves capability behind. You partner to get into production now and to learn how it is done, not to rent the same thing forever.
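"Evals" in the second step can start very small. One possible shape is a release gate: a fixed set of question and expected-answer pairs, scored on every change, that blocks the release if accuracy drops below a recorded baseline. The sketch below assumes exact-match scoring and a one-point tolerance, both of which are simplifications, and `predict` stands in for whatever function wraps the system under test.

```python
from typing import Callable


def run_eval(predict: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of (question, expected) pairs that predict gets exactly right."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    correct = sum(1 for question, expected in cases if predict(question) == expected)
    return correct / len(cases)


def release_gate(predict: Callable[[str], str], cases: list[tuple[str, str]],
                 baseline_accuracy: float, max_drop: float = 0.01) -> tuple[bool, float]:
    """Pass only if accuracy stays within max_drop of the recorded baseline."""
    accuracy = run_eval(predict, cases)
    return accuracy >= baseline_accuracy - max_drop, accuracy
```

A gate like this is what turns a prompt change that silently degrades accuracy into a failed check instead of a production incident. A partner who cannot show the equivalent for their own system has not answered the question.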
The pattern that most often wins is a hybrid, sequenced over time: partner to get into production and import the reps now, then hire to own the capability once it is proven and core enough to justify a permanent team. In that order the partner de-risks the build and trains the buyer, and the in-house team inherits a working system and a runbook instead of a blank repository and a deadline. Build-versus-buy is rarely permanent. It is a question of sequence — and the sequence that fails least starts with the people who have done it before.
What broke
A client came to us with an in-house build that demoed beautifully. It was an assistant that answered questions about their customers’ accounts by reading across their internal systems, and on stage it was sharp.
In production it had a quiet habit. When two of its source systems disagreed about a number (which they did constantly, because the data lived in six places that drifted apart overnight), the assistant picked one, stated it with full confidence, and never flagged the conflict. On a sample of real queries, it was confidently wrong on a meaningful share of exactly the cases where the underlying data conflicted. Nobody had caught it, because it never once said “I’m not sure.” A confident wrong answer is the one a human waves through, and the client’s team had been waving these through for months.
The fix was not a better model. We rebuilt it to read one consolidated source of truth, to cite the record behind every number it returned, and to abstain when the sources disagreed instead of guessing. The accuracy came from the system around the model and from a review pass whose entire job was to assume the output was wrong until it proved otherwise. That is the work the demo skips, the work that is hard to hire for, and the work that decides whether AI reaches production at all.
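The abstain-on-conflict rule in that rebuild is simple enough to sketch. This is an illustration of the idea, not the client's code: `SourceRecord`, `Answer`, and `answer_from_sources` are hypothetical names, and the numeric tolerance is an assumption about what counts as disagreement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceRecord:
    system: str      # which internal system the value came from
    record_id: str   # the record a human can open to verify it
    value: float


@dataclass
class Answer:
    value: Optional[float]  # None means the assistant abstained
    citations: list[str]
    note: str


def answer_from_sources(records: list[SourceRecord], tolerance: float = 0.0) -> Answer:
    """Return a cited value only when every source agrees within tolerance."""
    if not records:
        return Answer(None, [], "no source record found")
    citations = [f"{r.system}:{r.record_id}" for r in records]
    values = [r.value for r in records]
    if max(values) - min(values) > tolerance:
        # Abstain and surface the conflict instead of silently picking one value.
        return Answer(None, citations, "sources disagree; needs human review")
    return Answer(values[0], citations, "sources agree")
```

With a consolidated source of truth most queries never hit the conflict branch at all. The branch exists so that when they do, the assistant says so and shows the records behind the disagreement.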
Where this goes
I’d bet a dinner that two years from now, the companies with AI actually in production will not be the ones that hired earliest or spent the most. They will be the ones that shipped a small, reliable thing first — usually with a partner who had done it before — measured it against the P&L, and built outward from a foundation that held. The 95% did not fail because the models were not good enough. They failed because shipping production AI is a craft, the craft takes reps almost no one has yet, and the reps were the part everyone assumed they could skip.
The demo is not the hard part. It was never the hard part. The hard part is the year after the demo, and that is the part worth being deliberate about who you do it with.
If you’re a COO or CTO weighing an in-house AI build against a partner, that decision is the teardown. Book the build-vs-buy teardown → (/book/teardown)
- Why do most AI projects fail?
- Not because the models are weak. MIT's 2025 NANDA research found 95% of enterprise generative AI pilots delivered no measurable P&L impact, and RAND found more than 80% of AI projects fail — twice the rate of IT projects that don't involve AI. The common causes are integration, data readiness, missing evaluation and reliability engineering, and leadership misjudging what AI can do — all the work around the model that a demo skips.
- Do AI coding tools actually make developers faster?
- It depends on the work. On clean, greenfield tasks, dramatically — GitHub's controlled study measured developers 55% faster with Copilot. On large, mature production codebases, often not: a 2025 METR randomized trial found experienced developers were 19% slower with AI tools, while still believing they were faster. AI compresses the prototype; it does not automatically speed up shipping reliable production code.
- How much does it cost to hire an AI engineer in 2026?
- The median total compensation for a US machine-learning or AI software engineer is about $244,500 (Levels.fyi), and AI skills carry a 56% wage premium over the same role without them (PwC) — more than double the premium a year earlier. Add four-plus months to hire and ramp, plus the data and review roles a single engineer can't cover, and the real cost of an in-house build starts well before anyone ships.
- Is it better to build AI in-house or hire an AI-focused agency?
- For most teams, partner first. MIT found AI bought from specialized vendors or built through partnerships succeeded about 67% of the time, versus roughly a third as often for internal builds. Build in-house when AI is your core product moat and you can staff senior production-AI experience. Partner when you need it in production this quarter and want the reliability engineering — evals, confidence labeling, anti-fabrication, structured review — from day one, then structure the work to transfer the capability to your team.