Insurance AI Hits the Pilot-to-Portfolio Wall

Only 7% of insurance AI initiatives advance beyond the pilot phase, and fewer than half of carriers have deployed AI in even one operational function (BCG; Simplifai, April 2026). The failure pattern is consistent: pilots run on curated data and narrow claim types; portfolio deployment surfaces model drift, leakage, and adverse selection that controlled tests were never designed to detect.

The implementation failures in this pattern usually appear after the pilot metrics have already been celebrated. A model that scores well on a handpicked segment of claims data is not a model that scores well on an entire book. The distance between those two statements is where most carrier AI investment currently sits.

The Gap Between Adoption and Deployment

The industry's AI adoption numbers look expansive. Eighty-eight percent of private-passenger-auto carriers use, plan to use, or are exploring AI or machine learning models in their underwriting and pricing (NAIC survey, 2025). More than 81% of insurance companies dedicate at least $5 million annually to AI initiatives; 14% spend over $50 million per year (Simplifai, April 2026). The spending has continued regardless of deployment results.

"Carriers have run hundreds of pilots, spent real money, and in most cases have little production deployment to show for it," concluded a Carrier Management analysis of the implementation landscape (Carrier Management, June 2026). Production-scale AI remains concentrated in customer service chatbots and document summarization; end-to-end workflow automation in underwriting and claims represents the least common deployment form across the sector. The 88% adoption figure and the fewer-than-half deployment figure describe two genuinely different things: industry exploration and production operation are not on the same continuum.

The 2026 Evident AI Index for Insurance captures the same asymmetry from the capability side. Twenty of the 30 major insurers it scored now publicly report at least one AI use case with documented outcomes, up eight from the prior year. But 49% of those disclosed use cases remain narrow in scope, targeting speed, cost reduction, and process efficiency rather than risk selection or loss quality (Evident Insights, June 2026). A carrier that processes 40% of its personal auto claims without human review saves two to three minutes of adjuster time per claim; it does not improve its combined ratio unless the automated disposition accuracy on those claims matches or exceeds the accuracy of the prior manual process. Establishing that equivalence is a measurement problem, and most carrier implementations have not solved it.

Three Reasons Pilots Overstate Lift

The three mechanisms appear in virtually every carrier AI implementation review and are not difficult to identify in retrospect. They are difficult to avoid in advance because they are embedded in how pilots are designed.

Data selection. Pilots select the cleanest segment of an insurer's data: standardized commercial auto submissions with complete telematics records, water damage claims below a defined severity threshold, personal lines renewals with three or more years of continuous coverage history. That is exactly where AI performs best, and it is not representative of the full portfolio. Complex commercial property, specialty liability, and catastrophe-exposed homeowners bring data heterogeneity, missing fields, and thin training signal. Extending the model from the pilot segment to those classes introduces distributional shift, where the training data no longer reflects the production population and accuracy degrades without any model failure in the conventional sense.

The vendor framing of this problem is often more candid than the carrier framing. "A demo that works against sample data is not the same as a solution that works against a Guidewire implementation with 12 years of customization on top of it," one analysis of the carrier-vendor gap observed (Carrier Management, June 2026). Those customization layers carry the historical data exceptions, legacy field definitions, and non-standard coding decisions that clean demo data never includes.

Volume credibility. A 500-claim pilot does not carry full actuarial credibility. Partial credibility weighting for a pure premium indication on 500 claims of moderate severity assigns roughly 30 to 40 percent weight to the observed data; the remainder goes to the complement of credibility, typically prior or industry experience. A carrier reporting a 10-point loss ratio improvement from a pilot of that size is reporting a result that could fall entirely within the range of statistical noise. The threshold for meaningful AI performance attribution in underwriting -- 3,000 or more exposure units across at least two full policy years -- is not an arbitrary standard. It reflects the credibility weight at which observed experience begins to dominate the complement. If a carrier's pilot measured a 10-point improvement on 500 claims and the complement shows no improvement, partial credibility pulls the indicated improvement to 3 to 4 points, and that is the defensible AI-attributable result. A board presentation based on the raw pilot figure without credibility weighting overstates the likely portfolio effect by a factor of roughly three. Most carrier AI announcements do not disclose the credibility weight assigned to the pilot data.
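The credibility arithmetic above can be sketched in a few lines. This is a minimal illustration using the classical square-root (limited fluctuation) rule, which the article does not name explicitly. The full-credibility standard of 4,000 claims and the zero-improvement complement are assumptions chosen for illustration, not figures from the text.

```python
import math


def credibility_weight(n_claims: float, full_credibility_claims: float) -> float:
    """Square-root partial credibility, capped at 1.0 (full credibility)."""
    return min(1.0, math.sqrt(n_claims / full_credibility_claims))


def credibility_weighted_improvement(observed_points: float,
                                     n_claims: float,
                                     full_credibility_claims: float,
                                     complement_points: float = 0.0) -> float:
    """Blend the observed improvement with the complement of credibility."""
    z = credibility_weight(n_claims, full_credibility_claims)
    return z * observed_points + (1.0 - z) * complement_points


# ASSUMPTION: 4,000 claims for full credibility (hypothetical standard).
# ASSUMPTION: the complement (prior or industry experience) shows 0 points of improvement.
z = credibility_weight(500, 4000)
indicated = credibility_weighted_improvement(10.0, 500, 4000)
print(f"Z = {z:.2f}, indicated improvement = {indicated:.1f} points")
# prints: Z = 0.35, indicated improvement = 3.5 points
```

Under these assumptions a raw 10-point pilot result shrinks to about 3.5 points. A different full-credibility standard or a non-zero complement would move that figure.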

Self-selection. Pilot models are built and tested on business that existing underwriting rules have already accepted. That is a filtered sample, biased toward risks where the carrier has pricing confidence and stable historical data. Results on that population overstate what the model produces when extended to the margins of the acceptable book or to submission categories the current pricing structure has been declining. The attrition behind BCG's finding that only 7% of insurance AI initiatives reach portfolio scale is not random. Models that succeed in the pilot cohort fail disproportionately at extension because the extension encounters exactly the segments excluded from the pilot: the complex commercial submission, the claim with disputed liability, the policy with gaps in historical continuity.

What Portfolio Deployment Actually Encounters

Five dimensions separate the deployment environment from the pilot environment. Defining them explicitly before launch is the operational discipline that separates a credible AI deployment from one designed for announcement.

Dimension	Pilot Condition	Portfolio Condition
Data quality	Curated, complete, standardized	Heterogeneous, missing fields, legacy codes
Volume	200 to 1,000 claims or submissions	Full book, all complexity bands
Model drift	None (static training set)	Continuous; requires defined monitoring cadence
Override tracking	Typically not captured	Must be tracked and analyzed as diagnostic data
Adverse selection	Not present (controlled test)	Present wherever model rejects or re-prices risks

Model drift. A model trained on 2023 to 2024 commercial auto data degrades as litigation trends, repair cost inflation, and driver behavior shift. Performance monitoring requires a defined drift threshold -- the loss-ratio delta at which the model is flagged for retraining or override -- set before deployment and tracked on a quarterly basis. Without a predefined threshold, drift is discovered after it has already affected reserving assumptions on an active book, not before.
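A predefined drift threshold can be expressed as a simple quarterly check. The sketch below is illustrative only: the `QuarterResult` structure, the 3-point threshold, and the sample loss ratios are hypothetical, and it flags deterioration only (actual worse than expected), which is one possible reading of "loss-ratio delta."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterResult:
    quarter: str
    expected_loss_ratio: float  # loss ratio assumed when the model was approved
    actual_loss_ratio: float    # observed loss ratio for the quarter


def flag_drift(results, threshold_points=3.0):
    """Return (quarter, delta) pairs where actual exceeds expected by more
    than threshold_points, measured in loss ratio points."""
    flagged = []
    for r in results:
        delta_points = (r.actual_loss_ratio - r.expected_loss_ratio) * 100.0
        if delta_points > threshold_points:
            flagged.append((r.quarter, round(delta_points, 1)))
    return flagged


# Hypothetical quarterly history; the threshold is fixed before deployment.
history = [
    QuarterResult("2025Q1", 0.62, 0.63),
    QuarterResult("2025Q2", 0.62, 0.64),
    QuarterResult("2025Q3", 0.62, 0.67),
]
print(flag_drift(history))  # prints: [('2025Q3', 5.0)]
```

The value of the check comes less from the code than from committing to `threshold_points` before launch, so that a flagged quarter triggers retraining or override review rather than a debate about what the threshold should have been.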

Override tracking. Underwriters and adjusters routinely override AI recommendations in complex cases. Those override rates are diagnostic data. An adjuster pool overriding 40% of claims AI recommendations is communicating something specific about model accuracy, workflow fit, or training on unrepresentative cases. Leakage analysis should compare closed claim outcomes where the AI recommendation was followed against outcomes where it was overridden, stratified by claim complexity and severity band. Most carrier AI implementations are not capturing this comparison.
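The followed-versus-overridden comparison can be sketched as a stratified summary. The field names (`complexity`, `severity_band`, `overridden`, `paid`) and the sample claims below are hypothetical. Because overrides are not randomly assigned, the output is a diagnostic, not a causal estimate.

```python
from collections import defaultdict


def override_comparison(claims):
    """Summarize closed claims by (complexity, severity band): override rate
    and mean paid amount where the AI recommendation was followed vs. overridden."""
    sums = defaultdict(lambda: {"followed": [0.0, 0], "overridden": [0.0, 0]})
    for c in claims:
        key = (c["complexity"], c["severity_band"])
        bucket = "overridden" if c["overridden"] else "followed"
        sums[key][bucket][0] += c["paid"]  # running total paid
        sums[key][bucket][1] += 1          # running claim count
    report = {}
    for key, b in sums.items():
        f_sum, f_n = b["followed"]
        o_sum, o_n = b["overridden"]
        report[key] = {
            "override_rate": o_n / (f_n + o_n),
            "mean_paid_followed": f_sum / f_n if f_n else None,
            "mean_paid_overridden": o_sum / o_n if o_n else None,
        }
    return report


# Hypothetical closed-claim records.
claims = [
    {"complexity": "low", "severity_band": "A", "overridden": False, "paid": 1000.0},
    {"complexity": "low", "severity_band": "A", "overridden": False, "paid": 1200.0},
    {"complexity": "low", "severity_band": "A", "overridden": True, "paid": 1500.0},
    {"complexity": "high", "severity_band": "C", "overridden": True, "paid": 9000.0},
    {"complexity": "high", "severity_band": "C", "overridden": False, "paid": 12000.0},
]
report = override_comparison(claims)
for stratum, stats in sorted(report.items()):
    print(stratum, stats)
```

Stratifying first matters: a pooled comparison would mix low-severity claims that are rarely overridden with high-severity claims that often are, and the difference in mean paid would mostly reflect that mix.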

Adverse selection. An AI-driven underwriting tightening that improves the loss ratio on the bound book does not exist in isolation. Risks the model rejects or re-prices move to competing carriers or to the state assigned-risk mechanism. If the model's rejection criteria correlate with protected class characteristics, or if the exited risks concentrate in specific territories, the carrier faces adverse selection exposure in retained categories and potential regulatory examination of its declination patterns. The model's governance program is not complete until both are actively monitored. The adverse selection risk compounds with market concentration: a carrier with a 15% market share in a territory that tightens via AI removes a material volume of business from the admitted pool, concentrating residual risk across every carrier that does not tighten simultaneously.

Leakage. Claims AI can suppress leakage -- payments above the true liability -- or introduce it by settling ambiguous cases too quickly on incomplete evidence. The direction depends on model design and training data. Neither outcome appears in pilot metrics unless the pilot specifically compared outcomes across complexity bands and settlement timing categories, with both the metric and the lag definition held constant.

Operational compliance. Forty-four percent of insurance executives report governance or compliance challenges contributed to AI project failure or underperformance; only 24% said their controls could survive an independent audit (Grant Thornton, 2026). The compliance layer includes unfair discrimination testing, adverse action notices required under state bulletins, and the vendor contracts governing who owns and audits the model when a claim decision is challenged in regulatory examination.

Holdout Groups, Baseline Periods, and the Credibility Standard

An actuarially defensible AI implementation test requires three design decisions made before the model enters production.

Define the counterfactual. The counterfactual is what would have happened without the AI model: the prior loss ratio on the segment, the prior close rate, the prior expense per claim. It requires a clean baseline period covering the twelve to twenty-four months immediately before deployment, with the same mix of risk class, territory, and policy limits as the deployment period. If the carrier changed its pricing structure, shifted distribution channels, or tightened its appetite during that baseline window, those changes must be controlled before attributing any outcome difference to the model.
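One way to hold the mix constant is to reweight baseline segment loss ratios by the deployment period's premium distribution. The sketch below assumes segments are keyed by a hypothetical "class/territory/limit" label and that earned premium is the weighting basis. Neither choice comes from the article.

```python
def mix_adjusted_baseline(baseline_loss_ratio_by_segment,
                          deployment_premium_by_segment):
    """Baseline loss ratio restated on the deployment period's segment mix.

    Each segment's baseline loss ratio is weighted by that segment's
    deployment-period earned premium.
    """
    total_premium = sum(deployment_premium_by_segment.values())
    if total_premium <= 0:
        raise ValueError("deployment premium must be positive")
    missing = set(deployment_premium_by_segment) - set(baseline_loss_ratio_by_segment)
    if missing:
        # A segment with no baseline experience has no counterfactual.
        raise KeyError(f"no baseline experience for segments: {sorted(missing)}")
    weighted = sum(baseline_loss_ratio_by_segment[seg] * premium
                   for seg, premium in deployment_premium_by_segment.items())
    return weighted / total_premium


# Hypothetical segments and figures.
baseline = {"auto/urban/100k": 0.70, "auto/rural/100k": 0.60}
deployment_premium = {"auto/urban/100k": 3_000_000, "auto/rural/100k": 1_000_000}
adjusted = mix_adjusted_baseline(baseline, deployment_premium)
print(f"mix-adjusted baseline loss ratio: {adjusted:.3f}")
# prints: mix-adjusted baseline loss ratio: 0.675
```

Mix adjustment handles shifts in class, territory, and limits only. Rate changes, channel shifts, or appetite changes during the baseline window would need separate on-level adjustments before any comparison is attributed to the model.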

Define the holdout group. A holdout group routes a random sample of eligible cases through existing processes rather than AI-augmented processes during the deployment period. Underwriters and adjusters should not know which submissions or claims are in the holdout; known control cases generate different handling behavior and contaminate the comparison. The minimum viable holdout is 15 to 20 percent of eligible volume, maintained consistently across the observation period. The precise allocation depends on the expected effect size and the target statistical power: for an underwriting AI claiming a 5-point loss ratio improvement, producing 85% statistical power at a two-sided alpha of 0.05 requires roughly 1,200 to 1,500 policy years of holdout experience in a personal lines book with typical variance. Most pilots do not specify a target statistical power before launch, which means the observed result cannot be confirmed or disconfirmed -- it is simply ambiguous.
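The relationship between effect size, power, and holdout volume can be sketched with a normal-approximation sample size formula for a two-sided, two-sample comparison of means with unequal allocation. This is a rough planning aid under stated assumptions, not the calculation behind the figures quoted above. The standard deviation of 0.55 is a placeholder chosen for illustration. Real per-policy loss ratios are heavily skewed, so a production calculation would need an empirical variance and possibly a different test.

```python
import math
from statistics import NormalDist


def holdout_size(effect, sd, holdout_share, alpha=0.05, power=0.85):
    """Holdout units needed to detect `effect` (difference in mean loss ratio)
    given a per-unit standard deviation `sd` and the share of eligible volume
    routed to the holdout. The AI-handled group receives the remaining share."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    treated_per_holdout = (1 - holdout_share) / holdout_share
    # Var(difference) = sd^2 / n_h * (1 + 1 / treated_per_holdout)
    n = ((z_alpha + z_beta) ** 2) * sd ** 2 * (1 + 1 / treated_per_holdout) / effect ** 2
    return math.ceil(n)


# ASSUMPTION: sd = 0.55 per policy year (hypothetical placeholder).
n_holdout = holdout_size(effect=0.05, sd=0.55, holdout_share=0.20)
print(f"holdout policy years needed: {n_holdout}")
```

Fixing `power` and `effect` before launch is the point: the result then either clears the preset bar or does not, rather than being interpreted after the fact.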

Define the observation window. For underwriting AI, the observation window must extend through at least one full policy year for short-tail lines and two years for long-tail lines to develop loss experience to a credibility level that separates model effect from seasonal and cyclical variation. For claims AI, outcomes should be observed at least six months post-closure to capture reopened claims, subrogation recoveries, and late-emerging allocated loss adjustment expense. A 90-day AI improvement on a long-tail liability book is a preliminary signal, not a deployable finding. Treating preliminary signals as confirmed findings is precisely the mechanism by which pilot lift disappears at portfolio scale.

Model Ownership, Vendor Reliance, and the Audit Trail

The governance question most implementations underaddress is model ownership. When a carrier deploys a vendor AI model for claims triage, underwriting scoring, or fraud detection, the regulatory obligation to document the model's inputs, outputs, and governance structure rests with the carrier, not the vendor.

The NAIC Model Bulletin on the Use of AI Systems, adopted in 24 states and Washington, D.C. as of 2026, requires insurers to establish written programs governing responsible AI use, including governance frameworks, accountability structures, and bias testing protocols (NAIC, adopted December 2023; multiple state adoptions through 2025). The bulletin does not differentiate between proprietary and licensed vendor AI: if the model touches a regulated insurance decision, the insurer carries the documentation obligation.

The audit trail translates that obligation into four specific records for every AI-influenced decision subject to regulatory examination: the model version active at decision time, the inputs that generated the output, the governance sign-off that approved that model version for production, and the override record if a human departed from the model recommendation. Those four elements are not default deliverables in most vendor SaaS contracts. They are negotiated terms, and carriers that do not negotiate them discover the gap during examination rather than before it.
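The four records can be captured as one structure per decision. The sketch below is a hypothetical schema: the class and field names, the sample identifiers, and the JSON serialization are illustrative choices, not a regulatory or vendor format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class OverrideRecord:
    user_id: str
    final_decision: str
    reason: str


@dataclass(frozen=True)
class DecisionAuditRecord:
    decision_id: str
    decided_at: datetime
    model_version: str          # 1. model version active at decision time
    inputs: Dict[str, Any]      # 2. inputs that generated the output
    model_output: str
    governance_signoff_id: str  # 3. sign-off approving this version for production
    override: Optional[OverrideRecord] = None  # 4. present only if a human departed

    def to_json(self) -> str:
        payload = asdict(self)  # recurses into the nested OverrideRecord
        payload["decided_at"] = self.decided_at.isoformat()
        return json.dumps(payload, sort_keys=True)


# Hypothetical example values.
record = DecisionAuditRecord(
    decision_id="CLM-0001",
    decided_at=datetime(2026, 3, 1, 14, 30, tzinfo=timezone.utc),
    model_version="triage-2.3.1",
    inputs={"claim_type": "water_damage", "reported_amount": 4200},
    model_output="fast_track",
    governance_signoff_id="GOV-2026-017",
    override=OverrideRecord("adj-42", "manual_review", "prior claim on same address"),
)
print(record.to_json())
```

Writing the record at decision time, rather than reconstructing it later from vendor logs, is what makes the model version and inputs provable during examination.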

The Regulatory Examination Is Already Running

The NAIC AI Systems Evaluation Tool is in a 12-state pilot running January through September 2026, with formal adoption at the Fall National Meeting as the expected pathway (NAIC, 2026). Participating states include Colorado, Maryland, Louisiana, Virginia, Connecticut, Pennsylvania, Wisconsin, Florida, Rhode Island, Iowa, Vermont, and California. The tool evaluates the extent of AI operational use, governance and risk mitigation practices, high-risk model identification, and input data types.

The current tool focuses on governance documentation: does the insurer have a written program, accountability structures, and testing records? The predictable next phase is performance attribution: are the AI models documented as active in underwriting or claims actually performing as their governance documents describe? That question requires holdout data, baseline comparisons, and drift monitoring records -- data that is straightforward to produce if the carrier built the measurement infrastructure before deployment and effectively unavailable if it did not.

"Carriers are not resistant to change. They are resistant to being burned again," one analysis of the carrier technology relationship observed (Carrier Management, June 2026). The measurement infrastructure described above is not a constraint on AI deployment. It is the mechanism by which this deployment cycle produces results that survive an actuarial review, a board presentation, and a market conduct examination -- and that separate the carriers building durable AI capability from those running pilots in perpetuity.

Sources

"Insurance AI: What You Won't Read in the Press Releases," Carrier Management, June 19, 2026. carriermanagement.com
"Insurance industry still stuck in AI pilot phase, report finds," CIO Dive, 2026. ciodive.com
"2026 Evident AI Index for Insurance Key Findings Report," Evident Insights, June 2026. evidentinsights.com
NAIC Insurance Topics: Artificial Intelligence, National Association of Insurance Commissioners. Includes the Model Bulletin on the Use of AI Systems (adopted December 2023) and the AI Systems Evaluation Tool pilot details. content.naic.org
"Nearly Half of States Have Now Adopted NAIC Model Bulletin on Insurers' Use of AI," Quarles Law Firm, 2026. quarles.com
"How the NAIC AI model bulletin is evolving and why insurers should prepare now," Plante Moran, March 2026. plantemoran.com