From reviewing AI governance documentation across six carrier implementations in the past year, the pattern is consistent: model performance artifacts exist, but auditability records (data lineage, bias logs, drift alerts) are incomplete or missing entirely. Carriers can demonstrate that their AI models work. They cannot demonstrate how they governed those models while they worked. That distinction separates a pilot from a production deployment in a regulated environment.
Grant Thornton's 2026 AI Impact Survey, fielded between February 23 and March 18 with 100 insurance-specific respondents among 950 total business leaders, quantifies the gap precisely. Forty-four percent of insurance executives report that governance or compliance challenges have contributed to AI project failure or underperformance. Only 24% say they are "very confident" their organization could pass an independent AI governance review within 90 days. The remaining 76% acknowledge fragmented controls, incomplete documentation, or untested response plans as impediments to auditability.
This article maps the audit-readiness gap from three angles: what independent AI audits actually test, how the NAIC's vendor registration framework creates a de facto audit standard that takes effect regardless of state-by-state adoption timelines, and what it costs to build, from scratch, a governance program capable of passing scrutiny.
The 44/24 Gap: Why Governance Kills More AI Projects Than Technology
The Grant Thornton findings deserve unpacking beyond the headline numbers. Among insurance respondents, 68% report that AI controls exist but remain fragmented across teams and tools. Sixty-one percent of boards have established AI governance policies on paper. But only 20% have tested their response plans for AI failures. The gap between policy existence and operational provability is the core problem.
Tom Puthiyamadam, Grant Thornton's managing partner overseeing the survey, characterized the pattern: "Companies are making tremendous investments into AI and yet, we're not seeing that correlate with an increase in AI accountability." The survey also found that 73% of insurance respondents are piloting, scaling, or running autonomous AI systems. The combination of wide deployment with narrow accountability creates the audit exposure that 76% of respondents acknowledge.
Broader cross-industry context reinforces the insurance-specific findings. MIT Sloan's 2025 research documented a 95% generative AI pilot failure rate, with infrastructure costs running 3x to 5x initial projections at production scale. But in insurance specifically, the failure mechanism differs. Carriers rarely abandon AI projects because the model does not perform. They abandon them because the governance infrastructure required to prove regulatory compliance does not exist, or because the cost of retrofitting governance onto a production model exceeds the projected benefit of the model itself.
This pattern has repeated across multiple carrier deployments we have tracked. A pricing model achieves a 4-point improvement in loss ratio predictiveness during testing. The carrier moves toward production. The compliance team requests documentation of training data provenance, protected-class impact testing, and model change management procedures. The data science team has none of these artifacts in a form that satisfies regulatory scrutiny. The project stalls. Six months later, it is quietly shelved, reported internally as a "resource prioritization" decision rather than a governance failure.
What an Independent AI Audit Actually Tests
The distinction between "having AI governance" and "being audit-ready" collapses into six concrete testing domains. Each domain produces binary pass/fail outcomes when an independent examiner evaluates controls.
Model Documentation and Inventory
An audit begins with inventory completeness. The examiner requests a list of all AI/ML systems in production or development that touch consumer-facing decisions: pricing, underwriting, claims handling, fraud detection, marketing, and lead scoring. Mid-sized carriers (2,000 to 4,000 in-scope systems) consistently discover 20% to 40% more systems than initial estimates during formal inventory exercises. Shadow models built by individual actuarial teams, vendor-embedded scoring engines, and legacy rule systems that technically qualify as algorithmic decision-making expand the audit scope rapidly.
For each inventoried system, the audit expects a model card documenting architecture, training data sources and date ranges, intended use, known limitations, and version history. The model card format has become a de facto standard through the NAIC's AI Systems Evaluation Tool Exhibit C, which requires carriers to describe design, training data characteristics, performance metrics, and bias testing results for any AI system classified as high-risk.
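As a minimal sketch of what that documentation might look like as a structured record, the dataclass below captures the model card categories described above. The field names are illustrative shorthand, not the NAIC Exhibit C schema.

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Illustrative model card record; field names are hypothetical,
    chosen to mirror the documentation categories described above."""
    model_id: str
    architecture: str                            # e.g., "gradient-boosted trees"
    intended_use: str                            # approved decision context
    training_data_sources: list[str]
    training_data_date_range: tuple[str, str]    # (start, end), ISO dates
    known_limitations: list[str]
    version_history: list[str] = field(default_factory=list)
    risk_classification: str = "unclassified"    # e.g., "high-risk"

card = ModelCard(
    model_id="pricing-gbm-014",
    architecture="gradient-boosted decision trees",
    intended_use="homeowners pricing indication support",
    training_data_sources=["internal policy admin", "third-party property data"],
    training_data_date_range=("2019-01-01", "2024-12-31"),
    known_limitations=["sparse data for coastal ZIP codes"],
)
```

Maintaining records in a machine-readable form like this makes the inventory queryable during an examination rather than scattered across documents.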
Bias and Fairness Testing
The audit examines whether the carrier has conducted statistical testing for disparate impact across protected characteristics. Standard metrics include statistical parity, equal opportunity, calibration assessment, and counterfactual flip tests. The four-fifths rule (whether the approval rate for any protected group falls below 80% of the rate for the most-favored group) remains the baseline threshold, though more sophisticated carriers test equalized odds and calibration by subgroup.
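A minimal sketch of the baseline checks, assuming pandas and a binary approval outcome; the 0.8 cutoff encodes the four-fifths rule described above, and the toy data is for illustration only.

```python
import pandas as pd

def four_fifths_check(df: pd.DataFrame, group_col: str, approved_col: str):
    """Compute approval rates by group and flag any group whose rate
    falls below 80% of the most-favored group's rate (four-fifths rule)."""
    rates = df.groupby(group_col)[approved_col].mean()
    ratios = rates / rates.max()               # 1.0 for the most-favored group
    parity_gap = rates.max() - rates.min()     # statistical parity difference
    flagged = ratios[ratios < 0.8].index.tolist()
    return rates, parity_gap, flagged

# Toy data for illustration only -- not real carrier outcomes.
df = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "approved": [1] * 80 + [0] * 20 + [1] * 60 + [0] * 40,
})
rates, gap, flagged = four_fifths_check(df, "group", "approved")
print(rates.to_dict(), gap, flagged)   # B's ratio = 0.60 / 0.80 = 0.75 -> flagged
```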
From reviewing carrier bias testing programs, 15% to 30% of consumer-facing models get flagged in the first comprehensive pass. The flags do not necessarily indicate unlawful discrimination. They indicate that the carrier cannot affirmatively demonstrate the absence of disparate impact with documentation sufficient for regulatory review. The distinction matters: an auditor does not require proof of fairness in every model. The auditor requires proof that the carrier tested for fairness systematically and documented outcomes.
Proxy variable screening adds complexity. ZIP code correlates with race. Credit score correlates with income. Property age correlates with neighborhood demographics. The NAIC's AI Evaluation Tool Exhibit D specifically screens for proxies related to race, ethnicity, social media data, and aerial imagery that may correlate with protected characteristics. Carriers that price using these variables without documented proxy testing face audit findings even when the variables are actuarially justified.
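A first-pass proxy screen can be automated along these lines: flag any candidate rating variable whose correlation with a protected attribute exceeds a cutoff, then subject the flagged variables to documented analysis. This is a simplified sketch, assuming numeric rating variables; the 0.3 threshold is an illustrative assumption, not a regulatory value.

```python
import pandas as pd

def proxy_screen(df: pd.DataFrame, protected_col: str,
                 candidate_cols: list[str], threshold: float = 0.3):
    """Flag candidate rating variables (assumed numeric) whose absolute
    correlation with any level of a protected attribute exceeds a
    threshold. The 0.3 cutoff is illustrative, not a regulatory value."""
    encoded = pd.get_dummies(df[protected_col], prefix=protected_col).astype(float)
    flags = {}
    for col in candidate_cols:
        # Max correlation against any level of the protected attribute.
        corr = encoded.corrwith(df[col]).abs().max()
        if corr > threshold:
            flags[col] = round(corr, 3)
    return flags   # variables requiring documented proxy analysis
```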
Model Performance and Drift Detection
An audit tests whether the carrier monitors deployed models for performance degradation over time. The industry-standard thresholds that trigger remediation action include: a 5% accuracy change from the validated baseline, a four-fifths rule violation emerging on any monitored protected class, and distribution shift in key input features exceeding predefined statistical bounds.
Most carriers have monitoring for the first threshold (accuracy). Few monitor the second (emerging bias) or third (input drift) with the same rigor. The audit finding is predictable: model monitoring exists for performance metrics that affect profitability but not for fairness metrics that affect regulatory compliance.
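Tying the three thresholds together in one monitoring pass might look like the sketch below. It assumes numpy, uses the Population Stability Index for input drift, and treats PSI > 0.2 as the illustrative "predefined statistical bound"; only the 5% accuracy and four-fifths figures come from the thresholds above.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current feature
    distribution; PSI > 0.2 is a common (illustrative) drift trigger."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_pct = np.histogram(expected, edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def drift_alerts(baseline_acc: float, current_acc: float,
                 group_ratios: dict, feature_psis: dict) -> list[str]:
    """Apply the three remediation thresholds described above."""
    alerts = []
    if abs(current_acc - baseline_acc) / baseline_acc > 0.05:
        alerts.append("accuracy drift > 5% of validated baseline")
    for group, ratio in group_ratios.items():   # approval-rate ratio vs. top group
        if ratio < 0.8:
            alerts.append(f"four-fifths violation emerging: {group}")
    for feat, value in feature_psis.items():
        if value > 0.2:                         # illustrative PSI bound
            alerts.append(f"input distribution shift: {feat} (PSI={value:.2f})")
    return alerts
```

Running all three checks on the same cadence, from the same dashboard, is what closes the profitability-versus-compliance monitoring gap.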
Data Lineage and Governance
Every model in scope requires documentation of where training data originated, how it was processed, what transformations were applied, and whether data collection complied with applicable privacy regulations. The audit trail must include input features, model version, output, confidence indicators, timestamp, and downstream action for each decision the model influenced. Regulatory expectations indicate a 7-year retention period aligned with standard examination cycles.
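A minimal sketch of one such audit-trail record, assuming JSON serialization to an append-only log sink; the field names are illustrative, and retention (the 7-year horizon) is handled by the sink, not shown here.

```python
import json
from datetime import datetime, timezone

def log_decision(model_id: str, model_version: str, features: dict,
                 output, confidence: float, downstream_action: str) -> str:
    """Serialize one model-influenced decision with the audit-trail
    fields named above; the log sink handles retention and immutability."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "model_version": model_version,
        "input_features": features,
        "output": output,
        "confidence": confidence,
        "downstream_action": downstream_action,
    }
    return json.dumps(record, sort_keys=True)

entry = log_decision("pricing-gbm-014", "2.3.1",
                     {"territory": "07", "construction": "frame"},
                     output=1.12, confidence=0.87,
                     downstream_action="rate indication adjusted")
```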
This is where most carriers fail. Data science teams build models using datasets assembled from multiple internal and external sources. The assembly process rarely documents lineage at the granularity an auditor requires. Reconstructing lineage retroactively for a model already in production is expensive, sometimes impossible, and always raises questions about whether the reconstructed documentation accurately reflects the actual data pipeline.
Explainability and Reproducibility
The auditor tests whether the carrier can explain, in terms a regulator understands, why a specific model produced a specific output for a specific policyholder. For linear models and decision trees, this is straightforward. For gradient-boosted ensembles and neural networks, it requires SHAP values, LIME explanations, or comparable post-hoc interpretability tools applied consistently across the model's operating history.
Reproducibility adds another layer: can the carrier rerun the model on identical inputs and produce identical outputs? Stochastic elements in training, software version dependencies, and hardware-specific floating-point behavior all complicate reproducibility. An audit finding of "non-reproducible model outputs" is not uncommon for carriers running deep learning systems on GPU clusters without strict environment pinning.
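A reproducibility gate can be made concrete as follows: pin the data and training seeds, retrain, rescore an identical input batch, and compare output fingerprints. This is a sketch using scikit-learn on synthetic data; in practice the same check would run against the production environment with pinned package versions.

```python
import hashlib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def output_fingerprint(model, X: np.ndarray) -> str:
    """Hash the model's scores on a fixed input batch; two runs in the
    same pinned environment should produce identical fingerprints."""
    scores = model.predict_proba(X)[:, 1].round(10)  # round to tame float noise
    return hashlib.sha256(scores.tobytes()).hexdigest()

rng = np.random.default_rng(seed=0)                  # pinned data seed
X, y = rng.normal(size=(500, 8)), rng.integers(0, 2, 500)

fingerprints = set()
for _ in range(2):
    model = GradientBoostingClassifier(random_state=42).fit(X, y)  # pinned seed
    fingerprints.add(output_fingerprint(model, X))

assert len(fingerprints) == 1, "non-reproducible model outputs"
```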
Change Management and Incident Response
Finally, the audit evaluates the carrier's procedures for model updates, retraining, and failure response. Questions include: Who authorizes model changes? What testing gates exist before production deployment? What happens when a model produces an anomalous output? Who is notified? What is the rollback procedure? Has the response plan been tested?
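Part of this domain can be enforced mechanically. As a sketch, the gate below blocks a model change unless every required artifact is present in the change record; the gate names are hypothetical, not a standard.

```python
REQUIRED_GATES = {            # illustrative gate names, not a standard
    "bias_test_passed",
    "validation_signed_off",
    "rollback_procedure_documented",
    "change_authorized_by",
}

def deployment_gate(change_record: dict) -> list[str]:
    """Return the gates missing from a model change record;
    an empty list means the change is cleared to deploy."""
    return sorted(g for g in REQUIRED_GATES if not change_record.get(g))

missing = deployment_gate({
    "bias_test_passed": True,
    "validation_signed_off": True,
    "change_authorized_by": "model risk committee",
    # rollback procedure not yet documented -> deployment blocked
})
print(missing)   # ['rollback_procedure_documented']
```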
The Grant Thornton finding that only 20% of carriers have tested AI failure response plans directly addresses this domain. An untested plan is an audit finding. It does not matter that the plan exists in a policy document. The auditor looks for evidence of execution: tabletop exercises, documented false alarms that triggered the response chain, or post-incident reviews from actual model failures.
The NAIC Vendor Registration Framework as a De Facto Audit Standard
On March 23, 2026, the NAIC's Third-Party Data and Models (H) Working Group advanced its vendor registration framework at the Spring National Meeting in San Diego. The framework moves toward public exposure in Q3 2026, with adoption consideration at the November 2026 Fall Meeting and first state implementations expected in late 2026 or early 2027.
The registration requirement applies to vendors whose models or data are used in consumer-facing insurance decisions: pricing, underwriting, claims handling (coverage decisions and reserve estimation), utilization review, marketing and lead scoring, and fraud detection. Vendors must disclose model descriptions and intended use, training data sources and date ranges, documented testing methodology including bias testing, known limitations, change-management practices, and a contact for regulator inquiries. Annual attestation of adherence to governance program requirements creates ongoing compliance obligations.
The critical distinction: registration creates no safe harbor for insurers. Carriers remain fully accountable for vendor model behavior. The 2023 NAIC Model Bulletin already requires insurers to apply "the same rigor" to third-party models as internally developed systems. The vendor registry operationalizes that requirement by giving regulators a baseline to compare against carrier due diligence files.
For carriers, the due diligence file must include: a model card covering architecture, training data, intended use, known limitations, and version history; the vendor's bias testing methodology and results; sub-processor disclosure identifying every entity handling training data, hosting the model, or contributing to inference; incident-response and notification commitments; written monitoring metrics commitments; and exit and continuity provisions.
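A minimal completeness check over that file might look like the sketch below; the item keys are hypothetical shorthand for the artifacts listed above.

```python
DUE_DILIGENCE_ITEMS = [       # mirrors the file contents listed above
    "model_card",
    "vendor_bias_testing_results",
    "sub_processor_disclosure",
    "incident_response_commitments",
    "monitoring_metrics_commitments",
    "exit_and_continuity_provisions",
]

def due_diligence_gaps(vendor_file: dict) -> list[str]:
    """Return the due diligence artifacts missing from a vendor file."""
    return [item for item in DUE_DILIGENCE_ITEMS if item not in vendor_file]
```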
The practical effect is clear. Even before the vendor registry becomes mandatory in any state, the framework's disclosure categories define what an examiner expects to find in a carrier's vendor oversight records. Carriers that wait for formal state adoption before building these artifacts face a compressed implementation timeline when the requirement activates. Carriers that build now treat the framework as a roadmap rather than a deadline.
The 12-State Evaluation Pilot Adds Operational Pressure
Running in parallel with the vendor registration framework, the NAIC launched an AI Systems Evaluation Tool pilot on March 2, 2026, across 12 states: California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin. The pilot runs through September 2026, with nationwide adoption expected at the November 2026 Fall Meeting.
The evaluation tool comprises four exhibits. Exhibit A requests an AI Usage Inventory. Exhibit B requires a Governance Risk Assessment. Exhibit C demands detailed information on High-Risk AI Systems including design, training data, and performance. Exhibit D probes AI Data Details with specific attention to proxy variables for protected characteristics.
Carriers operating in pilot states that receive a regulatory request must respond substantively. The evaluation tool is not a suggestion; it is a market conduct examination instrument. Carriers unable to populate Exhibits C and D with existing documentation face the choice between rapid remediation and regulatory non-response, neither of which is attractive.
The 12-state pilot geography is strategically significant. California, Colorado, and Connecticut already have state-specific AI governance requirements beyond the NAIC Model Bulletin. Florida and Pennsylvania are among the largest premium markets. The pilot captures carriers representing a substantial majority of U.S. written premium, ensuring that the governance standards become operationally relevant for any carrier of meaningful scale.
Governance Maturity: Dedicated Teams vs. Bolted-On Compliance
From tracking carrier governance structures, two distinct organizational models have emerged. The first model creates a dedicated AI governance function with a single accountable executive reporting to the CEO or COO, dotted-line authority across the CIO, CRO, Chief Compliance Officer, General Counsel, and Chief Data Officer, and a standing operations team maintaining the model inventory. The second model distributes AI governance responsibilities across existing enterprise risk management functions, treating AI as another risk category within an existing framework.
The Grant Thornton data suggests the dedicated-team model correlates with audit readiness. Organizations with "fully integrated AI" (which implies dedicated governance structures) are nearly 4x more likely to report AI-driven revenue growth (58% vs. 15% for those still piloting). The mechanism is intuitive: dedicated governance teams build the audit artifacts concurrently with model development rather than retrospectively.
The bolted-on model produces a specific failure pattern. Enterprise risk teams have expertise in financial risk, operational risk, and compliance risk. They lack expertise in the technical specifics of AI governance: what constitutes adequate bias testing, how to evaluate data lineage completeness, what drift detection thresholds are appropriate for different model types. The result is governance documentation that satisfies internal policy requirements but fails external audit scrutiny because the documentation reflects risk management language rather than AI-specific technical substance.
Carriers in the dedicated-team model typically employ the following organizational structure: a Chief AI Officer or equivalent ($264K to $494K compensation, median approximately $353K), a VP of AI Governance ($190K to $280K), directors of AI governance for each major business function ($190K to $250K), and AI governance analysts embedded within data science teams. The total headcount for a mid-sized carrier ranges from 8 to 15 full-time equivalents dedicated to AI governance.
Cost of Building an Audit-Ready Governance Program
For a mid-sized carrier with 2,000 to 4,000 in-scope AI systems, building an audit-ready governance program from scratch requires $4 million to $8 million in incremental spending across personnel, infrastructure, and outside counsel. The timeline and key outputs break down by workstream in a predictable sequence.
| Workstream | Timeline | Key Outputs |
|---|---|---|
| Model Inventory | 6-8 weeks | Complete registry of all AI/ML systems, risk classification, owner assignment |
| Bias Testing (First Pass) | 10-14 weeks | Statistical parity, four-fifths rule, proxy screening for all high-risk models |
| Drift Monitoring Dashboards | 6-10 weeks | Automated alerts for accuracy degradation, distribution shift, emerging bias |
| Audit Trail Specification | 2-4 weeks | Logging requirements, retention policy, reproducibility standards |
| Vendor Contract Amendments | 2 weeks | Disclosure requirements, bias testing access, incident notification SLAs |
| Organizational Standup | 8-12 weeks | Executive hire, team build, RACI assignment, reporting cadence |
The model inventory workstream consistently reveals scope surprises. The 20% to 40% system discovery overshoot means that carriers budgeting for 2,000 systems should plan governance infrastructure for 2,400 to 2,800. Each additional system requires risk classification, owner assignment, and documentation assessment. The overshoot alone can add $500K to $1M to the program cost.
Retrofitting governance onto existing ML pipelines costs more than building it into new deployments. The retrofit premium exists because existing systems lack instrumented logging, because training data provenance must be reconstructed from version control history and team memory rather than recorded at ingestion time, and because model change histories must be assembled from deployment logs rather than formal change management records. Carriers with mature MLOps practices (automated training pipelines, model registries, feature stores) face lower retrofit costs than those running ad hoc model development processes.
For carriers building new AI systems with governance embedded from the start, the incremental cost is substantially lower: approximately 15% to 20% of total model development cost added for governance instrumentation, documentation automation, and testing infrastructure. This delta explains why carriers that delay governance investment face escalating costs as their model inventory grows.
The Actuarial Role in AI Governance
Actuaries occupy a structurally advantageous position in AI governance because they already perform model validation as a core professional function. ASOP No. 56 (Modeling), effective since October 2020, requires actuaries to evaluate model appropriateness for intended use, assess data quality, test model performance, ensure appropriate governance and controls, and disclose material limitations. These requirements apply to all model types, including AI and ML systems.
The ASOP No. 56 Fourth Exposure Draft currently in development broadens the standard to explicitly accommodate predictive and statistical modeling. New provisions require evaluation of "reasonable model in the aggregate" and mandate specific validation activities: historical backtesting, statistical tests, sensitivity analysis, and comparison with alternative models. For actuaries validating AI systems, these provisions create professional obligations that align directly with audit expectations.
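Two of those validation activities lend themselves to short sketches: a historical backtest of prediction error by period, and a one-factor sensitivity scan. Both assume a fitted model exposing a scikit-learn-style `predict`; the +/-10% bump sizes are illustrative assumptions.

```python
import pandas as pd

def backtest_by_period(df: pd.DataFrame, model, feature_cols: list[str],
                       actual_col: str, period_col: str) -> pd.Series:
    """Historical backtest: mean absolute error of model predictions
    against actuals, computed separately for each period."""
    df = df.assign(pred=model.predict(df[feature_cols]))
    return (df.assign(abs_err=(df["pred"] - df[actual_col]).abs())
              .groupby(period_col)["abs_err"].mean())

def sensitivity_scan(model, X: pd.DataFrame, col: str,
                     bumps=(-0.10, 0.10)) -> dict:
    """One-factor sensitivity: shock a single input +/-10% and report
    the change in mean prediction. Bump sizes are illustrative."""
    base = model.predict(X).mean()
    out = {}
    for bump in bumps:
        shocked = X.copy()
        shocked[col] = shocked[col] * (1 + bump)
        out[bump] = model.predict(shocked).mean() - base
    return out
```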
ASOP No. 38 (Using Models Outside the Actuary's Area of Expertise) adds another dimension. Its Second Exposure Draft expands from P&C to all practice areas and directly addresses the scenario where actuaries rely on AI/ML models developed by data scientists. The standard requires actuaries to understand the model's general approach, assess whether it is appropriate for the intended actuarial purpose, and document the basis for reliance.
The appointed actuary's role in governance is particularly significant. For carriers where AI models influence reserve estimates, pricing indications, or risk-based capital calculations, the appointed actuary's statement of actuarial opinion implicitly encompasses the governance of those models. An appointed actuary who signs an opinion relying on AI model outputs without verifying that adequate governance exists faces professional liability exposure under ASOP No. 41 (Actuarial Communications) and potentially under Precept 1 of the Code of Professional Conduct.
From tracking how governance teams interact with actuarial functions, the most effective model assigns the Chief Risk Officer or a dedicated AI fairness officer ownership of bias testing methodology, while actuaries retain responsibility for model validation, ASOP compliance documentation, and the connection between model outputs and regulatory filings (rate indications, reserve opinions, risk-based capital computations). This division respects actuarial expertise in model assessment while acknowledging that fairness testing requires statistical and legal expertise beyond traditional actuarial training.
Why the Audit Layer Becomes the Binding Constraint in 2026
Three regulatory timelines converge in the second half of 2026, elevating the audit layer from a compliance checkbox to the binding constraint on carrier AI deployment.
First, the EU AI Act classifies insurance risk assessment as high-risk under Annex III, with enforcement beginning August 2, 2026. The Act imposes penalties up to 35 million euros or 7% of worldwide revenue for non-compliance. Carriers with European operations face mandatory bias testing, explainability documentation, and audit trail requirements that exceed current U.S. standards. Multi-jurisdictional carriers building governance programs must satisfy both EU and NAIC requirements simultaneously.
Second, the NAIC's 12-state evaluation pilot creates examination exposure for carriers in participating states through September 2026. Carriers receiving pilot requests must demonstrate governance capabilities or acknowledge gaps to their primary regulator. The pilot is explicitly designed to identify governance maturity levels across the industry, meaning that carriers with weak responses face heightened supervisory attention.
Third, the vendor registration framework's public exposure in Q3 2026 signals the governance expectations that will become mandatory. Forward-looking carriers treat the exposure draft as a compliance roadmap. Carriers that wait for final adoption face implementation timelines that may exceed the effective date.
Colorado's AI Act, with its June 30, 2026 insurance compliance deadline, adds state-specific pressure. Nearly 25 states have now adopted the NAIC Model Bulletin in some form. California, Colorado, New York, and Texas have enacted or proposed specific AI regulations beyond the bulletin framework. The regulatory environment is not waiting for carriers to achieve governance maturity before imposing accountability requirements.
The compound effect: by Q4 2026, a mid-sized multi-state carrier faces potential examination under the NAIC evaluation tool, compliance obligations under one or more state AI laws, vendor disclosure requirements under the forthcoming registry, and professional standards obligations under ASOP No. 56. Each of these regimes tests the same underlying governance artifacts. The carrier that has invested in audit-ready governance satisfies all four simultaneously; the carrier that has not fails all four simultaneously.
What This Means for Actuarial Practice
The audit-readiness gap reshapes actuarial work in three concrete ways.
First, model validation workloads expand. Every AI system touching actuarial outputs (pricing, reserving, capital modeling) requires validation documentation at a standard that satisfies both ASOP No. 56 and the NAIC evaluation tool exhibits. Actuaries who historically validated one or two models per year for rate filings now face validation queues of 10 to 20 AI-augmented systems requiring concurrent documentation.
Second, the appointed actuary's opinion gains AI governance implications. An actuary signing a statement relying on AI-influenced reserves or pricing indications implicitly attests to the adequacy of governance over those models. This requires either direct verification of governance artifacts or documented reliance on an AI governance function under ASOP No. 38 expert reliance provisions.
Third, consulting opportunities expand. The $4M to $8M governance build-out for mid-sized carriers creates demand for actuaries with model validation expertise who can bridge the gap between data science teams (which build models) and compliance teams (which write policies). Actuaries positioned at that intersection combine technical model assessment skills with regulatory filing experience and professional standards knowledge that neither pure data scientists nor pure compliance professionals possess.
The carriers that recognize governance as a competitive advantage rather than a compliance cost will deploy AI faster, maintain regulatory relationships, and avoid the project abandonment cycle that the Grant Thornton data documents. The binding constraint is not whether AI works in insurance. It does. The binding constraint is whether carriers can prove, to an independent examiner's satisfaction, that they governed their AI responsibly while it worked. That proof is the audit layer, and it is where 76% of the industry falls short.
Further Reading
- Hartford's Algorithmic Impact Assessment Sets the Carrier Transparency Bar - How one carrier operationalized AI governance documentation ahead of regulatory requirements.
- The AI Governance Gap in Actuarial Practice - ASOP 56 compliance and model risk management for AI systems in actuarial workflows.
- NAIC Flags Agentic AI as Insurance's Next Governance Gap - The Spring 2026 panel on autonomous AI risks and what actuaries validating agentic workflows must consider.
- Machine Learning for Loss Reserves: The ASOP Compliance Gap - Documentation friction for ML reserve models and a hybrid approach preserving audit defensibility.
- Insurer AI Adoption Hits 82% But Only 7% Reach Full Scale - The adoption-to-scale gap that governance failures largely explain.
Sources
- Grant Thornton: 2026 AI Impact Survey Report (Insurance Edition)
- BusinessWire: Grant Thornton Survey on the Widening AI Proof Gap (April 2026)
- Alston & Bird: Key AI, Cybersecurity, and Privacy Takeaways from the NAIC 2026 Spring Meeting
- Mondaq: NAIC Spring 2026 Meeting Third-Party Data and Models (H) Working Group
- Sidley Austin: Regulatory Update NAIC Spring 2026 National Meeting
- Swept AI: The Insurance CIO's AI Governance Playbook for Q3 2026
- Swept AI: NAIC Third-Party Model Vendor Registry 2026
- Swept AI: NAIC AI Evaluation Tool 12-State Pilot 2026
- NAIC: Third-Party Data and Models Working Group Spring 2026 Meeting Materials
- Actuarial Standards Board: ASOP No. 56 (Modeling)
- Actuarial Standards Board: ASOP No. 56 Fourth Exposure Draft
- Cherry Bekaert: AI in Insurance - How to Build a Compliant Governance Framework
- Fenwick: Tracking the Evolution of AI Insurance Regulation
- Insurance Business: Widening AI Proof Gap Exposes Weak Governance (April 2026)