Three months into the NAIC's 12-state AI Systems Evaluation Tool pilot, regulators have gathered the first structured inventory of how insurers actually deploy AI in production. The Big Data and Artificial Intelligence (H) Working Group's June 1, 2026 public meeting surfaced early patterns in P&C and life insurer submissions, revealing which applications regulators classify as high-risk and how the proportionality principle operates when applied to real companies with real model inventories. A review of every public comment filed against the NAIC's AI tool exposure draft shows a striking pattern: insurers pushed back hardest on Exhibit C's requirement to detail high-risk model inputs, not on the governance questions in Exhibit B. That tells you where the operational burden actually falls. This article analyzes the mid-pilot findings, what they reveal about the state of insurer AI deployment, and what the data means for the expected Fall 2026 adoption vote.
What the June 1 Meeting Revealed
Pennsylvania Insurance Commissioner Michael Humphreys, who chairs the Working Group, opened the June 1 session with a progress update on the pilot that has been running since March 2026 across California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin. The meeting included two substantive components: a mid-pilot status briefing from participating state regulators and an actuarial panel on AI governance trends that featured practitioners from consulting firms and carrier compliance teams.
The actuarial panel focused on how governance frameworks interact with the evaluation tool's exhibit structure. Panelists described a gap between the governance documentation carriers maintain internally and the format regulators need for examination purposes. One recurring theme: companies that invested in model risk management frameworks modeled on the banking sector's SR 11-7 guidance (now replaced by SR 26-2) found the transition to the NAIC's exhibit format more straightforward than companies building governance documentation from scratch. The Working Group scheduled a follow-up actuarial panel for July 22, signaling that the intersection of actuarial practice and AI governance will remain a focus through the pilot's conclusion.
The most informative portion of the meeting was the mid-pilot briefing itself, which for the first time provided a structured picture of what AI applications insurers are actually reporting to regulators. The submissions are confidential under each participating state's examination authority, but the aggregate patterns discussed publicly paint a clearer picture than any industry survey has managed to date, because these are regulatory filings, not self-reported marketing data.
P&C Insurer AI Submissions: Breadth Across the Value Chain
P&C insurers participating in the pilot reported AI use cases spanning marketing, underwriting, pricing, and claims. Set against 18 months of tracking insurer AI disclosures across earnings calls, 10-K filings, and vendor announcements, the pilot submissions confirm what the public data suggested but add granularity that was previously unavailable.
Marketing. Several P&C carriers disclosed AI-driven targeted advertising systems that use behavioral and demographic data to serve personalized ads to prospective policyholders. These systems were generally classified as lower-risk under the proportionality framework because they do not directly affect coverage decisions, pricing, or claims outcomes. However, regulators flagged that marketing AI intersects with fair lending and unfair trade practices statutes when targeting algorithms correlate with protected class characteristics. The distinction between "marketing" and "underwriting" blurs when the same data signals used to target ads also feed into quote eligibility logic, a concern Colorado's Department of Insurance has raised explicitly in its parallel state AI compliance framework.
Underwriting. Reported underwriting applications fell into two clusters. The first involves renewal evaluation models that flag policies for non-renewal or premium adjustment based on loss history, property condition changes, and external data feeds. The second involves AI-powered property inspection tools, primarily aerial and satellite imagery analysis that assesses roof condition, vegetation encroachment, and structural changes without requiring a physical visit. This second category has drawn its own regulatory attention: thirteen state insurance departments have now issued bulletins specifically governing aerial imagery AI in homeowners underwriting. Pilot submissions confirmed that multiple carriers use third-party vendors for this capability, raising the vendor oversight questions that Exhibit B is designed to surface.
Pricing. Pricing-related AI submissions included machine learning risk scoring models and algorithmic rate factor relativity engines. These are the models most directly relevant to actuarial practice. Carriers described gradient-boosted tree models and neural network architectures generating individual risk scores that feed into filed rating algorithms. The distinction between "AI in pricing" and "traditional GLM-based rating" is not clean in practice: several carriers reported hybrid systems where ML models generate risk features that are then consumed by GLM structures in filed rate plans. From a regulatory standpoint, the question is whether the ML component is documented sufficiently for rate filing review. The NAIC's rate filing scrutiny track runs parallel to the evaluation pilot, and the intersection of these two workstreams is where the most consequential compliance questions sit for pricing actuaries.
Claims. Claims AI applications were the most varied category in P&C submissions. Carriers reported accident image analysis systems (primarily auto physical damage photo assessment), ultimate claim estimation models that project incurred loss development, and fraud detection algorithms that flag suspicious patterns at first notice of loss or during investigation. The Working Group's Spring 2026 decision to flag claims handling for additional regulatory scrutiny reflected early pilot observations that claims AI frequently operates with limited human review, particularly in low-severity auto and property claims where straight-through processing rates have reached 70% at some carriers. The proportionality principle directs regulators to focus on these high-automation claims workflows rather than AI tools that assist adjusters without making autonomous decisions.
Life Insurer AI Submissions: Faster Decisions, Higher Stakes
Life insurer submissions revealed a narrower but higher-stakes set of AI applications concentrated in the underwriting and policy issuance pipeline. Three categories dominated the life side of the pilot data.
Policy issuance acceleration. Multiple life insurers reported AI systems designed to fast-track policy issuance by automating the review of application data, medical records, and third-party data sources. These accelerated underwriting pipelines reduce the time from application to binding from weeks to minutes for applicants whose profiles fall within predefined parameters. The AI does not eliminate underwriting judgment; it identifies which applicants can bypass full manual review based on risk scores derived from pharmacy benefit manager data, motor vehicle records, credit-based insurance scores, and electronic health records. Regulators flagged these systems for proportionality review because the speed of automated decisions can outpace the consumer's ability to understand and contest an adverse action.
Approval and denial decisions. A subset of life insurer submissions described AI models that contribute directly to approval and denial outcomes. This is the category that draws the most regulatory scrutiny under the evaluation tool, because a model that influences whether an applicant receives coverage at all has the highest possible consumer impact. The NCOIL Model Act's "qualified human professional" requirement, if adopted, would require a human with statutory authority to independently review any AI-influenced denial. The pilot data suggests that most life insurers maintain human review for denials but allow AI-driven approvals to proceed with less oversight, a structural asymmetry that the Working Group is likely to examine more closely in coming months.
Underwriting risk class assignment. AI-assisted risk classification, where algorithms assign applicants to preferred, standard, or substandard risk tiers, appeared across several life insurer submissions. These systems use the same data sources as the accelerated underwriting pipelines but apply them to a finer-grained classification decision. The actuarial implications are significant: if an ML model assigns risk classes, the mortality assumptions underlying product pricing and reserving are downstream of that model's accuracy. A bias or calibration error in the classification model does not just affect individual applicants; it compounds through pricing margins, reserve adequacy, and experience study results. VM-20 prescribed assumptions do not yet explicitly address the scenario where risk class assignment itself is algorithmically determined.
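The compounding effect described above can be made concrete with a small calculation. The sketch below is illustrative only: the mortality rates, the misassignment share, and the function name are hypothetical assumptions, not pilot data or VM-20 values. It shows how a share of standard-class risks placed in the preferred class pushes that class's actual mortality above the rate assumed in pricing.

```python
def blended_mortality(q_assigned: float, q_true_other: float,
                      misassigned_share: float) -> float:
    """Actual mortality rate of a class when some members belong to another class."""
    if not 0.0 <= misassigned_share <= 1.0:
        raise ValueError("misassigned_share must be between 0 and 1")
    return (1.0 - misassigned_share) * q_assigned + misassigned_share * q_true_other

# Hypothetical annual mortality rates per life (assumptions for illustration).
Q_PREFERRED = 0.0010
Q_STANDARD = 0.0016

# Assume 10% of lives placed in the preferred class are really standard risks.
actual = blended_mortality(Q_PREFERRED, Q_STANDARD, misassigned_share=0.10)
a_to_e = actual / Q_PREFERRED

print(f"actual preferred-class rate: {actual:.5f}")
print(f"actual-to-expected ratio:    {a_to_e:.3f}")
```

Under these assumed inputs, a 10% misassignment rate leaves the preferred class running about 6% above its pricing assumption, the kind of drift that would later surface in experience studies and reserve reviews.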
How the Proportionality Principle Works in Practice
The proportionality principle is the operational mechanism that prevents the evaluation tool from becoming an unbounded audit of every algorithm an insurer runs. In concept, proportionality means that examiner resources concentrate on AI systems with direct consumer impact and material financial consequences, while low-risk back-office automation receives lighter or no scrutiny. In practice, the mid-pilot data shows how this principle is being applied.
Exhibit A, the AI inventory, functions as the triage layer. When a carrier submits its model count broken down by use case, consumer impact, and financial materiality, the reviewing regulator uses that inventory to identify which models warrant deeper investigation through Exhibits B, C, or D. A chatbot that helps internal staff navigate the employee handbook does not trigger escalation. A fraud detection model that automatically routes claims to a special investigations unit does.
The pilot data revealed that regulators are drawing the proportionality line roughly along three dimensions. First, does the AI system affect a coverage decision (issuance, renewal, denial, or cancellation)? Second, does it operate with limited or no human review before its output reaches the consumer? Third, does the system use external data sources whose provenance is not fully controlled by the insurer? A "yes" on any two of these dimensions appears sufficient to trigger Exhibit C scrutiny in most participating states, though the Working Group has not formalized a bright-line rule.
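The two-of-three pattern can be sketched as a simple screening function. This is a minimal illustration, assuming the three dimensions map cleanly to yes/no flags; the class name, field names, threshold parameter, and the flag values assigned to each sample system are hypothetical, and the Working Group has not published a formal rule of this kind.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AISystem:
    name: str
    affects_coverage_decision: bool        # issuance, renewal, denial, or cancellation
    limited_human_review: bool             # output reaches the consumer with little or no review
    uses_uncontrolled_external_data: bool  # provenance not fully controlled by the insurer

def likely_exhibit_c_candidate(system: AISystem, threshold: int = 2) -> bool:
    """Return True when at least `threshold` of the three dimensions are 'yes'."""
    yes_count = sum((
        system.affects_coverage_decision,
        system.limited_human_review,
        system.uses_uncontrolled_external_data,
    ))
    return yes_count >= threshold

# Sample inventory; the flag values are illustrative assumptions.
inventory = [
    AISystem("employee handbook chatbot", False, False, False),
    AISystem("accelerated underwriting triage", True, True, True),
    AISystem("adjuster-assist photo tool", False, False, True),
]

flagged = [s.name for s in inventory if likely_exhibit_c_candidate(s)]
print(flagged)
```

Keeping the threshold as a parameter reflects the fact that the line is an observed tendency across participating states, not a fixed standard, so a carrier could test how its flagged count changes under a stricter or looser reading.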
What the proportionality framework means for carriers in practical terms is that a company running 200 AI models might have only 15 to 30 flagged for deeper review, depending on its line mix and degree of automation. That is a meaningful reduction in compliance burden compared to a hypothetical framework that treats all models equally. But it also means those 15 to 30 models will face a level of documentation scrutiny that most carriers have not previously experienced outside of rate filing proceedings.
Exhibit C: Where the Real Compliance Burden Falls
The joint industry letter filed in December 2025 by six trade associations (ACLI, APCIA, NAMIC, AHIP, RAA, and the Big "I") raised five objections to the pilot structure, and our analysis of those objections at the time focused on the asymmetry between voluntary state participation and effectively compulsory company participation. Three months into the pilot, the operational reality has sharpened the picture: Exhibit C is where the compliance work concentrates.
Exhibit C asks for detailed information on each model the insurer classifies as high-risk. The required disclosures include how the model was developed, what data was used for training, what testing was performed (including bias testing), the degree of human-in-the-loop involvement, the model monitoring cadence, and the documentation supporting ongoing compliance. For a carrier with 20 high-risk models, each requiring this level of detail, the Exhibit C response becomes a multi-week project involving actuarial, data science, IT, legal, and compliance teams.
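One way to scope that project is to check each high-risk model record against the disclosure categories listed above before drafting begins. The sketch below assumes a carrier keeps model records as simple key-value entries; the field names are hypothetical labels for the categories in the text (with bias testing split out as its own field), not the exhibit's actual wording.

```python
# Hypothetical field names standing in for the Exhibit C disclosure categories.
REQUIRED_DISCLOSURES = (
    "development_process",
    "training_data",
    "testing_performed",
    "bias_testing",
    "human_in_the_loop",
    "monitoring_cadence",
    "compliance_documentation",
)

def missing_disclosures(model_record: dict[str, str]) -> list[str]:
    """List required disclosure fields that are absent or blank in a model record."""
    return [
        field for field in REQUIRED_DISCLOSURES
        if not model_record.get(field, "").strip()
    ]

# Illustrative record with gaps; the contents are invented for the example.
record = {
    "development_process": "Gradient-boosted trees, built in-house",
    "training_data": "Policy and loss history",
    "testing_performed": "Holdout validation",
    "bias_testing": "",
    "human_in_the_loop": "Underwriter reviews referrals",
}

gaps = missing_disclosures(record)
print(gaps)
```

Running the same check across all high-risk models gives a rough gap list that the actuarial, data science, and compliance teams can divide up.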
The pushback on Exhibit C has centered on two specific concerns. First, the "high-risk" classification is currently self-reported by the insurer, which creates a strategic tension: classify too few models as high-risk and risk a regulatory challenge to the methodology; classify too many and face a proportionally larger documentation burden. Several insurers have told regulators that they need clearer guidance on the boundary between high and medium risk before the tool is finalized. The four-tier risk taxonomy the Working Group proposed at the Spring 2026 meeting is the NAIC's attempt to provide that clarity, but it remains in draft form.
Second, Exhibit C's requirement to detail model inputs raises trade secret and competitive sensitivity concerns. A pricing model's feature set, the specific variables it weights most heavily, and the data sources it consumes are competitive differentiators. Insurers have argued that disclosing this level of detail to a regulator, even under examination confidentiality protections, creates exposure if materials are subject to state freedom-of-information requests or shared across state lines through the NAIC's Market Analysis procedures. The Working Group has responded by emphasizing that all information collected through the pilot is protected under the administering state's confidentiality statutes, and that the legal foundation for the pilot is existing examination authority, not a new disclosure regime. Whether that assurance is sufficient to resolve the trade secret concern will depend on the specific language in the final adopted version of the tool.
The Four-Exhibit Framework as Permanent Infrastructure
One of the less-discussed implications of the mid-pilot data is that the four-exhibit structure appears to be working well enough that regulators are treating it as permanent examination infrastructure rather than a temporary pilot artifact. The sequential design, where Exhibit A screens and Exhibits B through D provide escalating depth, mirrors the structure of existing financial examination workpapers. State examiners trained on the tool during the pilot are already incorporating its terminology and categories into their standard market conduct processes.
Exhibit A: AI inventory. The opening exhibit asks companies to catalog every AI system in production by type, use case, consumer impact level, and financial materiality. For carriers that have already built model inventories in line with the governance practices identified in the AM Best survey, Exhibit A is a formatting exercise. For carriers that have not, building the inventory from scratch is a substantial project. The AM Best survey found that 68% of insurers rely on third-party AI vendors but only 18% track vendor model risk systematically. Those vendors' models need to appear in the Exhibit A count, even when the carrier has limited visibility into the model's internal workings.
Exhibit B: governance risk assessment. Exhibit B assesses the insurer's AI governance program: roles and responsibilities, board-level oversight, vendor management, monitoring procedures, and integration with Enterprise Risk Management and ORSA processes. Participating states can choose between a narrative format and a checklist format. Early pilot experience suggests the checklist format produces more consistent and comparable responses, and several states have already defaulted to it. The Committee of Annuity Insurers' earlier request to eliminate the narrative option appears likely to be adopted in the September revision.
Exhibit C: high-risk model details. As discussed above, this exhibit is the compliance pressure point. The September revision will need to address the high-risk classification boundary, the trade secret concern, and the practical question of how much documentation regulators can realistically review within a standard examination cycle.
Exhibit D: data inputs. The fourth exhibit traces the upstream data feeding AI systems: external versus internal sources, third-party data licensing arrangements, training data composition, and the lineage of inputs into models that influence consumer outcomes. Exhibit D is the exhibit most likely to be invoked in response to a consumer complaint, because it maps the causal chain from data source to decision output. Market conduct examiners investigating a specific adverse action can use Exhibit D to understand whether the underlying data inputs were appropriate, current, and free from prohibited discrimination.
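The Exhibit A inventory that opens this sequence can be pictured as a list of structured records rolled up into a few headline counts. The sketch below is a minimal illustration under stated assumptions: the record fields follow the four attributes named in the Exhibit A description plus a vendor flag, while the class name, the impact levels, and the sample entries are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class InventoryEntry:
    name: str
    use_case: str              # e.g. "pricing", "underwriting", "operations"
    consumer_impact: str       # hypothetical levels: "high", "medium", "low"
    financially_material: bool
    third_party_vendor: bool   # vendor models still count toward the inventory

def summarize(entries: list[InventoryEntry]) -> dict[str, object]:
    """Roll an inventory up into the counts a reviewer might scan first."""
    return {
        "total": len(entries),
        "by_use_case": dict(Counter(e.use_case for e in entries)),
        "by_consumer_impact": dict(Counter(e.consumer_impact for e in entries)),
        "vendor_supplied": sum(e.third_party_vendor for e in entries),
    }

# Invented sample entries for illustration.
entries = [
    InventoryEntry("rate relativity engine", "pricing", "high", True, False),
    InventoryEntry("roof condition imagery score", "underwriting", "high", True, True),
    InventoryEntry("internal handbook chatbot", "operations", "low", False, False),
]

summary = summarize(entries)
print(summary)
```

Carrying the vendor flag on every record keeps third-party models in the count even when the carrier cannot document their internals.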
What Celent, Capgemini, and Carrier Surveys Tell Us About the Denominator
The pilot data is valuable precisely because it comes from regulatory submissions, not voluntary surveys. But placing the pilot findings in context requires understanding the broader AI adoption landscape that the pilot is sampling from.
Celent's 2026 global insurer survey found that 48% of insurers now run generative AI in production, a figure that approaches the 50% mark at which late-majority adoption begins. Capgemini's 344-executive survey found that 42% of P&C insurers have never measured AI outcomes, even as they deploy the technology. And the Conning-Datos survey reported that 82% of insurers have adopted AI in some form, but only 7% have reached enterprise scale.
These three data points create the context for the pilot's proportionality challenge. If 82% of insurers use AI but only 7% are at scale, the evaluation tool needs to differentiate between a carrier running two proof-of-concept chatbots and a carrier running 300 production models across every line of business. The proportionality principle is designed to handle this range, but the mid-pilot data suggests the distribution is skewed: carriers selected for the pilot tend to have larger and more mature AI footprints than the industry median, because state regulators prioritized companies with the most significant AI deployments for the first round of evaluations.
The measurement gap Capgemini identified, where 42% of P&C insurers have never measured AI outcomes even as they deploy the technology, is directly relevant to Exhibit C's testing and monitoring requirements. A carrier that cannot demonstrate ongoing model performance monitoring will struggle to complete Exhibit C in a way that satisfies an examiner, regardless of whether the model is performing well in practice.
Timeline to November: Revision, Re-Exposure, and the Adoption Vote
The pilot's operational timeline runs through September 2026, with the Working Group collecting feedback from participating states on a rolling basis and through monthly coordination calls. The sequence from here is as follows.
July 22, 2026: The second actuarial panel meets, focusing on model validation practices and how actuarial standards (ASOP No. 56, ASOP No. 12, and the newly revised ASOP No. 23) interact with the evaluation tool's documentation expectations. This panel is expected to produce practical guidance that could influence the September revision.
September 2026: The pilot concludes. The Working Group will compile findings from all 12 states and begin revising the tool based on examiner feedback, insurer comments, and the actuarial panel recommendations. Key revision targets include the Exhibit C high-risk classification boundary, the checklist-versus-narrative format question for Exhibit B, and the trade secret protections for Exhibit D data input disclosures.
October 2026: The revised tool is expected to be re-exposed for public comment. Based on the NAIC's standard exposure process, the comment period will likely run 30 to 45 days, landing responses between late October and mid-November if the re-exposure happens early in the month. Industry groups have already signaled they will submit comments on the revised version, particularly if the high-risk classification methodology and trade secret protections have not been addressed to their satisfaction.
November 2026 Fall National Meeting: The Working Group plans to submit the revised tool for adoption at the Fall meeting. Adoption at this stage means the tool becomes an officially sanctioned examination instrument that any state can deploy under its existing market conduct and financial examination authority. It does not require legislative action at the state level, because the tool operates within the existing examination framework, not as a new regulatory mandate.
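How tight the re-exposure step is depends on the exposure date, which has not been announced. The sketch below works through the arithmetic for a 30-to-45-day comment period using Python's standard `datetime` module; the October 1 exposure date and the function name are assumptions chosen purely for illustration.

```python
from datetime import date, timedelta

def comment_window(exposure_date: date, min_days: int = 30,
                   max_days: int = 45) -> tuple[date, date]:
    """Return the earliest and latest comment deadlines for a given exposure date."""
    return (
        exposure_date + timedelta(days=min_days),
        exposure_date + timedelta(days=max_days),
    )

# Hypothetical exposure date; the actual date has not been announced.
earliest, latest = comment_window(date(2026, 10, 1))
print(earliest.isoformat(), latest.isoformat())
```

Under that assumption, comments would close between October 31 and November 15, and every day the exposure slips moves both dates later by the same amount.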
Measured against NAIC working group timelines over the past several years, the March-to-November trajectory is aggressive but consistent with how the Working Group operated on the Model Bulletin, which moved from concept to adoption in roughly 14 months in 2022 and 2023. The difference is that the Model Bulletin was principles-based and did not require carriers to produce specific documentation in a specific format. The Evaluation Tool is procedural and will generate compliance requirements at the operational level. That distinction is why the industry pushback has been more pointed than it was during the bulletin process.
What Happens After Adoption
If the tool is adopted at the Fall National Meeting, individual states will decide whether and when to deploy it. The 12 pilot states will have a head start: their examiners are already trained, their domestic insurers have already been through at least one cycle of the exhibits, and the procedural infrastructure is in place. Non-pilot states will need to train examination staff, which the NAIC is preparing for through examiner education modules that are being developed in parallel with the pilot.
The question of whether the evaluation tool leads to a model law remains formally unanswered. The Working Group's position is that pilot feedback will inform that decision. The 33 comment letters filed in response to the May 2025 RFI on a potential AI model law revealed a deep split between insurers favoring principles-based guidance and consumer groups pushing for prescriptive statutory requirements. The fault lines mapped in those comment letters have not shifted, but the existence of a functioning evaluation tool changes the calculus: if the tool works as an examination instrument without legislative backing, the case for a model law becomes harder to make. If the tool's limitations become apparent during the pilot, particularly if carriers in non-participating states refuse to cooperate without statutory compulsion, the case for a model law strengthens.
NCOIL's parallel Model Act focusing narrowly on the "qualified human professional" requirement for claims denials continues to advance on its own track. Several states may adopt the NCOIL model independently of the NAIC's work, creating overlapping requirements for multi-state carriers. The regulatory pushback on fully automated claims decisions has intensified as the pilot confirms that claims AI operates with less human oversight than any other application category.
Why This Matters for Actuarial Teams
The mid-pilot findings have direct implications for several actuarial work streams, whether or not a carrier is in one of the 12 participating states.
Model inventory as a baseline requirement. Exhibit A establishes that every insurer should be able to produce a current, categorized inventory of AI models in production. For actuaries responsible for pricing or reserving models that incorporate AI components, this means those models need to be identifiable, documented, and classifiable by risk tier. The work of maintaining that inventory overlaps with ASOP No. 56's documentation requirements, but the evaluation tool expects a more structured and centralized format than many companies currently maintain.
High-risk classification touches pricing and reserving. Any pricing model that uses ML-generated risk features is a candidate for high-risk classification under the evaluation tool. The same applies to reserving models that use AI for claims severity prediction or IBNR estimation. Actuaries should be involved in the high-risk classification exercise, because the boundary between "high" and "medium" risk for a pricing model is an actuarial judgment call, not purely a compliance decision.
Vendor AI requires the same documentation. The AM Best finding that 68% of insurers rely on third-party AI vendors but only 18% track vendor risk means that most carriers will need to expand their vendor documentation to satisfy Exhibit B and Exhibit D. For actuaries using vendor-supplied scoring models, telematics data, or aerial imagery outputs, the implication is that vendor contracts may need audit rights, model documentation provisions, and change notification clauses that are not currently standard. The NIST agent standards framework provides a parallel set of governance expectations that vendor procurement teams should be mapping against the NAIC's exhibit requirements.
Bias testing documentation becomes examinable. The evaluation tool's Exhibit C asks specifically about bias testing. The Market Conduct Modernization Working Group formed at Spring 2026 is developing examination methodologies that translate the evaluation tool's bias testing questions into actionable examiner procedures. For actuaries, this means that the bias testing documented in rate filings needs to be consistent with the bias testing reported under Exhibit C, because a regulator could reasonably compare the two and ask about discrepancies.
The November timeline creates a preparation window. Carriers that are not in the pilot have roughly five months before the tool is potentially adopted as a standard examination instrument. That window is sufficient to build or refine a model inventory, establish a high-risk classification methodology, review vendor contracts for documentation gaps, and align bias testing procedures across rate filing and governance reporting. Waiting for adoption to begin this work compresses the preparation into whatever timeline the domestic regulator sets for its first deployment of the tool.
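The alignment of bias testing across rate filings and governance reporting lends itself to a simple reconciliation check. The sketch below is a minimal example, assuming each source can be reduced to a mapping from model name to the set of bias tests documented for it; the function name, model names, and test labels are all hypothetical.

```python
def reconcile_bias_tests(
    rate_filing: dict[str, set[str]],
    exhibit_c: dict[str, set[str]],
) -> dict[str, dict[str, set[str]]]:
    """For each model, report tests documented in one source but not the other."""
    discrepancies: dict[str, dict[str, set[str]]] = {}
    for model in sorted(rate_filing.keys() | exhibit_c.keys()):
        filed = rate_filing.get(model, set())
        reported = exhibit_c.get(model, set())
        if filed != reported:
            discrepancies[model] = {
                "only_in_rate_filing": filed - reported,
                "only_in_exhibit_c": reported - filed,
            }
    return discrepancies

# Invented sample data for illustration.
rate_filing_tests = {
    "auto_risk_score": {"disparate impact ratio", "proxy variable review"},
}
exhibit_c_tests = {
    "auto_risk_score": {"disparate impact ratio"},
    "renewal_flag_model": {"proxy variable review"},
}

result = reconcile_bias_tests(rate_filing_tests, exhibit_c_tests)
for model, diff in result.items():
    print(model, diff)
```

An empty result would mean the two records agree; any entry is a discrepancy a regulator comparing the filings could reasonably ask about.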
Sources
- NAIC Big Data and Artificial Intelligence (H) Working Group
- NAIC Insurance Topics: Artificial Intelligence
- NAIC AI Systems Evaluation Tool Draft 4.0
- NAIC AI Systems Evaluation Tool Pilot Project Summary
- Fenwick: NAIC Expands AI Systems Evaluation Tool Pilot Program to 12 States
- Monitaur: NAIC AI Systems Evaluation Tool Pilot, A Guide for Insurers
- InsuranceNewsNet: NAIC's 2026 AI Evaluation Pilot Moves Ahead as Industry Balks
- Repairer Driven News: Regulators Examine AI Behind Claims Payouts
- NAIC Request for Information: AI Model Law (May 2025)
- NAIC Model Bulletin: Use of Artificial Intelligence Systems by Insurers (December 2023)
- American Academy of Actuaries: Comment Letter on AI Model Law RFI
- Celent: 2026 Global Insurer GenAI Production Survey
- Capgemini: World Property and Casualty Insurance Report 2025
- Swept AI: NAIC 12-State Pilot Analysis
- Foley & Lardner via Mondaq: What To Do If You Receive a NAIC AI Systems Evaluation Tool Pilot Request
Further Reading
- NAIC AI Evaluation Pilot Launches Amid Industry Pushback: the original pilot structure, the six-association joint letter, and the full four-exhibit breakdown.
- NAIC Targets AI in Claims Handling at Spring 2026: how the Working Group flagged claims for additional scrutiny, the 88% auto insurer AI adoption rate, and state claims AI laws.
- State AI Law Patchwork Forces Carriers Into Four Compliance Regimes: the evaluation pilot as a de facto fourth compliance layer alongside Connecticut SB 5, Colorado SB 26-189, and Texas TRAIGA.
- NAIC Four-Tier AI Risk Taxonomy Redefines Insurer Compliance: the proposed risk classification framework that will anchor the Exhibit C high-risk boundary.
- Market Conduct Modernization Working Group Targets Exam Frameworks: the new (D) Committee body translating pilot findings into structural exam methodology.
- 68% of Insurers Outsource AI, Only 18% Track Vendor Risk: the AM Best vendor governance gap that Exhibits B and D are designed to surface.
- NAIC Weighs Jump From AI Bulletin to Enforceable Model Law: the 33 RFI comment letters and the fault lines shaping whether the evaluation tool leads to legislation.
- Automated Claims Decisions Face Regulatory Pushback: how the pilot intersects with state unfair claims settlement acts and human-in-the-loop requirements.
- NIST AI Agent Standards Set the Compliance Baseline: the parallel federal governance framework relevant to vendor procurement and agent-based AI architectures.