Reviewing the compliance reporting templates emerging from the 12-state evaluation tool pilot makes one thing clear: the validation gap between traditional GLM documentation and what regulators now expect for AI models is wider than most carriers realize. A generalized linear model rate filing historically required coefficient tables, relativity charts, and a narrative connecting rating variables to loss experience. A gradient-boosted decision tree or neural network used for territorial risk segmentation, claims triage scoring, or multivariate rate optimization requires an entirely different documentation stack: data lineage tracing, feature-to-factor mapping, bias testing evidence, model drift monitoring protocols, and regulatory-grade explanations that translate SHAP values into language examiners can act on.
This article provides a step-by-step validation workflow for actuaries preparing ML-augmented rate filings. It bridges the gap between ASOP No. 56 (Modeling) documentation requirements and the proposed NAIC compliance reporting structure, drawing on the four-tier risk taxonomy presented at the Spring 2026 National Meeting, state-level precedents from Colorado and New York, and patterns we have seen in deficiency notices across recent predictive model filings.
The 24-State Adoption Landscape: Who Enforces What
The NAIC Model Bulletin on the Use of Artificial Intelligence Systems by Insurers, adopted in December 2023, has been embraced by 24 states and the District of Columbia as of March 2025. Four additional states have layered insurance-specific AI regulations on top. The bulletin itself is principles-based guidance, not binding regulation, but the adoption pattern matters for rate filing actuaries because the documentation expectations it creates are now baked into market conduct examination protocols across nearly half the country.
Not all adopting states treat the bulletin with equal weight. Tracking adoption dates and enforcement patterns reveals three tiers of regulatory intensity:
| Enforcement Tier | States | Practical Impact on Rate Filings |
|---|---|---|
| Active enforcement with supplemental requirements | Colorado (SB 21-169 + CAIA), New York (Circular Letter No. 7), Connecticut, Illinois | Separate predictive model supplemental filings required; bias testing documentation mandatory; compliance reports due on specific deadlines (Colorado: July 1, 2026 for auto and health); filings returned with deficiency notices when AI documentation is inadequate |
| Bulletin adopted, examination integration underway | California, Maryland, Pennsylvania, Virginia, Wisconsin, Iowa, Florida, Rhode Island, Vermont (nine of the twelve evaluation tool pilot states) | AI governance questions incorporated into market conduct exams; evaluation tool pilot running March through September 2026; regulators building competency through weekly coordination calls; documentation requests flowing to domestic insurers during exams |
| Bulletin adopted, enforcement deferred | Alaska, Arkansas, Delaware, Hawaii, Kentucky, Massachusetts, Michigan, Nebraska, Nevada, New Hampshire, New Jersey, North Carolina, Oklahoma, Washington, West Virginia | Bulletin language referenced in regulatory correspondence; no systematic AI-specific examination protocols yet; compliance preparedness varies by carrier size and sophistication |
For multi-state carriers, the practical question is whether to build separate validation documentation for each jurisdiction or construct a single framework that satisfies the strictest adopters. The cost-benefit math tilts decisively toward the latter. Colorado and New York together account for enough premium volume that any carrier writing personal lines nationally needs to meet their standards. Carriers building documentation to Colorado’s SB 21-169 requirements, which mandate bias testing across nine protected classes with four-fifths rule thresholds and annual compliance reports, will satisfy the documentation expectations of every other adopting state.
ASOP No. 56 vs. the NAIC Compliance Report: Mapping the Gap
ASOP No. 56 (Modeling), effective since October 1, 2020, provides the actuarial profession’s documentation framework for any model, including AI and machine learning systems. It requires actuaries to understand the model, document data sources, note peer review procedures, and disclose limitations. These requirements are sound but intentionally principles-based, written before the current generation of ML pricing models reached production deployment at scale.
The NAIC’s proposed compliance report structure, presented alongside the four-tier risk taxonomy at the Spring 2026 National Meeting, goes substantially further. According to the Mayer Brown summary of the meeting, the compliance report template covers seven distinct areas. Here is how each maps against existing ASOP No. 56 requirements:
| NAIC Compliance Report Component | ASOP No. 56 Coverage | Gap for AI Rate Filings |
|---|---|---|
| Executive summary and reporting authors’ credentials | Partially covered: ASOP 56 requires identifying the actuary and scope of work | NAIC template requires board oversight documentation and organizational AI governance chart; ASOP 56 does not address governance committee structure |
| Internal and external data source assessment | Covered: Section 3.2 requires documenting data and their sources | NAIC template requires provenance tracking, quality controls, representativeness assessments, and proxy variable identification for each data source; ASOP 56 does not specify this granularity |
| Risk assessment framework | Not covered directly | NAIC four-tier risk taxonomy classification (unacceptable, high, medium, low) required for each AI system; no ASOP analog exists for risk-tiered model governance |
| Model inventory with standardized model cards | Partially covered: ASOP 56 requires model documentation but no standard format | NAIC template specifies model card fields: system ID, intended use, training data summary, performance metrics, bias testing results, monitoring cadence, third-party components, and known failure modes |
| Governance structure and oversight | Partially covered: peer review discussed but organizational governance not specified | NAIC template requires cross-functional governance committee documentation, vendor risk management programs, and human-in-the-loop decision architecture |
| Drift testing and validation methods | Partially covered: ASOP 56 requires ongoing validation but no monitoring framework | NAIC template requires specific drift detection mechanisms, revalidation trigger criteria, and continuous monitoring cadence for production ML systems |
| Protected class inference and bias testing protocols | Not covered: ASOP 56 does not address fairness testing | NAIC template requires methodology, frequency, statistical thresholds, and remediation documentation for disparate impact testing across protected classes |
The gap is widest in two areas: bias testing and model drift monitoring. ASOP No. 56 was designed for a world where actuaries built and understood their own models. It assumes the actuary can examine coefficient values, test assumptions against experience, and articulate why each rating variable is actuarially justified. Machine learning models challenge each of those assumptions. A gradient-boosted model with 200 features does not produce coefficients in the traditional sense. Its variable interactions are learned from data, not specified by the modeler. The documentation workflow required to demonstrate that such a model is actuarially sound, free from unfair discrimination, and stable over time is fundamentally different from what ASOP 56 contemplated.
Documenting Actuarial Soundness for ML Pricing Models
The central challenge for actuaries filing AI-augmented rate indications is answering a deceptively simple regulatory question: how do you prove that a black-box model produces actuarially justified rates?
For a GLM, the answer is embedded in the model structure. Each coefficient corresponds to a rating variable with a clear directional relationship to loss experience. A ZIP code relativity of 1.35 means that territory produces 35% more losses than the base level, and the actuary can verify that relationship against actual loss ratios. For a gradient-boosted decision tree or a neural network, no such direct mapping exists. The model learns complex, non-linear interactions among hundreds of features, and the relationship between any single input and the output depends on the values of all other inputs.
Carriers navigating this challenge are converging on a four-component validation approach:
1. Champion/Challenger Testing Against the Filed GLM
The most effective validation strategy we have seen treats the existing filed GLM as the champion model and the ML system as the challenger. This does not mean the ML model must replicate GLM results. It means the documentation must show, quantitatively, where and why the ML model diverges from the filed rate structure and demonstrate that those divergences are actuarially justified.
In practice, this means running both models against the same holdout dataset and producing side-by-side comparisons of predicted versus actual loss ratios across all rating dimensions. Where the ML model outperforms the GLM (better separation of good and bad risks, more accurate territorial pricing, improved frequency-severity decomposition), the filing narrative should quantify the improvement. Where the models diverge without a clear actuarial explanation, the actuary needs to investigate whether the ML model is picking up signal from variables that correlate with protected classes.
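The holdout comparison described above can be sketched in a few lines. This is a minimal illustration, not a filing standard: the record fields (`earned_premium`, `incurred_loss`, `champion_pred`, `challenger_pred`) and the territory segmentation are hypothetical stand-ins for whatever the carrier's data warehouse actually exposes.

```python
from collections import defaultdict

def loss_ratio_comparison(policies, segment_key="territory"):
    """Side-by-side actual vs. predicted loss ratios per rating segment.

    Each policy record is a dict with hypothetical fields:
    earned_premium, incurred_loss, champion_pred (filed GLM predicted
    loss), and challenger_pred (ML model predicted loss).
    """
    agg = defaultdict(lambda: {"prem": 0.0, "loss": 0.0,
                               "champ": 0.0, "chall": 0.0})
    for p in policies:
        seg = agg[p[segment_key]]
        seg["prem"] += p["earned_premium"]
        seg["loss"] += p["incurred_loss"]
        seg["champ"] += p["champion_pred"]
        seg["chall"] += p["challenger_pred"]

    # One row per segment: where challenger_lr tracks actual_lr more
    # closely than champion_lr, the filing narrative quantifies the gain.
    return {
        name: {
            "actual_lr": s["loss"] / s["prem"],
            "champion_lr": s["champ"] / s["prem"],
            "challenger_lr": s["chall"] / s["prem"],
        }
        for name, s in agg.items()
    }
```

The same aggregation runs across every rating dimension in turn (territory, vehicle age, coverage), producing the side-by-side exhibits the filing narrative references.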
2. Feature-to-Factor Mapping
This is where most rate filings using ML models encounter deficiency notices. Every feature in the model must be explicitly mapped to a filed rating factor. Regulators require documentation showing how model features "roll up into, support, or modify" filed factors. Where the mapping is not one-to-one, the filing must provide actuarial justification and confirm compliance with state rating laws.
For a gradient-boosted model using 47 features to produce territorial risk scores, that mapping exercise is substantial. Some features may correspond directly to filed factors (vehicle age, driver age, coverage limits). Others may be engineered features (interaction terms, rolling averages, spatial features derived from geocoded addresses) that have no direct analog in the filed rate manual. Each of those engineered features needs documentation explaining its construction, its relationship to loss experience, and why it does not serve as a proxy for a protected characteristic.
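A mapping registry makes the gap analysis mechanical. The sketch below is one possible structure, with entirely hypothetical feature and factor names; the point is that every feature either resolves to a filed factor or carries a pointer to its standalone actuarial justification, and anything that does neither is surfaced before the regulator finds it.

```python
# Each model feature maps to a filed rating factor, or is flagged as an
# engineered feature whose justification lives in a named filing appendix.
# Feature and factor names below are illustrative, not from any filing.
FEATURE_FACTOR_MAP = {
    "vehicle_age":   {"filed_factor": "Vehicle Age",    "engineered": False},
    "driver_age":    {"filed_factor": "Driver Age",     "engineered": False},
    "coverage_limit": {"filed_factor": "Coverage Limit", "engineered": False},
    "garaging_density_3mi": {"filed_factor": None, "engineered": True,
                             "justification_doc": "appendix-spatial-features"},
}

def unmapped_features(model_features, mapping):
    """Return features with no filed-factor mapping and no justification
    document -- the gaps most likely to draw a deficiency notice."""
    gaps = []
    for f in model_features:
        entry = mapping.get(f)
        if entry is None:
            gaps.append(f)  # feature absent from the registry entirely
        elif entry["filed_factor"] is None and not entry.get("justification_doc"):
            gaps.append(f)  # engineered feature with no documented basis
    return gaps
```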
3. Explainability Artifacts: SHAP, PDP, and the Regulatory Translation Problem
The insurance industry has broadly adopted SHAP (SHapley Additive exPlanations) values and partial dependence plots (PDPs) as the standard explainability toolkit for ML pricing models. SHAP values decompose each individual prediction into additive contributions from each feature, showing how much each variable pushed a specific prediction above or below the model’s average. PDPs show how the model’s predictions change as a single feature varies while all others are held constant.
These are technically sound validation tools. They are not, by themselves, regulatory explanations.
This distinction is the source of more deficiency notices in predictive model rate filings than any other single artifact. A SHAP plot is a graph. An examiner needs a narrative artifact connecting the technical output to regulatory language: filed rating factors, actuarial justification under state rating laws, protected class analysis, and disparate impact assessment. Carriers that submit SHAP summary plots without the accompanying regulatory narrative find their filings stalled in deficiency notice cycles.
The validation workflow must produce two separate outputs from the same underlying analysis:
- Technical validation artifacts. SHAP value distributions (global and local), partial dependence and individual conditional expectation (ICE) plots for each feature, Friedman-Popescu H-statistics quantifying interaction strength between features, and permutation importance rankings. These serve the internal model governance function and satisfy ASOP No. 56’s requirement that the actuary understand the model.
- Regulatory explanation documents. Written narratives in examiner language that open with a plain-language description of the model’s scope, document data lineage, map features to filed rating factors with actuarial justification for each, present fairness testing results with specific proxy variables evaluated, and describe the conditions under which human review overrides model output. These are the actual filing artifacts.
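Production workflows generate these technical artifacts with libraries such as `shap` and scikit-learn's inspection module; as a self-contained illustration of one of them, here is permutation importance computed from scratch. The `predict` callable and feature dicts are toy stand-ins, and the repeat count and seed are arbitrary choices, not recommendations.

```python
import random

def permutation_importance(predict, rows, actuals, feature,
                           n_repeats=5, seed=0):
    """Permutation importance for one feature: how much does mean
    absolute error worsen when that feature's column is shuffled?

    `predict` takes a list of feature dicts and returns predictions;
    `rows` are feature dicts; `actuals` are observed targets.
    """
    rng = random.Random(seed)
    base_mae = sum(abs(p - a)
                   for p, a in zip(predict(rows), actuals)) / len(rows)
    increases = []
    for _ in range(n_repeats):
        # Shuffle only the target feature, leaving all others intact.
        vals = [r[feature] for r in rows]
        rng.shuffle(vals)
        shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, vals)]
        mae = sum(abs(p - a)
                  for p, a in zip(predict(shuffled), actuals)) / len(rows)
        increases.append(mae - base_mae)
    return sum(increases) / n_repeats
```

A feature the model ignores scores zero; features driving predictions score higher, giving the ranking that feeds the internal governance file, while the regulatory narrative translates that ranking into filed-factor language.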
4. Out-of-Time Validation and Stability Testing
Regulators increasingly expect rate filings to demonstrate that the ML model performs consistently across different time periods, not just on the validation dataset used during development. Out-of-time validation trains the model on data from one period and tests it on data from a subsequent period, simulating how the model will perform on future business.
For ML pricing models, this testing goes beyond traditional validation. Gradient-boosted models and neural networks can overfit to temporal patterns in training data (loss trends driven by a specific catastrophe season, pandemic-era driving behavior changes, transient economic conditions) in ways that GLMs, with their simpler structure, generally do not. The filing should include out-of-time performance metrics across at least three non-overlapping holdout periods, demonstrating stable lift and discrimination across different market conditions.
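One way to operationalize the stability check across holdout periods is a simple lift metric recomputed per period, with a relative-range tolerance. Both the lift definition (worst-predicted decile versus the overall book) and the 15% tolerance below are illustrative choices, not regulatory thresholds.

```python
def top_decile_lift(predictions, actuals):
    """Discrimination metric for one holdout period: mean actual loss of
    the worst-predicted decile, relative to the overall mean."""
    paired = sorted(zip(predictions, actuals),
                    key=lambda t: t[0], reverse=True)
    k = max(1, len(paired) // 10)
    top_mean = sum(a for _, a in paired[:k]) / k
    overall = sum(actuals) / len(actuals)
    return top_mean / overall

def stable_across_periods(lifts, tolerance=0.15):
    """Flag instability if lift varies by more than `tolerance`
    (relative to the best period) across the out-of-time holdouts."""
    lo, hi = min(lifts), max(lifts)
    return (hi - lo) / hi <= tolerance
```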
Carriers with models that retrain on rolling data windows face an additional question: does the regulatory approval attach to the model architecture, the specific trained instance, or an output tolerance band? No state has issued definitive guidance on this point. The safest approach, based on patterns we have seen in deficiency notices on recent predictive model filings, is to define materiality thresholds for model output changes (for example, a 2% shift in statewide average premium or a 5% change in any individual territory relativity) and file a new rate indication when those thresholds are breached.
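The materiality check itself is straightforward to automate after each retrain. This sketch uses the example 2%/5% thresholds mentioned above; the dict layout is a hypothetical representation of filed versus retrained model output.

```python
def retrain_requires_filing(filed, retrained,
                            statewide_tol=0.02, territory_tol=0.05):
    """Return True if retrained output breaches the filed tolerance band.

    `filed` and `retrained` are dicts of the form:
    {"avg_premium": float, "territory_relativities": {name: float}}.
    Thresholds mirror the illustrative 2% statewide / 5% per-territory
    materiality example in the text.
    """
    shift = abs(retrained["avg_premium"] - filed["avg_premium"]) \
            / filed["avg_premium"]
    if shift > statewide_tol:
        return True
    for terr, rel in filed["territory_relativities"].items():
        new_rel = retrained["territory_relativities"][terr]
        if abs(new_rel - rel) / rel > territory_tol:
            return True
    return False
```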
The Four-Tier Risk Taxonomy and Validation Rigor
The NAIC four-tier risk taxonomy introduced at the Spring 2026 National Meeting directly affects how much validation documentation a rate filing needs. The taxonomy classifies AI systems into four levels (unacceptable, high, medium, low), and the compliance requirements scale accordingly.
Most AI systems used in rate filings fall into the high-risk tier. The NAIC staff presentation defined high-risk systems as those with "potential for significant harm if failure or misuse occurs." A pricing model that sets final rates, a territorial risk segmentation algorithm, or an underwriting triage system with binding authority all qualify. Under the proposed framework, high-risk systems require the full compliance report, a standardized model card, bias testing evidence, continuous drift monitoring, and human-in-the-loop protocols.
The practical implication for rate filing actuaries: every ML model that materially influences filed rates needs the complete validation stack. There is no abbreviated pathway for AI-augmented pricing.
Low-risk systems (internal analytics dashboards, workflow automation, document extraction tools) require only inventory tracking in Exhibit A of the evaluation tool. Medium-risk systems (customer-facing chatbots, AI-assisted but not AI-decided claims triage) require transparency disclosure and periodic review. The resource allocation question for carriers is straightforward: concentrate validation investment on the high-risk pricing and underwriting models, and apply proportionately lighter-touch documentation to everything else.
Bias Testing: From Colorado Precedent to National Standard
Bias testing for AI rate filings has evolved from a single-state requirement to an expected national standard. Colorado’s SB 21-169 was the first state law to require insurers to test AI systems for discriminatory outcomes across protected classes. Its implementing regulation (Colorado Insurance Regulation 10-1-1) specifies testing methodologies, including disparate impact analysis using the four-fifths rule as a baseline: if the selection rate for any protected class falls below 80% of the rate for the most favored class, the model produces disparate impact.
New York’s Circular Letter No. 7, issued in July 2024, extended similar expectations to all insurers authorized in New York. The circular requires analysis of AI systems for unfair and unlawful discrimination, demonstration of actuarial validity, maintenance of a governance framework, and appropriate transparency. While framed as guidance rather than regulation, the NYDFS’s examination authority means non-compliance invites scrutiny.
The NAIC compliance report template formalizes these expectations nationally. The bias testing section requires:
- Protected classes tested. At minimum, the classes specified in state unfair discrimination statutes. Colorado’s list of nine classes (race, color, national or ethnic origin, religion, sex, sexual orientation, disability, gender identity, gender expression) is the current high-water mark.
- Proxy variable evaluation. Documentation of which model features were evaluated as potential proxies for protected characteristics, the methodology used (correlation analysis, causal inference, Bayesian Improved Surname Geocoding for race/ethnicity estimation), and the results.
- Statistical methodology and thresholds. The specific test applied (disparate impact ratio, statistical parity, equalized odds), the threshold used, and the rationale for selecting that threshold.
- Remediation documentation. When bias testing identifies disparities, the filing must document what remediation was applied (feature removal, constraint optimization, post-processing adjustments) and provide post-remediation test results showing the disparity was resolved without materially degrading model performance.
The proxy variable question is particularly acute for ML pricing models. ZIP code interacted with vehicle age, credit score, and commute distance may produce discriminatory rate patterns even if no single variable shows a concerning correlation with a protected class in isolation. Testing individual features for proxy effects is necessary but insufficient. The filing must also test the model’s aggregate output, the actual rates produced, against protected class distributions estimated through proxy methodologies such as BISG.
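The four-fifths rule check on aggregate model output reduces to a ratio of per-class favorable-outcome rates. The sketch below assumes class membership has already been estimated (for example, via BISG) and that "favorable" has been defined for the book, such as receiving a below-average rate; both are modeling decisions the filing must document separately.

```python
def disparate_impact_ratios(favorable_counts, totals, threshold=0.8):
    """Four-fifths rule as described in the text: flag any class whose
    favorable-outcome rate falls below `threshold` times the rate of
    the most favored class.

    favorable_counts / totals: per-class favorable-outcome counts and
    class sizes, with membership typically estimated via a proxy method
    such as BISG rather than observed directly.
    """
    rates = {c: favorable_counts[c] / totals[c] for c in totals}
    best = max(rates.values())
    return {c: {"rate": r,
                "ratio": r / best,
                "flag": r / best < threshold}
            for c, r in rates.items()}
```

Any flagged class triggers the remediation-and-retest loop documented in the compliance report.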
Model Drift Monitoring: Building the Continuous Compliance Pipeline
Traditional GLM rate filings are point-in-time documents. The model is built, validated, filed, and left largely unchanged until the next rate revision. ML models break this paradigm. They can retrain on rolling data windows, and their performance can degrade as the underlying data distribution shifts (new vehicle types, changes in driving patterns, shifts in claims reporting behavior, emerging perils).
The NAIC compliance report template requires carriers to document specific drift detection mechanisms, revalidation trigger criteria, and continuous monitoring cadence. For rate filing actuaries, this means building three monitoring components into the production model pipeline:
- Input drift detection. Statistical tests (Kolmogorov-Smirnov, Population Stability Index, chi-squared tests for categorical variables) comparing the distribution of incoming rating data against the training data distribution. When input distributions shift beyond predefined thresholds, the system flags the model for review.
- Output drift monitoring. Tracking actual-versus-expected loss ratios at a granular level (by territory, coverage, policy characteristics) against the model’s predictions. Systematic divergence indicates the model’s learned relationships no longer hold. The monitoring cadence should match the rate review cycle; quarterly is typical for personal lines.
- Fairness drift tracking. Bias testing is not a one-time exercise. Models approved in one period must continue to meet fairness standards as the book of business and underlying data evolve. The monitoring pipeline should recompute disparate impact metrics on the production portfolio at least annually, or whenever the model retrains, whichever comes first.
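Of the input drift tests listed above, the Population Stability Index is the simplest to sketch. The 0.25 "major drift" cutoff in the comment is a common industry rule of thumb, not a regulatory standard; each carrier's monitoring plan should document its own thresholds.

```python
import math

def population_stability_index(expected_props, actual_props, eps=1e-6):
    """PSI across matched bins of a rating variable: training-era bin
    proportions (expected) vs. incoming-data proportions (actual).

    A common rule of thumb -- not a regulatory standard -- treats
    PSI < 0.1 as stable and PSI > 0.25 as major drift warranting review.
    """
    psi = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        psi += (a - e) * math.log(a / e)
    return psi
```

The monitoring pipeline computes this per rating variable on each scoring batch and escalates any variable breaching the documented threshold.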
For carriers with models that retrain on rolling data windows, the monitoring pipeline also needs to track whether retrained model outputs remain within the tolerance band defined in the original filing. If a retrained model shifts statewide average premium by more than the materiality threshold, a new filing may be required.
Multi-State Compliance: One Framework, Multiple Jurisdictions
Building separate AI validation documentation for each of the 24 adopting states is operationally impractical. The effective strategy, validated by how the largest carriers are approaching the 12-state evaluation tool pilot, is to build a single validation framework calibrated to the strictest requirements and adapt the output format for each jurisdiction.
The framework has four layers:
- Core model documentation. Model card, training data summary, feature-to-factor mapping, performance metrics, and out-of-time validation results. This is the foundation layer and satisfies ASOP No. 56 regardless of jurisdiction.
- Bias testing and fairness evidence. Disparate impact analysis across protected classes using the broadest classification (Colorado’s nine classes), proxy variable evaluation, and remediation documentation. Building to the Colorado standard ensures coverage in all other states.
- Governance and oversight documentation. AI governance committee charter, reporting lines, vendor risk management program, human-in-the-loop architecture, and consumer complaint procedures. This satisfies the NAIC compliance report template and the potential model law requirements under development.
- State-specific adaptations. Filing format modifications for SERFF supplemental submissions, jurisdiction-specific protected class definitions, state-specific compliance report deadlines (Colorado’s July 1, 2026 deadline for auto and health), and New York’s third-party vendor audit requirements.
This layered approach means the actuarial team builds the validation artifacts once and generates jurisdiction-specific filing packages from the same underlying documentation. The incremental cost of adding a new state filing drops from a full re-validation exercise to a formatting and adaptation exercise.
The Evaluation Tool Pilot: What Actuaries Should Expect
The AI Systems Evaluation Tool pilot, running from March through September 2026 across 12 states (California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin), uses four exhibits to structure regulatory inquiries:
- Exhibit A: Quantifies AI usage across the organization, functioning as the model inventory layer.
- Exhibit B: Documents the governance risk assessment framework, including committee structure, oversight protocols, and accountability designations.
- Exhibit C: Requests detailed information on high-risk AI systems, including model architecture, training data, validation procedures, performance metrics, and bias testing results.
- Exhibit D: Covers AI data details, including a section on reasonable accommodations or policy modifications added in the version 4.0 update.
Participating states are deploying the tool within market conduct examinations, financial examinations, and general regulatory inquiries. The pilot applies a principle of proportionality, prioritizing high-risk AI systems with direct consumer impact. Regulators are holding weekly coordination calls to share insights across states.
For rate filing actuaries at carriers selected for the pilot, the exhibits represent a preview of what will likely become standard examination protocol after the tool is finalized at the NAIC Fall 2026 National Meeting. The documentation you prepare for the evaluation tool and the documentation you prepare for your rate filing should be the same documentation. Building separate artifacts for the examination and the filing doubles the work without improving compliance outcomes.
The Vendor Documentation Problem
Many carriers use third-party vendor models for components of their rate filing: credit-based insurance scores from LexisNexis or TransUnion, telematics scoring models from Cambridge Mobile Telematics or Arity, catastrophe models from Verisk or Moody’s RMS, or territorial risk overlays from various data vendors. When those vendor models incorporate machine learning, the carrier bears regulatory responsibility for validation and documentation that the vendor may not fully support.
The NAIC compliance report template requires carriers to document third-party components in their model cards and describe vendor governance sufficiency. New York’s Circular Letter No. 7 goes further, requiring contractual provisions that give the insurer the right to audit third-party vendors and obligate vendors to cooperate with regulatory inquiries. The proposed third-party vendor registry under development by the Third-Party Data and Models Working Group would add a centralized NAIC database of vendor model registrations, though outstanding questions remain about whether participation will be mandatory or voluntary.
For the rate filing actuary, the vendor documentation problem creates a practical burden: you must document and validate components of your filed model that you did not build and may not fully understand. The contractual and governance groundwork for vendor transparency needs to be in place before the rate filing process begins, not bolted on after a deficiency notice arrives.
A Ten-Point Examiner Checklist
Based on patterns we have tracked across deficiency notices and supplemental data requests in recent predictive model rate filings, ten evaluation criteria consistently surface:
1. Model identification, including all vendor-supplied components, version numbers, and deployment dates.
2. Reconstructible data lineage: source systems, extraction dates, sample construction rules, missing value handling, and transformation logic with volume metrics at each preparation stage.
3. Feature-to-factor mapping with actuarial justification for every feature, including engineered features with no direct filed factor analog.
4. Documented bias testing methodology, specifying protected classes tested, proxy variables evaluated, statistical methods and thresholds applied, and results.
5. Post-remediation test results where disparities were identified, demonstrating the disparity was resolved.
6. Human override processes, describing the conditions under which model output is subject to human review and the criteria for override.
7. Model inventory entries conforming to the model card format, with version history, training and validation dates, and ownership designation.
8. Ongoing drift and disparate impact monitoring commitment, with specific metrics, thresholds, cadence, and escalation procedures.
9. Vendor model documentation sufficiency, demonstrating that third-party model components meet the same validation standards as internally developed models.
10. Retention periods aligned with state bad-faith statutes and regulatory retention requirements for model artifacts.
Filings that address all ten points proactively move through the review process faster than those that respond to deficiency notices iteratively. A single deficiency notice cycle can add 60 to 90 days to the filing timeline, and ML model filings that fail to anticipate regulatory questions often accumulate multiple rounds.
What This Means for Actuarial Practice
The convergence of the 24-state Model Bulletin adoption, the 12-state evaluation tool pilot, and the potential model law under development creates a clear trajectory: AI model validation for rate filings is shifting from voluntary best practice to regulatory requirement. Actuaries who build their validation workflows now, before the evaluation tool is finalized and adopted at the Fall 2026 National Meeting, will be positioned for compliance. Those who wait will face a documentation retrofit exercise under deadline pressure.
Three near-term actions for rate filing actuaries:
- Inventory every AI and ML component in your filed rate models. Include vendor-supplied models, internally developed ML systems, and any automated decision logic that influences filed rates. Classify each by the NAIC four-tier risk taxonomy. This inventory becomes Exhibit A of the evaluation tool and the foundation of your compliance report.
- Build the bias testing pipeline now. Use Colorado’s nine protected classes as the testing standard. Implement BISG or equivalent proxy methodology for race and ethnicity estimation. Run disparate impact analysis on your production model’s actual rate output, not just the model’s predicted values. Document the methodology, results, and any remediation applied. This is the component most likely to generate deficiency notices if missing from a rate filing.
- Establish drift monitoring for production ML models. Implement input distribution monitoring, output actual-versus-expected tracking, and fairness metric recomputation on a cadence that matches your rate review cycle. Define materiality thresholds for when model output changes require a new filing. Document the monitoring architecture in your model card.
The governance gap between what ASOP No. 56 requires and what regulators now expect for ML models is real. Closing it requires building validation infrastructure that traditional actuarial practice never needed. The carriers that invest in that infrastructure now will find that the same documentation serves rate filings, market conduct examinations, evaluation tool responses, and compliance reports simultaneously. The validation workflow is one workflow. The filing artifacts are multiple outputs from the same foundation.
Further Reading
- NAIC Four-Tier AI Risk Taxonomy Redefines Insurer Compliance
- NAIC AI Evaluation Pilot Launches Amid Industry Pushback
- NAIC Weighs Jump From AI Bulletin to Enforceable Model Law
- AI Governance Gap in Actuarial Practice
- Predictive Analytics in Underwriting 2026
Sources
- NAIC Big Data and Artificial Intelligence (H) Working Group
- Mayer Brown: NAIC Spring 2026 Innovation Committee Update
- Alston & Bird: NAIC Spring 2026 AI and Privacy Takeaways
- Plante Moran: How the NAIC AI Model Bulletin Is Evolving
- Sidley Austin: NAIC Spring 2026 Regulatory Update
- Fenwick: NAIC Expands AI Systems Evaluation Tool Pilot to 12 States
- Quarles & Brady: Nearly Half of States Adopt NAIC Model Bulletin on AI
- Actuarial Standards Board: ASOP No. 56, Modeling
- Colorado Division of Insurance: SB 21-169 Unfair Discrimination Protections
- NYDFS Insurance Circular Letter No. 7 (2024): AI in Insurance Underwriting and Pricing
- Swept AI: Explainable AI in Insurance Underwriting
- NAIC Model Bulletin on the Use of AI Systems by Insurers (December 2023)