State rate filings work as point-in-time snapshots: the regulator reviews the model as submitted, and that specification is what approval covers. For P&C carriers that periodically retrain ML pricing models on new data, the deployed version can drift materially from that approved snapshot, pricing new business at rates no state regulator has reviewed. The NAIC’s AI Systems Evaluation Tool pilot, which launched across 12 states on March 2, 2026, and runs through September 2026, is the first systematic examination of whether carriers can actually document that gap.

The Version Lock and How It Breaks

When a P&C carrier files a rate change using a machine learning pricing model, the submission captures a specific model version: its architecture, training dataset, feature set, and the parameter values that produce risk differentiation at the time of submission. State regulators review that artifact. What they approve is not a class of algorithms but that particular model instantiation. The filing is a contract over the model’s behavior at a moment in time.

That contract works cleanly for GLMs and traditional actuarial schedules, which change only when the carrier deliberately amends them and files a revision. ML models trained on rolling data windows do not behave the same way. The model approved in January produces different relativities in July when the training set shifts by six months. If the carrier treats the retrain as a routine maintenance event rather than a material change, no amended filing follows the model update. The carrier is now writing new business under rates derived from a model the regulator has never seen.

Consider the mechanics. A personal auto carrier files a gradient-boosted pricing model trained on 36 months of loss data ending December 2024. The regulator approves the filing in April 2025. The carrier retrains the model in July 2025 on data through June 2025, a period that includes a sharp inflation spike in replacement parts and a post-pandemic shift in driving patterns. The retrained model assigns materially different relativities to mileage bands and vehicle age segments. No filing amendment follows because the carrier’s internal classification of the retrain is "scheduled maintenance, within operating parameters." The price any given driver pays after July reflects that second model, not the one the regulator approved.

This is the rate filing gap. It is distinct from model underperformance, where the model degrades but remains at its approved version, and from intentional model change, where the carrier files an amendment. It is a version-control problem embedded in the rate-setting workflow, one that has become systematically more common as carriers migrated from annual model development cycles to continuous or quarterly retrain pipelines between 2021 and 2025.

In reviews of state rate filing submissions and associated model documentation across multiple personal lines carriers since 2023, the version-control gap between approved model specifications and deployed model versions has been a consistent finding, particularly at carriers that moved to continuous retrain pipelines without corresponding updates to their regulatory change management processes.

Anticipated Drift and Unanticipated Drift: A Compliance Distinction

Model drift in a pricing context takes two forms, and the regulatory exposure differs between them in ways that matter for the actuarial sign-off chain.

Anticipated drift is what happens when a model is retrained exactly as designed, on schedule, with no change to architecture, training procedure, or feature set. It is sometimes loosely labeled concept drift, although that term strictly refers to a change in the relationship between inputs and outcomes, which a scheduled retrain is meant to absorb. The carrier planned to retrain quarterly; it retrained quarterly; the relativities shifted modestly because the underlying loss experience shifted. Regulators in most prior-approval states have not historically treated a scheduled retrain as a material change requiring a new filing, provided the carrier documented that the change fell within the operating parameters of the approved model.

Unanticipated drift is the real compliance problem. This is covariate shift: the statistical distribution of the input variables changes in a way that alters model behavior without any retraining trigger. A carrier pricing homeowners using aerial imagery features may find that post-wildfire reconstruction patterns change the feature distribution in Western states, pushing predicted loss costs in ways the approved model’s developers did not anticipate. A personal auto model using telematics features may drift when driving behavior changes materially following an economic shock. The model’s outputs change, the carrier may not detect the shift until renewal season, and no regulatory notification was triggered because no retrain occurred.
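One way to catch an input distribution shift between retrains is a two-sample Kolmogorov-Smirnov comparison of a rating variable's training-period values against recent production values. The sketch below is a minimal standard-library illustration, not a method the article's sources prescribe; the function name and the mileage figures are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The gap can only change at observed values, so checking those is sufficient.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


# Hypothetical annual mileage (thousands of miles), for illustration only.
training_mileage = [6, 8, 9, 10, 11, 12, 12, 13, 14, 15]
recent_mileage = [9, 11, 12, 13, 14, 15, 16, 17, 18, 20]

shift = ks_statistic(training_mileage, recent_mileage)
print(f"K-S statistic on mileage: {shift:.2f}")
```

Running a check like this on a fixed schedule, independent of the retrain calendar, is what gives a carrier a detection signal when no retrain has occurred.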

The EIOPA AI Governance Opinion (EIOPA-BoS-25-360), published August 6, 2025, made the monitoring obligation explicit for European carriers: performance metrics must be used "to detect issues like model drift or data degradation," and the opinion endorses SHAP and LIME as tools for identifying which features are driving prediction shifts (EIOPA, August 2025). While EIOPA’s remit does not reach U.S. carriers, the opinion reflects the regulatory consensus now informing NAIC working group discussions. The U.S. pilot is asking the same underlying question through a different instrument.

What the NAIC Evaluation Tool Is Checking

The AI Systems Evaluation Tool pilot launched across 12 states on March 2, 2026: California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin (NAIC pilot materials, March 2026). The tool will be revised based on pilot feedback in September and October 2026, with formal adoption expected at the NAIC fall meeting in November 2026 (Fenwick, March 2026).

The tool is organized into four exhibits. Exhibit A asks carriers to inventory and quantify their AI use across business functions. Exhibit B assesses the governance framework: board accountability, named roles, written policies, and program scope. Exhibit C is where model-specific documentation lives. For high-risk AI systems, which explicitly include pricing, underwriting, claims decisions, and fraud detection, Exhibit C requires model design documentation, training data description, validation procedures, performance metrics, and bias testing results (NAIC AI Systems Evaluation Tool, Draft 4.0). Exhibit D examines data quality controls, representativeness, and proxy discrimination screening.

The model change management question sits primarily in Exhibits B and C. Regulators examining a carrier’s pricing AI under Exhibit C want to see not just the model as currently deployed but a version history demonstrating that the carrier tracked model changes, evaluated their materiality, and made documented decisions about regulatory notification. Foley & Lardner’s guidance to carriers receiving pilot requests put it directly: "version history matters. Regulators will want to see that governance evolved as your AI use did" (Foley & Lardner, March 2026).

That observation carries a forward-looking implication. A carrier that submits a pilot response in 2026 with thin version-control documentation creates a baseline on file with a state regulator. The next market conduct examination, potentially in 2027 or 2028, will start from that baseline. The weak 2026 response is not forgotten; it is filed. Swept AI’s analysis of the pilot similarly noted: "A carrier that produces a weak Exhibit B governance narrative in the pilot now has that narrative on file with a state regulator. That document does not become irrelevant when the pilot ends. It becomes a baseline" (Swept AI, March 2026).

At least 24 states plus the District of Columbia had adopted the NAIC Model Bulletin on AI governance as of early 2026 (Quarles, March 2026). The bulletin requires carriers to maintain written AI governance programs, and for pricing systems those programs must necessarily address model version control and change management. The evaluation tool is now the instrument regulators will use to test whether those programs exist in substance, not just on paper.

Monitoring in Practice: Metrics, Thresholds, and the Version Log

Carriers that want to demonstrate meaningful version control need three things: a defined set of performance metrics tracked over time, explicit thresholds that trigger action, and a version log that ties model behavior to specific training events.

The standard metric toolkit for ML pricing model monitoring centers on the Gini coefficient, which measures the model’s discrimination power across the predicted risk spectrum, and lift curves by decile. These track the model’s ability to rank risks correctly. The Kolmogorov-Smirnov statistic on predicted probability distributions provides a complementary read on overall distribution shape. Alongside these discriminatory power metrics, the Population Stability Index measures whether the input feature distribution has shifted from the training period. A PSI above 0.20 on a key rating variable generally warrants investigation; values above 0.25 are a conventional threshold for mandatory revalidation, though the more important discipline is to establish written thresholds in advance rather than rely on informal convention.
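The Population Stability Index can be sketched in a few lines. The version below assumes the rating variable has already been bucketed into the same bins for the training period and the current period; the bin counts are hypothetical, and the 0.20 and 0.25 cutoffs are simply the conventional values quoted above, passed as overridable defaults.

```python
import math


def psi(expected_counts, actual_counts, floor=1e-6):
    """Population Stability Index over pre-defined, aligned bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("bin counts must align")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the shares so an empty bin does not produce log(0).
        e_pct = max(e / e_total, floor)
        a_pct = max(a / a_total, floor)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


def psi_status(value, investigate=0.20, revalidate=0.25):
    """Map a PSI value to an action using written, pre-agreed thresholds."""
    if value > revalidate:
        return "revalidate"
    if value > investigate:
        return "investigate"
    return "stable"


# Hypothetical policy counts per mileage band: training period vs. current book.
training_bins = [200, 300, 300, 200]
current_bins = [100, 250, 350, 300]

value = psi(training_bins, current_bins)
print(f"PSI = {value:.3f} -> {psi_status(value)}")
```

Keeping the thresholds as explicit parameters makes the written policy and the monitoring code easy to reconcile during an examination.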

The table below maps common monitoring metrics to the type of drift they detect, a conventional alert threshold, and the typical action:

| Metric | Drift Type Detected | Conventional Alert Threshold | Typical Action |
| --- | --- | --- | --- |
| Gini coefficient change | Discrimination power loss | >3 percentage points vs. validation period | Revalidation; materiality review |
| Population Stability Index | Covariate / input distribution shift | PSI > 0.20 on any key rating variable | Investigate; PSI > 0.25 triggers revalidation |
| Prediction error by segment | Systematic over/underpricing in risk classes | Segment loss ratio deviation > 10 points | Segment-level materiality analysis; potential amended filing |
| Lift curve degradation | Rank-order performance | Top-decile lift drops > 15% vs. development set | Retraining trigger; version log entry |
| K-S statistic | Predicted probability distribution shape | K-S drop > 0.05 vs. baseline | Supplemental validation; escalation review |
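As an illustration of the first row, the sketch below computes a Gini coefficient from the Lorenz curve of actual losses ordered by predicted risk, then applies the 3-percentage-point alert. This is one common formulation and an assumption on our part; a carrier's validation standard may define Gini differently (for example, normalized against a perfect model). All data values are hypothetical.

```python
def gini(actual_losses, predicted):
    """Gini from the Lorenz curve of actual losses, ordered by predicted risk."""
    # Rank policies from lowest to highest predicted risk.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    total = sum(actual_losses)
    n = len(actual_losses)
    cumulative = 0.0
    area = 0.0  # trapezoid-rule area under the Lorenz curve
    for i in order:
        previous = cumulative
        cumulative += actual_losses[i] / total
        area += (previous + cumulative) / (2 * n)
    return 1 - 2 * area


def gini_alert(validation_gini, current_gini, max_drop_points=3.0):
    """True when Gini has fallen by more than the written tolerance, in points."""
    return (validation_gini - current_gini) * 100 > max_drop_points


# Hypothetical losses and predictions for four policies.
losses = [0.0, 0.0, 0.0, 100.0]
good_ranking = [1.0, 2.0, 3.0, 4.0]   # the only loss is ranked riskiest
bad_ranking = [4.0, 3.0, 2.0, 1.0]    # the only loss is ranked safest

print(gini(losses, good_ranking), gini(losses, bad_ranking))
print(gini_alert(validation_gini=0.42, current_gini=0.38))
```

A model that ranks the loss-producing policies as riskiest scores higher; a model that inverts the ranking scores negative.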

The version log is the compliance artifact the monitoring program produces. Each retrain event, whether scheduled or performance-triggered, should be logged with: the date, the trigger type, the training data period covered, performance metrics before and after the retrain, the relative impact across key rating variables, and a documented regulatory materiality determination. That determination should be made and signed by a named pricing actuary at the time of the retrain, not reconstructed from memory when an examiner requests it.
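The fields listed above translate naturally into a structured record. The sketch below is one hypothetical shape for a version log entry, not a format prescribed by the NAIC tool; every field name, the allowed trigger values, and the sample data are illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

ALLOWED_TRIGGERS = {"scheduled", "performance"}  # assumed trigger vocabulary


@dataclass(frozen=True)
class VersionLogEntry:
    model_id: str
    retrain_date: date
    trigger: str
    training_period: tuple      # (start date, end date)
    metrics_before: dict
    metrics_after: dict
    relativity_impact: dict     # key rating variable -> relative change
    materiality: str
    signed_by: str              # named pricing actuary
    signed_on: date
    regulatory_action: str

    def __post_init__(self):
        # Reject entries that could not serve as a compliance record.
        if self.trigger not in ALLOWED_TRIGGERS:
            raise ValueError(f"unknown trigger: {self.trigger}")
        if not self.signed_by:
            raise ValueError("a named actuary must sign the entry")

    def to_json(self):
        return json.dumps(asdict(self), default=str, sort_keys=True)


# Hypothetical entry for illustration only.
entry = VersionLogEntry(
    model_id="auto-gbm",
    retrain_date=date(2025, 7, 15),
    trigger="scheduled",
    training_period=(date(2022, 7, 1), date(2025, 6, 30)),
    metrics_before={"gini": 0.41},
    metrics_after={"gini": 0.43},
    relativity_impact={"vehicle_age_10_plus": 0.06},
    materiality="not material",
    signed_by="J. Doe, pricing actuary",
    signed_on=date(2025, 7, 15),
    regulatory_action="none; rationale documented",
)
print(entry.to_json())
```

Making the record immutable and validating it at creation reflects the point that the determination is captured at the time of the retrain rather than edited afterward.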

The Actuarial Sign-Off Chain

P&C pricing actuaries sit at the intersection of model governance and regulatory compliance. When a carrier deploys an ML pricing model subject to retrain, the actuary’s sign-off responsibility does not end at the original filing. The question of when a retrain constitutes a material model change requiring regulatory action is a judgment that requires actuarial input, and the failure to make that judgment explicitly is itself a governance gap.

The regulatory standard for materiality varies by state and, in most cases, has not been updated to specifically address ML model retraining. The default framework in most prior-approval states defines materiality in terms of aggregate rate level impact: a change that moves the aggregate rate level beyond a specified threshold, often 1% to 3%, requires a new filing. Applied to ML pricing models, this framework is inadequate. A retrain that leaves aggregate rates flat can still produce significant shifts in the relativities charged to identifiable risk segments. Those segment-level shifts may constitute material changes to the rate classification system even if the aggregate premium impact is zero.

The actuarial judgment required is: did this retrain change the model’s effective risk classification in a way that a reasonable regulator would expect to review? That judgment requires comparing pre- and post-retrain model outputs across the full rating universe, not just the aggregate premium impact. Carriers whose pricing teams make that comparison systematically, document the conclusion, and retain the analysis as a compliance artifact are in a defensible position. Carriers that default to "it was the scheduled retrain, so no filing required" without the supporting analysis are not.
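The gap between aggregate and segment-level impact is easy to show numerically. The sketch below compares hypothetical pre- and post-retrain average premiums for three mileage segments with equal exposure; the segment names, premiums, and exposures are invented for illustration.

```python
def segment_shifts(before, after, exposure):
    """Return the exposure-weighted aggregate change and each segment's change."""
    agg_before = sum(before[s] * exposure[s] for s in before)
    agg_after = sum(after[s] * exposure[s] for s in before)
    aggregate_change = agg_after / agg_before - 1
    shifts = {s: after[s] / before[s] - 1 for s in before}
    return aggregate_change, shifts


# Hypothetical average premiums by segment, before and after a retrain.
before = {"low_mileage": 800.0, "mid_mileage": 1000.0, "high_mileage": 1200.0}
after = {"low_mileage": 720.0, "mid_mileage": 1000.0, "high_mileage": 1280.0}
exposure = {"low_mileage": 1000, "mid_mileage": 1000, "high_mileage": 1000}

aggregate_change, shifts = segment_shifts(before, after, exposure)
print(f"aggregate change: {aggregate_change:+.1%}")
for segment, change in shifts.items():
    print(f"{segment}: {change:+.1%}")
```

In this example the aggregate rate level does not move at all, while the low-mileage segment falls by ten percent and the high-mileage segment rises by roughly seven. An aggregate-only materiality test would not flag the retrain.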

The sign-off chain should be explicit. Someone with actuarial responsibility must review the materiality analysis, sign the version log entry, and either approve continued production use of the retrained model or escalate to regulatory counsel for a filing determination. At carriers where the retrain is automated, the materiality review should be a gated step before the retrained model is promoted to production. Automation of the retrain does not automate away the actuarial judgment about its regulatory consequences.
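In an automated pipeline, the gate described above can be as simple as refusing promotion until a signed review exists. The sketch below is a hypothetical illustration; the `MaterialityReview` type, the decision values, and the `deploy` callback are assumptions, not part of any particular MLOps product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MaterialityReview:
    model_version: str
    reviewed_by: Optional[str] = None   # stays None until an actuary signs
    decision: Optional[str] = None      # "approve" or "escalate"


def can_promote(review: MaterialityReview) -> bool:
    """Promotion requires a signed review with an explicit 'approve' decision."""
    return review.reviewed_by is not None and review.decision == "approve"


def promote(review: MaterialityReview, deploy: Callable[[str], str]) -> str:
    """Deploy the retrained model only if the materiality gate is satisfied."""
    if not can_promote(review):
        raise PermissionError(
            f"{review.model_version}: materiality review not signed and approved"
        )
    return deploy(review.model_version)


# Hypothetical usage: the retrain job produced a new version, review still pending.
pending = MaterialityReview(model_version="auto-gbm-2025Q3")
signed = MaterialityReview("auto-gbm-2025Q3", "pricing actuary", "approve")

print(can_promote(pending), can_promote(signed))
print(promote(signed, deploy=lambda version: f"deployed {version}"))
```

An "escalate" decision leaves the gate closed, which routes the model to the filing determination rather than to production.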

Building the Compliance Artifact: A Governance Workflow

The governance workflow that closes the rate filing gap has four components, each of which should be documented in the carrier’s written AI governance program.

Monitoring cadence tied to retrain frequency. Establish the metric suite and monitoring schedule as part of the model’s production deployment documentation, not as an afterthought. For models on a quarterly retrain schedule, run the full metric set immediately before and after each retrain. For models with rolling continuous updates, sample metrics monthly. The monitoring obligation should be documented in the model’s governance plan so that it follows the model across team changes.

Written materiality thresholds. Define in advance, at the model level, what constitutes a material change: the aggregate rate level threshold, the segment-level relativity deviation threshold, and the feature distribution threshold. These should be specific enough that a pricing actuary can apply them without resorting to ad hoc discretion in ambiguous cases. Vague thresholds ("significant change") are defensible to no one. Specific thresholds, even if imperfect, demonstrate that the carrier thought about the problem before an examiner asked.
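Written thresholds are most useful when they can be applied mechanically. The sketch below encodes the three threshold types as a frozen configuration and reports which ones a retrain breached; the numeric values are placeholders chosen for illustration, not recommended or regulatory figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterialityThresholds:
    aggregate_rate_change: float      # absolute change in aggregate rate level
    segment_relativity_shift: float   # largest absolute segment-level change
    feature_psi: float                # largest PSI on a key rating variable


def breached(thresholds, aggregate_change, max_segment_shift, max_feature_psi):
    """Return the names of every threshold the retrain exceeded."""
    results = []
    if abs(aggregate_change) > thresholds.aggregate_rate_change:
        results.append("aggregate_rate_change")
    if abs(max_segment_shift) > thresholds.segment_relativity_shift:
        results.append("segment_relativity_shift")
    if max_feature_psi > thresholds.feature_psi:
        results.append("feature_psi")
    return results


# Placeholder values; each carrier would set and document its own.
policy = MaterialityThresholds(
    aggregate_rate_change=0.02,
    segment_relativity_shift=0.05,
    feature_psi=0.25,
)

flags = breached(policy, aggregate_change=0.0, max_segment_shift=-0.10,
                 max_feature_psi=0.13)
print(flags or "no thresholds breached")
```

Here the aggregate change is zero but the segment-level threshold is breached, so the retrain would go to materiality review under this hypothetical policy.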

The version log as a compliance record. The version log should contain every retrain event with date, trigger, training data period, before-and-after performance metrics, segment-level relativity impact summary, materiality determination, the actuary’s name and date of sign-off, and the regulatory action taken or explicitly not taken. This is the document that Exhibit C of the NAIC evaluation tool is designed to surface. Carriers that can produce a clean, timestamped version log for each pricing model are demonstrating exactly the governance maturity the pilot is designed to measure.

Escalation path with a defined timeline. Define what happens when drift exceeds tolerance: who reviews the expanded analysis, who makes the filing determination, and how quickly that determination is made relative to the model’s production use. A model that has been flagged for material drift should not continue pricing new business indefinitely while the materiality determination is pending, unless the carrier can demonstrate that the aggregate rate level impact remains within approved parameters. The escalation path should name roles, not individuals, so it survives personnel changes.
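An escalation path with a timeline can also be written down as data. The sketch below names roles rather than individuals and computes whether a flagged model may keep pricing new business while the determination is pending; the step names and day limits are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical steps (role-based) and calendar-day limits from the flag date.
ESCALATION_STEPS = [
    ("pricing_actuary_review", 5),
    ("regulatory_counsel_filing_determination", 15),
]


def due_dates(flagged_on):
    """Deadline for each escalation step, counted from the date drift was flagged."""
    return {
        step: flagged_on + timedelta(days=limit)
        for step, limit in ESCALATION_STEPS
    }


def may_price_while_pending(flagged_on, today, aggregate_within_approved):
    """While the determination is pending, allow continued pricing only inside the
    escalation window, or if aggregate impact is shown to be within approved
    parameters."""
    final_deadline = max(due_dates(flagged_on).values())
    return today <= final_deadline or aggregate_within_approved


flagged = date(2026, 3, 2)
print(due_dates(flagged))
print(may_price_while_pending(flagged, date(2026, 3, 10), False))
print(may_price_while_pending(flagged, date(2026, 4, 1), False))
```

The second check returns False because the window has closed without a demonstration that aggregate impact remains within approved parameters.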

Allstate’s June 2026 machine learning monitoring patent formalizes a comparable workflow on the claims side: the patent covers AI drift detection, alerting, and retraining controls that operate as a governance layer over production ML systems (Allstate, June 2026). The claims-side architecture is directly analogous to what the pricing side requires. Carriers that have built drift governance for claims AI already have the operational template; extending it to pricing models is an implementation step, not a design problem.

Why This Matters Beyond the Pilot

The NAIC evaluation tool pilot runs through September 2026 and targets formal adoption in November. By 2027, the tool is likely to become a standard component of market conduct examinations in states that have adopted the Model Bulletin. The 12-state pilot group includes California, Florida, Pennsylvania, and Virginia, collectively among the largest insurance markets in the country. When those states begin using the tool in routine market conduct exams, the carrier that cannot produce a version log for its pricing AI will face a structural compliance gap in its most consequential markets.

The industry’s AI concentration makes the stakes higher. State Farm, USAA, and Allstate together account for approximately 77% of AI patents filed by P&C insurers (Evident Insurance AI Patent Tracker, 2025). As the largest carriers industrialize ML pricing at scale, the version-control problem compounds: more models in production, more retrain cycles, more potential divergence between approved and deployed specifications. The governance infrastructure that large carriers build in 2026 will become the de facto standard against which regulators benchmark the industry.

For P&C pricing actuaries, the practical imperative is clear. The rate filing gap is not a theoretical compliance edge case. It is a structural feature of how ML pricing deployment currently operates at most carriers, and the NAIC’s 12-state pilot is the first systematic instrument designed to measure how well the industry has closed it. Carriers that can produce a clean version log, a documented materiality process, and a defined escalation path in 2026 will be demonstrating exactly the governance maturity regulators are looking for. Carriers that cannot will have that gap on file with a state regulator, as a baseline for the examination cycle that follows.

Sources

  1. NAIC, AI Systems Evaluation Tool Pilot: Pilot Project Summary (March 2026) — Pilot launch date, participating states, timeline, and evaluation dimensions.
  2. Fenwick, “NAIC Expands AI Systems Evaluation Tool Pilot Program to 12 States” (March 2026) — Pilot scope, carrier obligations, and adoption timeline for the November 2026 NAIC fall meeting.
  3. Foley & Lardner, “What To Do If You Receive an NAIC AI Systems Evaluation Tool Pilot Request” (2026) — Carrier response strategy and the version-history documentation baseline implication.
  4. Quarles, “Nearly Half of States Have Now Adopted NAIC Model Bulletin on Insurers’ Use of AI” (March 2026) — State adoption count and the written AI program requirement for model change management.
  5. EIOPA, Opinion on AI Governance and Risk Management (EIOPA-BoS-25-360, August 6, 2025) — Performance metrics for drift detection, SHAP/LIME endorsement, and actuarial function accountability for AI controls.
  6. Swept AI, “NAIC AI Systems Evaluation Tool: 12-State Pilot Is Live” (March 2026) — Pilot structure, exhibit breakdown, and the governance narrative baseline risk.
  7. Monitaur, “NAIC AI Systems Evaluation Tool Pilot: A Guide for Insurers” (2026) — Exhibit C documentation requirements for high-risk AI systems, including pricing models.
  8. InsNerds / Insurance Business, “AI Patent Trends by State Farm, USAA, and Allstate Signal Strategic Innovation for P/C Insurers” (2025) — The 77% patent concentration among three carriers and its implications for industry-wide AI governance standards.
  9. Allstate, Machine Learning Monitoring Patent (June 2026), as analyzed on actuary.info — Drift-detection, alerting, and retraining control architecture on the claims side as an operational template for pricing governance.