The NAIC launched its AI Systems Evaluation Tool pilot in March 2026 with 11 participating states: Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin. For pricing actuaries, the pilot's significance extends beyond governance documentation. It signals that regulators are moving from reviewing GLM coefficient tables, where each rating variable has an isolated and interpretable relativity, to interrogating the outputs of gradient-boosted machines and neural networks that resist variable-by-variable decomposition. The NAIC Model Review Manual, adopted by the Casualty Actuarial and Statistical (C) Task Force on November 4, 2025, codifies documentation standards for GLM-based rate filings. Complex models have no comparable regulatory template. This article walks through the methodology shift, the documentation pricing actuaries will need to produce, and the precedents already being set.

The Traditional GLM Filing and Its Regulatory Clarity

For decades, the GLM-based rate filing has been the standard language of regulatory review. A pricing actuary files a multiplicative relativity model where each rating variable produces an isolated coefficient. The regulator opens the filing, reviews the coefficient for driver age, the coefficient for territory, the coefficient for vehicle symbol, and can directly evaluate whether any single factor produces rates that are unfairly discriminatory. The transparency is structural: a change in one variable's coefficient changes the indicated rate by a calculable amount, holding all else constant.

This format gives regulators three things they need. First, interpretability: every rating factor has a human-readable weight. Second, auditability: the actuary can reproduce any policyholder's indicated rate by multiplying the base rate by the applicable relativities. Third, isolation: if a regulator has concerns about a specific variable, such as credit score or zip code, they can examine that coefficient without needing to understand the rest of the model. The NAIC Model Review Manual formalized these expectations into documentation standards that state regulators can apply uniformly in rate filing reviews (NAIC CASTF, November 2025).
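To make the auditability point concrete, the short sketch below reproduces an indicated rate from a base rate and a handful of relativities. The base rate, factor names, and relativity values are hypothetical, chosen only to illustrate the multiplication a regulator can verify by hand.

```python
# Hypothetical multiplicative GLM rating: indicated rate = base rate x relativities.
# All values below are illustrative, not drawn from any actual filing.
base_rate = 600.00

relativities = {
    "driver_age_25_29": 1.18,   # age-band relativity
    "territory_042":    1.35,   # territory relativity
    "vehicle_symbol_7": 0.92,   # vehicle-symbol relativity
}

indicated_rate = base_rate
for factor, relativity in relativities.items():
    indicated_rate *= relativity

# 600.00 * 1.18 * 1.35 * 0.92 = 879.34 -- exactly reproducible from the filed table.
print(f"Indicated annual rate: ${indicated_rate:,.2f}")
```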

Where Machine Learning Breaks the Filing Template

Gradient-boosted trees, random forests, and neural networks produce predictions through non-linear interactions among hundreds or thousands of features. There is no coefficient table. A GBM prediction for a given policyholder depends on the specific combination of feature values traversed through an ensemble of decision trees, and the contribution of any single feature shifts depending on the values of other features. This is not a documentation gap that better writing can fill; it is a structural property of the model class.

Consider a GBM used for homeowners loss cost prediction. The model may learn that roof age interacts with wind zone and distance to coast in ways that a multiplicative GLM cannot capture. That interaction produces more accurate predictions. But a regulator reviewing the filing cannot open a coefficient table and check whether "roof age" produces a 1.15 relativity across all territories, because the model does not work that way. The same feature contributes differently depending on the values of correlated features, and the contribution cannot be decomposed into a static table without losing the model's actual behavior.

This is the core tension the NAIC evaluation pilot is designed to address. Regulators need a way to audit models that outperform GLMs in predictive accuracy but resist the coefficient-level review that has been the foundation of rate filing supervision for decades. The pilot's Exhibit C, which requests detailed information on high-risk model development and testing, is where this tension plays out most directly for pricing actuaries submitting ML-based rate indications.

SHAP, Partial Dependence, and ALE: Replacing the Coefficient Table

The emerging regulatory documentation standard replaces the GLM coefficient table with three complementary explainability tools that pricing actuaries must now build into their model development workflow.

SHAP (SHapley Additive exPlanations) values decompose an individual prediction into the additive contribution of each feature. For a policyholder rated at $1,200 annual premium against a portfolio average of $900, SHAP values attribute the $300 difference across features: for example, $80 from credit score, $120 from territory, $60 from vehicle age, and $40 from claims history. The mathematical foundation is cooperative game theory (Lundberg and Lee, NeurIPS 2017), where each feature's contribution is calculated as its average marginal impact across all possible orderings of features. For regulatory purposes, SHAP values serve the same function as GLM coefficients: they tell the regulator how much each variable contributes to a specific rate indication. Unlike coefficients, SHAP values are observation-specific, which means a pricing actuary must present summary statistics, including mean absolute SHAP by feature and the distribution of SHAP values across the portfolio, to give regulators the portfolio-level view they need.
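As a rough illustration of what that portfolio-level package can look like, the sketch below computes mean absolute SHAP by feature and the distribution of SHAP values across a synthetic portfolio. The feature names, the synthetic data, and the pairing of a scikit-learn GBM with the shap package are assumptions for illustration, not a prescribed workflow.

```python
# Minimal sketch: portfolio-level SHAP summaries for a fitted GBM.
# Data, feature names, and the model are hypothetical placeholders.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
features = ["credit_score", "territory_factor", "vehicle_age", "claims_count"]
X = pd.DataFrame(rng.normal(size=(5000, len(features))), columns=features)
y = 0.3 * X["credit_score"] + 0.5 * X["territory_factor"] + rng.normal(size=5000)

model = GradientBoostingRegressor().fit(X, y)

# SHAP values: one additive contribution per feature per policy.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # shape: (n_policies, n_features)

# Portfolio-level summaries a regulator can review in place of a coefficient table.
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=features)
shap_quantiles = pd.DataFrame(shap_values, columns=features).quantile([0.05, 0.5, 0.95])

print(mean_abs_shap.sort_values(ascending=False))
print(shap_quantiles)
```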

Partial dependence plots (PDPs) show how the model's average prediction changes as a single feature varies across its range, holding all other features at their observed values. A PDP for driver age might show that predicted loss cost decreases from age 18 to 30, flattens from 30 to 60, and increases again after 65. This gives the regulator the directional information they would get from a GLM coefficient without requiring the model to impose that relationship parametrically. The limitation of PDPs is that they implicitly assume feature independence: if driver age correlates with driving experience, the averaging step evaluates the model at age-experience combinations that rarely or never occur in the data, and the resulting curve can misstate the true age effect.
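The averaging step is simple enough to compute directly, which makes the mechanics explicit. The sketch below does so for a hypothetical model and feature; in practice the same curve is available from library routines such as scikit-learn's inspection module, and all names and data here are illustrative assumptions.

```python
# Minimal sketch of a one-way partial dependence curve, computed by hand.
# Model, data, and feature names are hypothetical stand-ins, not from any filing.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(5000, 3)),
                 columns=["driver_age", "territory_factor", "vehicle_age"])
y = 0.4 * X["driver_age"] ** 2 + 0.5 * X["territory_factor"] + rng.normal(size=5000)
model = GradientBoostingRegressor().fit(X, y)

def partial_dependence_curve(model, X, feature, grid_points=20):
    """Average prediction as `feature` sweeps its observed range,
    holding all other features at their observed values."""
    grid = np.linspace(X[feature].min(), X[feature].max(), grid_points)
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value          # set the feature to this grid value for every policy
        curve.append(model.predict(X_mod).mean())
    return grid, np.array(curve)

grid, pd_curve = partial_dependence_curve(model, X, "driver_age")
```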

Accumulated local effects (ALE) plots address that correlation problem. Instead of averaging the model's output across all observations at a given feature value, ALE plots measure the local effect of small changes in a feature value within narrow intervals of the data, then accumulate those effects (Apley and Zhu, JRSS-B 2020). The result is an estimate of each feature's marginal effect that avoids the extrapolation problem and remains informative even when features are correlated. For rate filings where territory and demographic proxy variables correlate heavily, ALE plots provide a cleaner view than PDPs and are increasingly what sophisticated state regulators expect to see.
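The accumulation step is what distinguishes ALE from a PDP, so the sketch below implements a first-order ALE curve directly on deliberately correlated synthetic features. In production work a maintained implementation would normally be used; the data, model, and bin count here are illustrative assumptions.

```python
# Minimal sketch of a first-order accumulated local effects (ALE) curve:
# average local prediction differences within quantile bins, then accumulate.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
# Deliberately correlated features: the case where PDPs mislead.
age = rng.normal(size=5000)
experience = 0.8 * age + rng.normal(scale=0.4, size=5000)
X = pd.DataFrame({"driver_age": age, "driving_experience": experience})
y = 0.6 * experience + 0.2 * age + rng.normal(size=5000)
model = GradientBoostingRegressor().fit(X, y)

def ale_curve(model, X, feature, bins=20):
    """First-order ALE for one feature."""
    edges = np.unique(np.quantile(X[feature], np.linspace(0, 1, bins + 1)))
    idx = np.digitize(X[feature], edges[1:-1])      # assign each row to a bin
    local_effects, counts = [], []
    for k in range(len(edges) - 1):
        in_bin = X[idx == k]
        if len(in_bin) == 0:
            local_effects.append(0.0); counts.append(0); continue
        lo, hi = in_bin.copy(), in_bin.copy()
        lo[feature], hi[feature] = edges[k], edges[k + 1]
        # Local effect: prediction change across the bin, for rows observed in the bin.
        local_effects.append((model.predict(hi) - model.predict(lo)).mean())
        counts.append(len(in_bin))
    ale = np.cumsum(local_effects)
    ale -= np.average(ale, weights=np.maximum(counts, 1))   # center the curve
    return edges[1:], ale

grid, ale = ale_curve(model, X, "driver_age")
```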

Together, these three tools give regulators a documentation package that is functionally equivalent to a GLM coefficient table in information content, though not in format. The shift requires pricing actuaries to build explainability pipelines into their model development workflow as a core deliverable of the rate indication process, not as an afterthought.

Output-Based Disparate Impact Testing

The second methodological shift in the evaluation pilot is the move from variable-level review to output-based disparate impact testing. In a GLM filing, a regulator concerned about race as a protected class would examine whether any rating variable serves as a proxy for race. With a machine learning model, proxy analysis at the variable level is insufficient because correlations compound through non-linear interactions in ways that no single variable review can detect.

The methodology regulators are now requiring works differently. The pricing actuary runs the model's rate indications across the full portfolio, then overlays geocoded demographic distributions, typically census-tract-level data on race, ethnicity, and income, to test whether the model produces statistically significant rate differentials across protected classes. This is not testing whether any variable is a proxy; it is testing whether the model's combined output, regardless of which variables produce it, results in disparate impact.

The statistical test varies by state. Some regulators are adapting adverse impact ratios, with the four-fifths rule as the threshold: if the adverse impact ratio for a protected group falls below 0.8 relative to the reference group, the model is flagged for further review. Others are using regression-based approaches that control for legitimate actuarial variables and test whether a protected-class variable adds explanatory power to the residuals. Colorado's Division of Insurance has been the most prescriptive, requiring carriers to demonstrate that any ML model producing rate differentials exceeding defined thresholds has been tested using methods the Division has pre-approved.
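A minimal sketch of both approaches is below, assuming rate indications have already been joined to census-tract demographic shares. The column names, the median-split definition of the comparison groups, and the 0.8 flag are illustrative assumptions; the actual test must follow the reviewing state's prescribed methodology.

```python
# Minimal sketch of output-based disparate impact testing: an adverse impact
# ratio against the four-fifths threshold, plus a residual regression check.
# All data, column names, and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
portfolio = pd.DataFrame({
    "indicated_rate": rng.lognormal(mean=6.8, sigma=0.3, size=10_000),
    "pct_protected_class": rng.uniform(0, 1, size=10_000),  # tract-level share
    "roof_age": rng.integers(0, 40, size=10_000),            # legitimate rating variable
})

# 1. Adverse impact ratio: compare average rates in tracts above vs. below the
#    median protected-class share; flag if the reference-to-protected ratio
#    falls below 0.8 (i.e., protected-heavy tracts pay materially more).
high = portfolio["pct_protected_class"] > portfolio["pct_protected_class"].median()
air = (portfolio.loc[~high, "indicated_rate"].mean()
       / portfolio.loc[high, "indicated_rate"].mean())
flagged = air < 0.8

# 2. Residual test: does protected-class share explain rate variation after
#    controlling for legitimate actuarial variables?
controls = sm.add_constant(portfolio[["roof_age"]])
residuals = (portfolio["indicated_rate"]
             - sm.OLS(portfolio["indicated_rate"], controls).fit().fittedvalues)
residual_fit = sm.OLS(residuals, sm.add_constant(portfolio[["pct_protected_class"]])).fit()

print(f"AIR = {air:.2f}, flagged = {flagged}")
print(residual_fit.summary().tables[1])
```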

Colorado's GBM Rejection Sets the Precedent

Colorado has already established a precedent that will shape how pricing actuaries approach ML-based filings nationally. Under SB 21-169, the state's statute restricting insurers' use of external consumer data, algorithms, and predictive models that result in unfair discrimination, the Division of Insurance rejected a gradient-boosted tree model for homeowners pricing despite the model demonstrating superior predictive performance compared to the carrier's existing GLM. The basis for rejection was not that the model performed poorly or used prohibited variables; it was that the carrier's disparate impact analysis was insufficient to demonstrate the model did not produce unfairly discriminatory outcomes.

The filing included SHAP-based explainability documentation and standard model validation metrics (Gini coefficient, lift charts, hold-out sample performance). What it lacked was a comprehensive analysis of the model's rate output across protected class distributions using the Division's specified testing methodology. The rejection establishes a clear principle: predictive accuracy alone does not ensure regulatory approval. A model that predicts better but cannot demonstrate it predicts fairly, using the regulator's preferred methodology, will be rejected in favor of a less accurate model that can.

For pricing actuaries in the remaining pilot states, the Colorado precedent is instructive even where state law does not yet mirror Colorado's specific requirements. The evaluation tool's Exhibit C asks about bias testing methodology, and regulators who have seen Colorado's approach will increasingly use it as a reference point when reviewing filings in their own jurisdictions.

Model Versioning and the Materiality Threshold

A recurring question in the pilot is when model changes trigger a new rate filing. Traditional GLMs change infrequently because the model structure is relatively stable; updated coefficients from new experience data may require a filing, but the model architecture itself rarely changes. Machine learning models, by contrast, are often retrained on rolling data windows, and retraining can shift the model's predictions across the portfolio even without any change to the feature set or hyperparameters.

The NAIC guidance on model versioning, referenced in the Model Review Manual and elaborated in the evaluation tool's Exhibit C, expects carriers to maintain a full audit trail of model versions, including training data snapshots, hyperparameter configurations, and validation results. The unresolved question is where the materiality threshold sits: how much can a model's output distribution shift before the change triggers a filing obligation?

Industry practice is converging on two approaches. Some carriers define materiality by the maximum rate change any individual policyholder experiences from model retraining, with 5% as a common threshold for any single risk. Others define it by portfolio-level distribution shift, using metrics like population stability index (PSI) to measure whether the new model's output distribution has materially departed from the filed version. The NAIC has not prescribed a specific threshold, and the pilot states are collecting data on carrier practices to inform future guidance.
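For the portfolio-level approach, a PSI calculation is short enough to sketch. The bin count and the scores below are illustrative assumptions, and any decision threshold would need to be set by the carrier and documented in the filing; the NAIC has not prescribed one.

```python
# Minimal sketch of a population stability index (PSI) check between the filed
# model's output distribution and a retrained candidate. Synthetic data only.
import numpy as np

def population_stability_index(filed_scores, retrained_scores, bins=10):
    """PSI = sum over bins of (p_new - p_old) * ln(p_new / p_old),
    with bins defined by quantiles of the filed model's output."""
    edges = np.quantile(filed_scores, np.linspace(0, 1, bins + 1))
    p_old, _ = np.histogram(np.clip(filed_scores, edges[0], edges[-1]), bins=edges)
    p_new, _ = np.histogram(np.clip(retrained_scores, edges[0], edges[-1]), bins=edges)
    p_old = np.clip(p_old / p_old.sum(), 1e-6, None)   # avoid log(0)
    p_new = np.clip(p_new / p_new.sum(), 1e-6, None)
    return float(np.sum((p_new - p_old) * np.log(p_new / p_old)))

# Synthetic rate indications: a small drift after retraining.
rng = np.random.default_rng(4)
filed = rng.lognormal(6.80, 0.30, 50_000)
retrained = rng.lognormal(6.82, 0.31, 50_000)
psi = population_stability_index(filed, retrained)
print(f"PSI = {psi:.4f}")   # compare against the carrier's documented threshold
```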

Pricing actuaries should document their materiality threshold methodology explicitly in the rate filing rather than waiting for the regulator to ask. A clearly articulated threshold, supported by the statistical test used to measure it, provides a defensible basis for the decision not to file when routine retraining produces immaterial changes.

Third-Party Vendor Models Under the Microscope

The NAIC Third-Party Data and Models Working Group has narrowed its draft oversight framework to focus specifically on vendors of data and models used in pricing and underwriting functions. This scoping decision means vendors supplying claims triage models or fraud detection algorithms are, for now, outside the framework's primary focus, while vendors supplying territory rating data, credit-based scoring models, or loss cost prediction algorithms are squarely within scope.

The unresolved question is whether vendor registration will be compulsory or voluntary. A compulsory registry would require vendors to file documentation with the NAIC or individual states before carriers could use their models in rate filings. A voluntary framework would place the documentation burden entirely on the carrier, requiring the pricing actuary to produce the same SHAP, PDP, and fairness testing documentation for a vendor model as for an internally developed one (Sidley Austin, April 2026).

For pricing actuaries relying on third-party models in their rate indications, the practical implication is clear regardless of which framework emerges: the carrier is responsible for the regulatory documentation, and vendor contracts must include rights to access the model's feature importance outputs, training data composition, and validation results. Contracts executed before 2025 rarely include these provisions, and renegotiation should be a near-term priority for actuarial departments using vendor models in filed rates.

What Pricing Actuaries Should Prepare Now

The 11-state pilot runs through September 2026, with revised guidance expected at the NAIC Fall National Meeting. Pricing actuaries should use this window to prepare for standardized requirements.

Build explainability into the model pipeline. SHAP value computation, partial dependence plots, and ALE plots should be standard outputs of the model development process, generated automatically alongside the model's predictions. Retrofitting explainability onto a model after it has been filed is both technically harder and less defensible than producing it during development.

Implement automated fairness testing. The disparate impact analysis should run as part of the model validation pipeline, not as a separate exercise performed when a regulator asks. Testing should use geocoded demographic data at the census-tract level and apply both adverse impact ratios and regression-based residual analysis.

Establish and document the materiality threshold. Define the metric (PSI, maximum individual rate change, or both), set the threshold, and build monitoring that flags when retraining breaches it. Include this documentation in every rate filing that uses an ML model.

Audit third-party vendor contracts. Where contracts do not include access to feature importance, training data composition, and validation results, begin renegotiation before the NAIC framework is finalized.

Maintain a model version registry. The evaluation tool's Exhibit C expects carriers to demonstrate how each model version was validated and what changed between versions. This registry should include training data date ranges, hyperparameter settings, validation metrics, and fairness testing results for each version.
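A minimal sketch of what one registry record might capture is below. The field names and the JSON serialization are assumptions for illustration, not an NAIC-prescribed schema; the point is that each version carries its training window, configuration, validation, and fairness results together.

```python
# Minimal sketch of a model version registry record. Schema and values are
# hypothetical; any real registry should follow the carrier's governance standards.
from dataclasses import dataclass, asdict
import json

@dataclass
class ModelVersionRecord:
    version: str                  # e.g., "ho_loss_cost_gbm_v2026.03"
    training_data_start: str      # ISO dates bounding the training window
    training_data_end: str
    hyperparameters: dict         # full hyperparameter configuration
    validation_metrics: dict      # Gini, lift, hold-out performance, etc.
    fairness_results: dict        # AIR, residual-test results, methodology reference
    psi_vs_filed_version: float   # distribution shift against the filed model
    filed: bool                   # whether this version triggered a rate filing

record = ModelVersionRecord(
    version="ho_loss_cost_gbm_v2026.03",
    training_data_start="2021-01-01",
    training_data_end="2025-12-31",
    hyperparameters={"n_estimators": 600, "max_depth": 4, "learning_rate": 0.05},
    validation_metrics={"gini": 0.41, "holdout_rmse": 312.5},
    fairness_results={"adverse_impact_ratio": 0.93, "residual_test_p": 0.21},
    psi_vs_filed_version=0.04,
    filed=False,
)
print(json.dumps(asdict(record), indent=2))
```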

The shift from GLM coefficient review to ML output testing is not a future possibility; it is happening now across 11 states. Pricing actuaries who build the documentation infrastructure during this pilot window will file with confidence when standardized requirements follow.

Sources

  1. Sidley Austin: NAIC Regulatory Update, Spring 2026 National Meeting (April 14, 2026)
  2. NAIC Casualty Actuarial and Statistical (C) Task Force: Model Review Manual (November 4, 2025)
  3. Swept AI: Predictive Model Regulation Is Coming for Insurance Rate Filings
  4. NAIC Model Bulletin on the Use of Artificial Intelligence Systems by Insurers (December 2023)
  5. NAIC Big Data and Artificial Intelligence (H) Working Group Materials
  6. NAIC Regulatory Review of Predictive Models White Paper
  7. Colorado Division of Insurance: SB 21-169 Implementation Guidance
  8. Lundberg, S. M. and Lee, S. I. "A Unified Approach to Interpreting Model Predictions." NeurIPS (2017).
  9. Apley, D. W. and Zhu, J. "Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models." Journal of the Royal Statistical Society: Series B 82(4), 1059-1086 (2020).
