What is synthetic data in actuarial ratemaking?

Synthetic data in actuarial ratemaking is artificially generated data that preserves the statistical properties of a real insurance dataset, including univariate distributions, bivariate correlations, and multivariate relationships between rating variables, while containing no actual policyholder personally identifiable information. The 2025 CAS Ratemaking Prize-winning paper demonstrated that kernel density estimation can generate synthetic datasets with high fidelity for pricing model development.

Has any state insurance department accepted a rate filing based on synthetic data?

As of June 2026, no state department of insurance has formally accepted a rate filing where synthetic data was cited as a primary modeling input. Rate filings submitted through SERFF are evaluated against adequacy, non-excessiveness, and non-discrimination standards, and regulators have not yet issued guidance on how synthetic data inputs should be documented or validated in the filing process. This regulatory acceptance gap is the primary bottleneck for production deployment.

How does ASOP No. 56 apply to synthetic data in actuarial models?

ASOP No. 56 (Modeling) requires actuaries to document data used as inputs to models and to assess whether data is consistent with the intended purpose. When synthetic data replaces or supplements real policyholder data, the actuary must document the generation methodology, validate that statistical properties are preserved, disclose the substitution to intended users, and assess whether the synthetic data introduces bias or distortion that affects model output. The standard does not prohibit synthetic data but requires the actuary to exercise professional judgment about its suitability.

Synthetic Data Wins CAS Ratemaking Prize: Privacy-Safe Pricing With Kernel Density Estimation

The CAS E-Forum published a prize-winning paper showing that kernel density estimation can generate synthetic insurance datasets that preserve statistical pricing relationships while removing policyholder PII. With 79% of carriers open to synthetic data and state privacy laws tightening, this is the first rigorous actuarial framework for deploying AI models without exposing real data. To date, however, no state DOI has formally accepted a synthetic-data-based rate filing.

From reviewing dozens of rate filing exhibits across multiple states, we have seen zero instances where a carrier cited synthetic data as a modeling input. That absence is exactly what makes the CAS's decision to award its 2025 Ratemaking Prize to a synthetic data paper a potential turning point for how actuaries handle the tension between data utility and policyholder privacy.

Noa Zamstein, a senior data science researcher at Earnix, received the $15,000 CAS 2025 Ratemaking Prize for her paper "Enhancing Actuarial Ratemaking with Synthetic Data for Privacy Preservation," published in CAS E-Forum Quarter 1, 2025. She presented the work at the Ratemaking, Product and Modeling (RPM) Seminar in Orlando, Florida, on March 10, 2025. The paper demonstrates that a kernel density estimation (KDE) approach can produce synthetic insurance datasets that maintain the univariate distributions, bivariate correlations, and multivariate relationships needed for pricing model development, while containing no actual policyholder personally identifiable information.

The timing matters. SAS survey data cited by Roots Automation found that 79% of carriers are open to using, or are already employing, synthetic data to resolve privacy and data-quality challenges. The CCPA's automated decision-making provisions took full effect in 2026, requiring risk assessments for processing activities that involve significant decisions about consumers, including insurance pricing. Twenty-four states have adopted the NAIC Model Bulletin on AI use by insurers, with its data governance requirements covering data quality, integrity, bias, and privacy of non-public information. Synthetic data sits at the intersection of all three pressures.

$15K: CAS 2025 Ratemaking Prize award for Zamstein's synthetic data paper

79%: Carriers open to synthetic data for privacy and data-quality challenges (SAS survey)

24: States that adopted the NAIC Model Bulletin on AI use by insurers as of March 2025

The KDE Methodology: How It Works for Actuarial Data

Kernel density estimation is a non-parametric statistical technique for estimating the probability density function of a random variable from observed data. In the ratemaking context, KDE treats each data point in the original insurance dataset as the center of a small probability distribution (the "kernel," typically Gaussian). The sum of all these kernels produces a smooth estimate of the underlying distribution from which the data was drawn. Synthetic records are then sampled from this estimated distribution rather than copied from the original data.
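In symbols, the univariate estimator described above can be written as follows. The notation is ours rather than the paper's: x_1, ..., x_n are the observed values, h > 0 is the bandwidth (the width of each kernel), and K is the Gaussian kernel.

```latex
\hat{f}_h(x) \;=\; \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
\qquad
K(u) \;=\; \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}

% Sampling one synthetic value from \hat{f}_h:
x^{*} \;=\; x_I + h\,\varepsilon,
\qquad
I \sim \mathrm{Uniform}\{1,\dots,n\},
\quad
\varepsilon \sim \mathcal{N}(0,1)
```

The second line is why sampling from a Gaussian KDE is cheap: pick an observed record at random, then perturb it with Gaussian noise scaled by the bandwidth. A larger h moves synthetic values further from any real record at the cost of a smoother, less faithful density.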

Zamstein's approach extends univariate KDE to handle the multivariate structure inherent in insurance data. Ratemaking datasets contain correlated rating variables: driver age correlates with vehicle type, which correlates with territory, which correlates with loss frequency. A naive column-by-column synthesis would destroy these correlations and render the synthetic data useless for building generalized linear models (GLMs) or gradient boosting models. The paper evaluates fidelity across three levels of statistical complexity: univariate distribution matching, bivariate relationship preservation, and multivariate structure retention.
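To make the contrast with column-by-column synthesis concrete, here is a minimal sketch of multivariate Gaussian KDE sampling next to a naive independent resampler. It is not the paper's implementation: the toy age and mileage data, the function names, and the choice of Scott's rule for the bandwidth are all our assumptions, and the sketch handles only continuous variables.

```python
import numpy as np


def sample_gaussian_kde(data, n_samples, rng, bandwidth_factor=None):
    """Draw synthetic rows from a multivariate Gaussian KDE fitted to `data`.

    Each synthetic row is a randomly chosen original row plus Gaussian noise.
    The noise covariance is the sample covariance scaled by bandwidth_factor**2,
    so the kernel has the same orientation as the data cloud.
    """
    n, d = data.shape
    if bandwidth_factor is None:
        bandwidth_factor = n ** (-1.0 / (d + 4))  # Scott's rule of thumb (assumed choice)
    kernel_cov = np.cov(data, rowvar=False) * bandwidth_factor ** 2
    centers = data[rng.integers(0, n, size=n_samples)]
    noise = rng.multivariate_normal(np.zeros(d), kernel_cov, size=n_samples)
    return centers + noise


def sample_columns_independently(data, n_samples, rng):
    """Naive baseline: resample every column on its own, which breaks dependence."""
    n, d = data.shape
    columns = [data[rng.integers(0, n, size=n_samples), j] for j in range(d)]
    return np.column_stack(columns)


def pair_corr(table):
    """Pearson correlation between the first two columns."""
    return np.corrcoef(table, rowvar=False)[0, 1]


rng = np.random.default_rng(0)

# Hypothetical toy portfolio: older drivers tend to drive fewer miles.
age = rng.normal(45, 12, size=5000)
mileage = 15000 - 100 * (age - 45) + rng.normal(0, 2000, size=5000)
original = np.column_stack([age, mileage])

kde_synth = sample_gaussian_kde(original, 5000, rng)
naive_synth = sample_columns_independently(original, 5000, rng)

corr_original = pair_corr(original)
corr_kde = pair_corr(kde_synth)
corr_naive = pair_corr(naive_synth)

print(f"original: {corr_original:.3f}  KDE: {corr_kde:.3f}  naive: {corr_naive:.3f}")
```

On this toy data the KDE sample should reproduce the negative age and mileage correlation, while the naive sample should show a correlation near zero. Real rating data also contains categorical variables such as territory, which need additional handling that this sketch does not attempt.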

The key finding is that the KDE approach achieves high fidelity across all three levels on actuarial datasets. The synthetic data preserves the distributional shape of individual rating variables, maintains the pairwise relationships between covariates, and supports comparable GLM coefficient estimates when models are trained on synthetic versus original data. This last point is critical for practical adoption: if a pricing actuary can fit a GLM on synthetic data and obtain coefficient estimates close to what the original data would produce, the synthetic dataset is functionally equivalent for model development purposes.

Earnix, Zamstein's employer, has separately tested the approach on an auto insurance renewal database of approximately 40,000 observations. Their testing showed that gradient boosting models trained on synthetic data produced comparable feature importance rankings to models trained on original data, and that price elasticity distributions proved sufficiently similar between synthetic and original datasets to support scenario testing. This corroboration across different datasets and modeling techniques strengthens the case that KDE-based synthetic data is not a dataset-specific artifact but a generalizable methodology.

KDE vs. GANs vs. VAEs vs. Differential Privacy: A Practitioner's Comparison

The CAS prize paper does not exist in isolation. A growing body of research compares synthetic data generation methods for insurance applications, and actuaries evaluating these tools need to understand the tradeoffs. A September 2025 arXiv paper by Havrylenko, Käärik, and Tuttar, titled "Synthetic data for ratemaking: imputation-based methods vs adversarial networks and autoencoders," provides the most direct head-to-head comparison available.

That paper evaluates three approaches on an open-source insurance dataset: multivariate imputation by chained equations (MICE, an imputation-based approach conceptually similar to KDE in its statistical foundations), conditional tabular generative adversarial networks (CTGANs), and variational autoencoders (VAEs). Their findings align with and extend Zamstein's: the imputation-based approach produced high-fidelity synthetic tabular data while requiring lower implementation complexity than the deep learning alternatives.

Method	Fidelity	Privacy Guarantee	Complexity	Insurance Fit
Kernel Density Estimation (KDE)	High across univariate, bivariate, and multivariate structure	Empirical (no formal epsilon bound)	Low to moderate; standard statistical libraries	Strong for tabular rating data; CAS prize-validated
MICE (Imputation-Based)	High; preserves GLM coefficient comparability	Empirical	Low to moderate; well-understood in actuarial practice	Strong for tabular data with mixed variable types
Conditional Tabular GAN (CTGAN)	High for complex non-linear relationships	Empirical; DP-CTGAN variant adds formal bounds	High; requires substantial data and extensive tuning	Better for large datasets with non-linear structure
Variational Autoencoder (VAE)	Moderate to high; encodes to latent space	Empirical; DP-TVAE variant adds formal bounds	High; challenges applying text-based architectures to tabular data	Emerging; less validated on insurance-specific data
Differential Privacy (noise injection)	Reduced; noise degrades statistical relationships	Formal mathematical guarantee (epsilon-delta)	Moderate; parameter tuning affects utility tradeoff	Treats features independently; can miss dependencies

The practical takeaway for pricing actuaries is that KDE and MICE offer the best combination of fidelity and accessibility for typical ratemaking datasets. These methods produce tabular synthetic data that preserves the cross-variable correlations essential for GLM-based pricing without requiring deep learning infrastructure or specialized ML engineering staff. GANs and VAEs may outperform on very large datasets with complex non-linear interactions, but the implementation burden and training instability (mode collapse in GANs, posterior collapse in VAEs) make them harder to justify for standard ratemaking workflows.

Differential privacy provides the strongest formal privacy guarantee: a mathematical bound on how much any single record can influence the output. But this guarantee comes at a cost. Noise injected into each feature independently can break the correlations between rating variables. For a ratemaking dataset where the relationship between age, territory, and loss frequency is the entire point of the analysis, destroying those correlations defeats the purpose. Privacy-utility tradeoff research consistently finds that achieving strong formal privacy bounds (small epsilon values) degrades downstream model accuracy, while weaker bounds (large epsilon) provide less meaningful privacy protection.
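The epsilon tradeoff can be illustrated with the Laplace mechanism, the textbook way to release a single statistic under epsilon-differential privacy. This is a sketch of one noisy query, not a synthetic data generator, and the portfolio below is hypothetical. The noise scale is sensitivity divided by epsilon, so a smaller epsilon means proportionally more noise.

```python
import numpy as np


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)


rng = np.random.default_rng(1)

# Hypothetical portfolio: 1,000 policies with a 0/1 claim indicator.
claims = rng.binomial(1, 0.08, size=1000)
true_freq = claims.mean()

# Replacing one 0/1 record changes the mean by at most 1/n.
sensitivity = 1.0 / claims.size


def mean_abs_error(epsilon, n_trials=2000):
    """Average absolute error of the noisy claim frequency over repeated releases."""
    releases = np.array(
        [laplace_mechanism(true_freq, sensitivity, epsilon, rng) for _ in range(n_trials)]
    )
    return np.abs(releases - true_freq).mean()


errors = {eps: mean_abs_error(eps) for eps in (0.01, 0.1, 1.0)}
for eps, err in errors.items():
    print(f"epsilon={eps:<5} mean abs error in claim frequency: {err:.4f}")
```

With a claim frequency of roughly 8%, the expected error at epsilon = 0.01 is about 0.1, which is larger than the quantity being measured. That is the privacy-utility tension in miniature.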

The Privacy-Utility Tradeoff: What Fidelity Metrics Actuaries Should Track

Synthetic data is only useful if it is statistically faithful to the original. If the synthetic generation process distorts the data enough to change pricing model outputs, the actuary has traded privacy risk for model risk, which is not an improvement. Zamstein's paper and the broader literature identify several fidelity dimensions that actuaries should evaluate before relying on synthetic data for any modeling purpose.

Marginal distribution matching. Each rating variable in the synthetic dataset should have a distribution that closely matches the original. For continuous variables like age or annual mileage, this can be assessed using the Kolmogorov-Smirnov test or by comparing kernel density plots. For categorical variables like territory or vehicle class, chi-squared tests or frequency table comparison provide the relevant metrics.
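A minimal sketch of both checks, written in plain NumPy so the statistics are explicit. The function names, the gamma-distributed ages, and the three territories are illustrative assumptions. The chi-squared helper also assumes every category appears in the original data, so that no expected count is zero.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.abs(cdf_a - cdf_b).max()


def chi_squared_statistic(original_labels, synthetic_labels):
    """Pearson chi-squared: synthetic counts vs. counts expected from original shares."""
    categories = np.unique(np.concatenate([original_labels, synthetic_labels]))
    orig_counts = np.array([(original_labels == c).sum() for c in categories])
    synth_counts = np.array([(synthetic_labels == c).sum() for c in categories])
    expected = orig_counts / orig_counts.sum() * synth_counts.sum()
    return ((synth_counts - expected) ** 2 / expected).sum()


rng = np.random.default_rng(2)
n = 4000

# Continuous variable: a faithful synthetic sample vs. one with a distorted scale.
orig_age = rng.gamma(shape=9, scale=5, size=n)
faithful_age = rng.gamma(shape=9, scale=5, size=n)
distorted_age = rng.gamma(shape=9, scale=6, size=n)
ks_faithful = ks_statistic(orig_age, faithful_age)
ks_distorted = ks_statistic(orig_age, distorted_age)

# Categorical variable: territory shares preserved vs. shifted.
territories = np.array(["A", "B", "C"])
orig_terr = rng.choice(territories, size=n, p=[0.5, 0.3, 0.2])
faithful_terr = rng.choice(territories, size=n, p=[0.5, 0.3, 0.2])
distorted_terr = rng.choice(territories, size=n, p=[0.3, 0.3, 0.4])
chi2_faithful = chi_squared_statistic(orig_terr, faithful_terr)
chi2_distorted = chi_squared_statistic(orig_terr, distorted_terr)

print(f"KS:   faithful={ks_faithful:.3f}  distorted={ks_distorted:.3f}")
print(f"chi2: faithful={chi2_faithful:.1f}  distorted={chi2_distorted:.1f}")
```

In practice a library routine that also returns p-values would be the usual choice. The raw statistics are shown here because with large portfolios even trivial differences become statistically significant, so the size of the gap is often more informative than the p-value.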

Bivariate relationship preservation. The correlation structure between pairs of variables must be maintained. If age and claim frequency are correlated in the original data, the synthetic data must preserve that correlation at comparable magnitude. Correlation matrices, cross-tabulations, and pairwise scatter plots serve as validation tools.
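One simple way to summarize this check is the largest absolute gap between the two correlation matrices. The sketch below uses hypothetical standardized variables and compares a faithful sample against one whose columns were shuffled independently. Pearson correlation only captures linear dependence between numeric variables, so it complements rather than replaces the cross-tabulations and scatter plots.

```python
import numpy as np


def max_correlation_gap(original, synthetic):
    """Largest absolute difference between the two Pearson correlation matrices."""
    corr_orig = np.corrcoef(original, rowvar=False)
    corr_synth = np.corrcoef(synthetic, rowvar=False)
    return np.abs(corr_orig - corr_synth).max()


rng = np.random.default_rng(3)

# Hypothetical correlation structure for three standardized rating variables.
target_corr = np.array([
    [1.0, 0.6, -0.4],
    [0.6, 1.0, -0.2],
    [-0.4, -0.2, 1.0],
])
original = rng.multivariate_normal(np.zeros(3), target_corr, size=5000)

# Stand-in for a faithful generator: a fresh draw from the same distribution.
faithful = rng.multivariate_normal(np.zeros(3), target_corr, size=5000)

# Stand-in for column-by-column synthesis: each column permuted on its own.
shuffled = np.column_stack(
    [rng.permutation(original[:, j]) for j in range(original.shape[1])]
)

gap_faithful = max_correlation_gap(original, faithful)
gap_shuffled = max_correlation_gap(original, shuffled)
print(f"max correlation gap: faithful={gap_faithful:.3f}  shuffled={gap_shuffled:.3f}")
```

The shuffled sample has exactly the same marginal distributions as the original, so it would pass every univariate test while failing this one badly.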

Multivariate GLM equivalence. The most demanding test: does a GLM trained on synthetic data produce coefficient estimates, standard errors, and predicted values comparable to the same model specification trained on original data? This is the metric that matters most for ratemaking because it directly measures whether the synthetic data supports the same pricing conclusions as the real data.
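A sketch of how that comparison might be set up for a claim frequency model. The Poisson GLM is fitted by iteratively reweighted least squares in plain NumPy so the example stays self-contained. The simulated portfolio, the true coefficients, and the use of a second simulated sample as a stand-in for a synthetic dataset are all assumptions made for illustration.

```python
import numpy as np


def fit_poisson_glm(X, y, n_iter=25):
    """Fit a log-link Poisson GLM by iteratively reweighted least squares.

    X must contain an intercept in its first column.
    Returns (coefficients, standard errors).
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start from the overall claim frequency
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu              # working response
        XtWX = X.T @ (X * mu[:, None])       # IRLS weights equal mu for Poisson/log
        beta = np.linalg.solve(XtWX, X.T @ (mu * z))
    mu = np.exp(X @ beta)
    cov = np.linalg.inv(X.T @ (X * mu[:, None]))
    return beta, np.sqrt(np.diag(cov))


def simulate_portfolio(n, rng, true_beta):
    """Hypothetical frequency data: intercept, standardized age, urban indicator."""
    age_scaled = rng.normal(size=n)
    urban = rng.binomial(1, 0.4, size=n)
    X = np.column_stack([np.ones(n), age_scaled, urban])
    y = rng.poisson(np.exp(X @ true_beta))
    return X, y


rng = np.random.default_rng(4)
true_beta = np.array([-2.0, -0.3, 0.5])

X_orig, y_orig = simulate_portfolio(50_000, rng, true_beta)
# Stand-in for the synthetic dataset produced by the generator under evaluation.
X_synth, y_synth = simulate_portfolio(50_000, rng, true_beta)

beta_orig, se_orig = fit_poisson_glm(X_orig, y_orig)
beta_synth, se_synth = fit_poisson_glm(X_synth, y_synth)

# Coefficient gap expressed in combined standard errors.
z_gap = (beta_orig - beta_synth) / np.sqrt(se_orig ** 2 + se_synth ** 2)

for name, bo, bs, z in zip(["intercept", "age", "urban"], beta_orig, beta_synth, z_gap):
    print(f"{name:<10} original={bo:+.3f}  synthetic={bs:+.3f}  gap={z:+.2f} SE")
```

Expressing each gap in standard errors gives a scale for judging whether a difference is material or just sampling noise. The comparison could be extended to predicted frequencies by rating cell, which is closer to what a rate indication actually uses.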

Feature importance stability. For tree-based models (gradient boosting, random forests) used increasingly in ratemaking, the ranking and relative magnitude of feature importances should be stable between models trained on synthetic versus original data. Earnix's testing on the 40,000-observation auto insurance dataset confirmed this stability for gradient boosting models.
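Stability of this kind can be summarized with a rank correlation plus the largest shift in importance share. The importance values below are hypothetical numbers chosen for illustration, not Earnix results, and the ranking helper assumes there are no ties.

```python
import numpy as np


def rank_descending(values):
    """Rank 1 is the most important feature. Assumes no tied values."""
    values = np.asarray(values)
    order = np.argsort(-values)
    ranks = np.empty(values.size, dtype=int)
    ranks[order] = np.arange(1, values.size + 1)
    return ranks


def spearman_rank_correlation(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return np.corrcoef(rank_descending(a), rank_descending(b))[0, 1]


features = ["driver_age", "territory", "vehicle_class", "annual_mileage", "prior_claims"]

# Hypothetical normalized importances from two gradient boosting fits.
imp_original = np.array([0.31, 0.24, 0.18, 0.15, 0.12])
imp_synthetic = np.array([0.29, 0.26, 0.15, 0.17, 0.13])

rho = spearman_rank_correlation(imp_original, imp_synthetic)
max_share_gap = np.abs(imp_original - imp_synthetic).max()

for name, r_o, r_s in zip(features, rank_descending(imp_original),
                          rank_descending(imp_synthetic)):
    print(f"{name:<15} rank on original={r_o}  rank on synthetic={r_s}")
print(f"Spearman rho={rho:.3f}  largest importance shift={max_share_gap:.3f}")
```

In this example two mid-ranked features swap places, which gives a rank correlation of 0.9. Whether a swap like that is acceptable is a judgment call that depends on how the importances feed into variable selection.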

Privacy metrics. On the privacy side, membership inference tests measure whether an attacker with access to the synthetic data and knowledge of a specific individual can determine whether that individual was in the original dataset. Recent work on quantifying membership disclosure risk using kernel density estimators provides a principled framework for this assessment. The goal is a synthetic dataset where membership inference attacks perform no better than random guessing.
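As a rough illustration of the idea, the sketch below runs a simple distance-based membership attack: the attacker guesses that a record was in the training data when a synthetic record lies unusually close to it. This is a basic heuristic, not the kernel-density-based disclosure framework mentioned above, and the data and both generators are hypothetical. An AUC near 0.5 means the attack is no better than random guessing.

```python
import numpy as np


def nearest_distance(queries, reference):
    """Euclidean distance from each query row to its closest reference row."""
    diffs = queries[:, None, :] - reference[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def attack_auc(member_scores, nonmember_scores):
    """Chance that a random member outscores a random non-member (ties count half)."""
    diff = member_scores[:, None] - nonmember_scores[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()


rng = np.random.default_rng(5)

# Two standardized features; members trained the generator, non-members did not.
members = rng.normal(size=(500, 2))
nonmembers = rng.normal(size=(500, 2))

# Leaky generator: near-copies of the training records.
near_copy = members + rng.normal(scale=0.001, size=members.shape)

# Smoother generator: KDE-style sampling with a wide kernel.
centers = members[rng.integers(0, 500, size=500)]
smoothed = centers + rng.normal(scale=0.5, size=(500, 2))


def membership_auc(synthetic):
    """Attack score is negative distance: closer to synthetic data looks more like a member."""
    member_scores = -nearest_distance(members, synthetic)
    nonmember_scores = -nearest_distance(nonmembers, synthetic)
    return attack_auc(member_scores, nonmember_scores)


auc_near_copy = membership_auc(near_copy)
auc_smoothed = membership_auc(smoothed)
print(f"attack AUC: near-copy generator={auc_near_copy:.3f}  smoothed generator={auc_smoothed:.3f}")
```

The wide kernel buys privacy at the expense of fidelity, so this test is best read alongside the fidelity metrics above rather than in isolation.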

Tracking the development of these validation approaches makes one pattern clear: no single metric suffices. Actuaries evaluating synthetic data for production use should apply the full battery of univariate, bivariate, multivariate, and privacy metrics before making any reliance determination. This is consistent with the professional judgment framework in ASOP No. 56, which requires actuaries to assess whether data used as model inputs is consistent with the intended purpose.

The Regulatory Acceptance Gap: Zero State DOIs Have Blessed Synthetic-Data Filings

This is the central bottleneck. The CAS has validated the methodology through its prize selection process. Carrier data science teams are experimenting with synthetic data generation. But no state department of insurance has formally accepted a rate filing where synthetic data was cited as a primary modeling input, and no state has issued guidance on how synthetic data should be documented or validated in the filing process.

Rate filings submitted through SERFF are evaluated against three standards: rates must be adequate, not excessive, and not unfairly discriminatory. Reviewing actuaries at state DOIs assess whether the data, methodology, and assumptions underlying a rate indication are sound. When that data is synthetic rather than actual policyholder experience, the reviewing actuary faces a novel question: how do you validate that artificial data supports real pricing conclusions?

The NAIC's AI evaluation tool for predictive model rate filings, which 12 states piloted in 2025, focuses on model transparency, bias testing, and outcome validation. It does not address synthetic training data as a separate category requiring specific documentation or testing. The NAIC Model Bulletin on AI use by insurers, adopted by 24 states, requires data governance controls covering data quality, integrity, bias, and suitability, but does not specifically reference synthetic data generation as a distinct data practice.

This creates a practical problem for actuaries. An actuary who builds a pricing model using synthetic data can demonstrate that the model performs comparably to one built on real data. But when filing that model, the actuary must represent the data inputs in the filing documentation. If the filing exhibit says "model trained on synthetic data generated via KDE from Company X's policyholder records," the reviewing actuary has no established framework for evaluating that statement. There is no precedent filing to compare against, no bulletin to cite, and no SERFF field designed for this disclosure.

The path forward likely runs through the CAS and SOA research infrastructure rather than through direct regulatory action. If the actuarial organizations publish practice notes or research reports establishing validation standards for synthetic data in ratemaking, those documents give reviewing actuaries a basis for evaluating filings. The iCAS Data Science and Analytics Forum at RPM 2026, held in Chicago on March 16-18, included call-for-presentation topics directly relevant to this question. As more research accumulates and practitioners develop standardized validation protocols, the regulatory acceptance gap should narrow.

ASOP No. 56 and Documentation Requirements for Synthetic Data

ASOP No. 56 (Modeling), effective since October 2020, provides the actuarial standards framework most directly applicable to synthetic data use. The standard requires actuaries to document data used as inputs to models and to assess whether that data is consistent with the intended purpose. It applies to any actuary whose professional judgment determines that model output will have a material effect on the intended user.

Several provisions of ASOP No. 56 bear directly on synthetic data:

Section 3.2 (Data). The actuary should consider the appropriateness of data used in the model, including the reasonableness, comprehensiveness, and consistency of the data. When synthetic data replaces or supplements real policyholder data, the actuary must assess whether the synthetic generation process preserved the properties that make the data appropriate for the modeling purpose. This assessment should be documented.

Section 3.4 (Model Testing). The actuary should perform model testing appropriate to the model's intended purpose. For models trained on synthetic data, testing should include comparison of model outputs (coefficients, predictions, loss ratios) between synthetic-trained and real-data-trained versions. Any material differences should be documented and their implications assessed.

Section 3.7 (Reliance on Others). If the synthetic data was generated by a data science team, a vendor, or an external party, the actuary must document the reliance and assess the qualifications and work product of the party. This is particularly relevant when the generation methodology (KDE, GAN, VAE) is outside the actuary's direct expertise.

Section 4.1 (Documentation). Documentation should be sufficient for another actuary to evaluate the work. For synthetic data, this means documenting the generation methodology, the fidelity validation results, the privacy assessment, and the rationale for using synthetic rather than real data. The documentation burden for synthetic data is higher than for real data because the actuary must justify both the data itself and the process that created it.

The gap in ASOP No. 56 is that it predates the practical availability of high-quality synthetic data generation tools. The standard's framers were not contemplating a world where actuaries would routinely substitute artificial data for policyholder experience data. A parallel compliance gap exists for machine learning in loss reserving, where ASOP requirements designed for traditional methods do not map cleanly onto ML model development workflows. The Actuarial Standards Board may eventually issue additional guidance, but in the interim, actuaries using synthetic data must apply the existing ASOP No. 56 framework with careful professional judgment.

Privacy Law Pressure: CCPA, State Insurance Privacy, and the EU AI Act

The business case for synthetic data is not purely technical. Regulatory pressure on how carriers handle policyholder data is intensifying on multiple fronts, and synthetic data offers a path to comply without sacrificing analytical utility.

CCPA and California consumer privacy. The California Consumer Privacy Act's automated decision-making provisions, which took full effect in 2026, require businesses to conduct risk assessments for processing activities that involve significant decisions about consumers. Insurance pricing qualifies. The CCPA also imposes data minimization requirements: businesses should collect and process only the personal information reasonably necessary for the disclosed purpose. Synthetic data eliminates the need to process actual personal information for model development, testing, and validation, directly supporting data minimization compliance. California announced its largest CCPA penalty to date in its first data minimization enforcement action, signaling that these provisions carry real enforcement weight.

State insurance privacy laws. Insurance companies operate under a patchwork of state privacy requirements that go beyond CCPA. The Gramm-Leach-Bliley Act exemption for financial institutions is narrower than many carriers assume: it covers personal information collected in connection with providing a financial product or service, but does not cover information collected through websites, marketing activities, mobile applications, or IoT devices like telematics. As carriers collect richer data streams for pricing models, an increasing share of that data falls outside GLBA protections and under broader state privacy laws. Synthetic data generated from the protected dataset can be shared, stored, and processed with reduced regulatory exposure.

EU AI Act. For cross-border carriers and reinsurers, the EU AI Act classifies insurance pricing AI as a high-risk system. The remaining provisions become applicable on August 2, 2026, with rules for high-risk areas like insurance access set for December 2, 2027. High-risk AI systems must meet data governance requirements: training, validation, and testing datasets must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose. Synthetic data generation offers a path to meet the representativeness requirement while avoiding the privacy exposure of distributing real policyholder data across EU jurisdictions for model training. The Act also mandates labeling of synthetic content in a machine-readable format where technically feasible, which creates a documentation requirement but not a prohibition.

The convergence of these three regulatory streams creates a practical incentive structure: carriers that develop robust synthetic data capabilities now will be better positioned to comply with tightening privacy requirements while maintaining their data-driven pricing advantage. Carriers that continue to rely entirely on real policyholder data for model development, testing, and sharing face increasing friction as privacy enforcement intensifies.

Practical Use Cases Beyond Model Training

The CAS prize paper focuses on ratemaking model development, but synthetic data has broader applications across the actuarial workflow.

Vendor collaboration. Carriers frequently share data with InsurTech vendors, consulting firms, and reinsurers for modeling, benchmarking, and analysis. Every data-sharing arrangement creates a privacy risk vector. Synthetic data allows carriers to share statistically representative datasets with external partners without transmitting actual policyholder records. Data-sharing agreements become simpler, regulatory review cycles shorten, and the carrier eliminates the reputational and legal risk of a vendor data breach exposing real customer information.

Data enrichment and augmentation. Synthetic data can augment existing datasets to simulate underrepresented segments. If a carrier is expanding into a new territory or vehicle class where it has limited historical data, synthetic data generated from similar segments can provide initial modeling inputs. Earnix describes this as enabling expansion scenarios, such as modeling higher-valued vehicles for regional expansion, and addressing rare events like specific fraud types through oversampling. This application is particularly valuable in commercial lines, where thin data segments are common.

Regulatory sandbox testing. Carriers developing new AI-driven products or pricing approaches can use synthetic data to test regulatory filing strategies without exposing proprietary policyholder data during the pre-filing consultation process. If a state DOI requests to see the carrier's data and methodology during a pre-filing review, synthetic data provides a way to demonstrate the approach without disclosing actual competitive pricing data.

Actuarial exam and education materials. The SOA and CAS currently use sanitized or simulated datasets for exam problems and educational content. KDE-generated synthetic datasets based on real insurance data would be more realistic than hand-constructed examples, improving the training value for candidates while maintaining complete privacy protection.

What Needs to Happen Next

The CAS Ratemaking Prize signals that the actuarial profession's research community considers synthetic data a serious and validated approach. But several developments are needed before synthetic data moves from research papers to production rate filings.

Standardized validation protocols. The actuarial organizations (CAS, SOA, or the Academy) should publish practice notes establishing minimum fidelity and privacy validation standards for synthetic data used in actuarial work products. These standards would give both practicing actuaries and regulatory reviewing actuaries a common framework for evaluating synthetic data quality. The iCAS Data Science and Analytics Forum is a natural venue for developing these protocols.

Regulatory pilot programs. One or more state DOIs should run a pilot program accepting rate filings that include synthetic data as a documented modeling input, with enhanced validation requirements. This mirrors the approach the NAIC took with its AI evaluation tool pilot across 12 states, starting with voluntary participation before establishing mandatory requirements.

ASOP guidance update. The Actuarial Standards Board should consider whether ASOP No. 56 needs supplemental guidance addressing synthetic data specifically, or whether the existing framework is sufficient when applied with appropriate professional judgment. The current ASOP development pipeline already includes multiple updates addressing data and modeling practices; synthetic data could be addressed within those workstreams.

Cross-carrier benchmarking. Industry organizations like ISO/Verisk, AAIS, or the CAS Research Committee could commission studies comparing synthetic data generation methods across multiple carrier datasets and lines of business, establishing which methods work best for which data characteristics. This would give carriers empirical guidance rather than requiring each to conduct its own evaluation from scratch.

Why This Matters for Actuaries

The actuarial profession is at an inflection point where the tools for data-driven pricing are advancing faster than the regulatory and professional standards frameworks that govern their use. Synthetic data is a case study in this dynamic.

The technical capability exists. Zamstein's prize-winning research and the supporting literature demonstrate that KDE-generated synthetic data can preserve the statistical properties actuaries need for pricing model development. The carrier interest is there: 79% openness to synthetic data is a striking figure for an industry that moves cautiously on new methodologies. The privacy pressure is intensifying: CCPA enforcement actions, 24-state NAIC AI bulletin adoption, and the EU AI Act's high-risk classification for insurance pricing all push carriers toward privacy-preserving alternatives.

What is missing is the regulatory acceptance layer. Until at least one state DOI processes a filing where synthetic data is a documented input and signals that such filings can be evaluated under existing standards, adoption will remain experimental. Actuaries who want to be ahead of this curve should be evaluating KDE and imputation-based synthetic generation methods on their own data now, building the validation frameworks that will be needed when regulatory acceptance catches up to the technical capability, and documenting their work in the rigorous manner ASOP No. 56 requires.

The CAS does not award its Ratemaking Prize to incremental work. This prize signals that the profession's research leadership considers synthetic data a substantive contribution to how actuaries can do their work. The gap between that signal and production deployment is now a matter of regulatory infrastructure, not technical feasibility.