How Insurance-Native AI Platforms Reframe the Carrier Build-vs-Buy Decision

Roots Automation entered May 2026 with two numbers it wants actuaries to take seriously: 300 million proprietary insurance documents in InsurGPT's training corpus, and 115 live carrier deployments, including 3 of the top 5 P&C carriers by premium. The platform launched under a new name, Bevaya, on May 28, rebranding from the prior Roots Automation product and repositioning the company's thesis around insurance-native model training rather than general-purpose AI adapted for insurance workflows. The two numbers, if they hold up under scrutiny, answer different questions. The 300-million-document claim addresses model accuracy and domain fluency on underwriting submission parsing and claims coverage analysis tasks. The 115-deployment claim addresses something actuaries care about separately: whether there is enough production evidence to satisfy the model validation obligations ASOP No. 56 places on any actuary who relies on a third-party platform output in a pricing or reserving workflow.

The launch timing was not incidental. The NAIC's AI Systems Evaluation Tool pilot, which began as a handful of states experimenting with market examination tools, expanded to 12 states by mid-2026 and now explicitly requires carriers to document how third-party AI models were trained, what datasets were used, and what controls govern model updates. That documentation requirement lands directly on the training-data question Bevaya has made central to its market positioning. Carriers that selected a general-purpose large language model and fine-tuned it on internal documents now face a different documentation burden than carriers using a platform vendor that can point to a 300-million-document training corpus as evidence of domain specificity. Whether that difference is material depends on what the evaluation tool actually tests, and that is where the actuarial analysis has to go beyond the press release.

What InsurGPT Actually Is: Ensemble Architecture vs. Fine-Tuned General Model

The distinction between an insurance-native model and a fine-tuned general-purpose model is technical but consequential for the evaluation an actuary must conduct under ASOP No. 56. A fine-tuned model, typically a GPT-4 class or similar architecture pre-trained on broad internet text, is adapted for insurance tasks by training on a smaller insurance-specific dataset. The model's fundamental language understanding comes from its base pre-training; the fine-tuning layer teaches it to apply that understanding to insurance terminology, document formats, and decision workflows. A purpose-built domain model, by contrast, learns language itself from domain data, meaning its internal representations of terms like "occurrence policy" or "loss development factor" are shaped by insurance usage patterns rather than by the full spectrum of English text that happens to mention insurance.

InsurGPT, as Roots Automation describes it, is an ensemble of specialized models rather than a single fine-tuned architecture. Ensemble methods in machine learning combine multiple model outputs to produce predictions that are more robust and better-calibrated than any individual model. Applied to insurance AI, the ensemble approach allows specific sub-models to handle specific task types: one model optimized for submission parsing, another for coverage determination, another for claims classification, with an orchestration layer combining their outputs for a final prediction. Roots Automation's public documentation positions InsurGPT as built from the ground up on insurance language and workflows, placing it architecturally closer to a domain-specific model family than to a general-purpose fine-tuned system.
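The routing-plus-orchestration pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Roots Automation's implementation: the sub-model functions, the keyword rules inside them, the `min_confidence` threshold, and the max-confidence combination rule are all hypothetical, and a production ensemble might instead use weighted voting or a learned combiner.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Prediction:
    task: str          # which sub-model produced the output
    confidence: float  # sub-model's self-reported confidence in [0, 1]


# Hypothetical stand-ins for trained sub-models. The keyword rules below only
# make the sketch runnable; they are not how any real platform scores documents.
def submission_parser(text: str) -> Prediction:
    return Prediction("submission_parsing", 0.90 if "acord" in text.lower() else 0.20)


def coverage_model(text: str) -> Prediction:
    return Prediction("coverage_determination", 0.85 if "endorsement" in text.lower() else 0.30)


def claims_classifier(text: str) -> Prediction:
    return Prediction("claims_classification", 0.80 if "loss date" in text.lower() else 0.25)


SUB_MODELS: Dict[str, Callable[[str], Prediction]] = {
    "submission": submission_parser,
    "coverage": coverage_model,
    "claims": claims_classifier,
}


def orchestrate(text: str, min_confidence: float = 0.5) -> Prediction:
    """Run every sub-model, keep the most confident output, else refer out."""
    best = max((model(text) for model in SUB_MODELS.values()),
               key=lambda p: p.confidence)
    if best.confidence < min_confidence:
        # Assumed fallback: no sub-model is confident enough to act on.
        return Prediction("refer_to_human", best.confidence)
    return best
```

For the actuary, the point of the sketch is that the combination rule is itself a modeling choice: two ensembles with identical sub-models can behave differently if the orchestration layer differs, which is why the combination methodology belongs in the vendor documentation request.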

The performance difference between these architectures is a matter of active debate in the insurance AI community, and it is discussed at length in our analysis of the domain-trained versus general LLM debate prompted by the CAS Seminar on Reinsurance. For well-defined, structured tasks like extracting loss run data from a standardized ACORD form, a fine-tuned general-purpose model and an insurance-native model may produce nearly identical accuracy. For tasks involving ambiguous policy language, jurisdiction-specific coverage interpretations, or complex claims documents with non-standard formats, domain-specific training data becomes more valuable: the model has encountered similar patterns during training and its internal representations reflect insurance-specific meaning rather than interpolated general-text meaning.

What the Bevaya press release does not provide, and what any actuary reviewing the platform for ASOP No. 56 purposes must demand, is held-out benchmark data comparing InsurGPT outputs against a fine-tuned general-purpose model on representative samples of the carrier's own document types. Without that comparison, the training corpus size is a marketing metric. With it, the claim becomes testable, and the test result determines whether the insurance-native label justifies actuarial reliance.
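One way to structure that held-out comparison is a paired test: both models label the same documents, and the analysis looks at the cases where exactly one of them is right. The sketch below assumes the carrier already holds ground-truth labels for a held-out sample; the function name and argument names are hypothetical, and the exact two-sided binomial test on discordant pairs (an exact McNemar test) is one reasonable choice among several, not a method prescribed by ASOP No. 56.

```python
from math import comb
from typing import Dict, Sequence


def paired_benchmark(truth: Sequence[str],
                     native: Sequence[str],
                     general: Sequence[str]) -> Dict[str, float]:
    """Compare two models on the same held-out documents."""
    n = len(truth)
    if n == 0 or len(native) != n or len(general) != n:
        raise ValueError("all three sequences must be non-empty and equal length")

    acc_native = sum(p == t for p, t in zip(native, truth)) / n
    acc_general = sum(p == t for p, t in zip(general, truth)) / n

    # Discordant pairs: documents where exactly one model is correct.
    native_only = sum(a == t and b != t for t, a, b in zip(truth, native, general))
    general_only = sum(b == t and a != t for t, a, b in zip(truth, native, general))
    discordant = native_only + general_only

    if discordant == 0:
        p_value = 1.0  # the models never disagree in correctness
    else:
        # Exact two-sided binomial test with p = 0.5 on the discordant pairs.
        k = min(native_only, general_only)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
        p_value = min(1.0, 2 * tail)

    return {
        "accuracy_native": acc_native,
        "accuracy_general": acc_general,
        "native_only_correct": native_only,
        "general_only_correct": general_only,
        "p_value": p_value,
    }
```

A small p-value on a representative sample supports the claim that the accuracy gap is not noise; a large one means the sample cannot distinguish the two models, whatever the corpus size behind either.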

115 Deployments and the ASOP No. 56 Production Evidence Standard

The 115 production deployments Roots Automation reports at launch carry specific weight under ASOP No. 56's requirements for model validation. Section 3.5 of ASOP No. 56 requires the actuary to evaluate the reasonableness of the model, including assessments of model limitations and whether the model has been used in comparable contexts. A vendor platform with 115 live deployments across the carrier market has accumulated a production track record that a carrier-built model or a recently fine-tuned general-purpose model cannot match on day one. This is the same logic that makes Sedgwick's data advantage in claims AI credible: a dataset five times larger than the nearest competitor's does not just make the model more accurate in theory; it provides production evidence of accuracy across a span of claim types that new entrants cannot replicate quickly.

That track record creates a two-sided implication for the ASOP No. 56 analysis. On the validation side, 115 deployments, including 3 of the top 5 P&C carriers, provide a meaningful base of comparable use cases from which an actuary can, in principle, request production performance data, error rates on specific task types, and model drift statistics over the deployment period. On the governance side, the same 115 deployments represent 115 sets of carrier-specific data that have flowed through Roots Automation's platform infrastructure, raising data governance questions the actuary must explicitly address under Section 3.4 of ASOP No. 56, which covers the actuary's responsibility to understand data quality and limitations.

The third-party model governance obligation extends further when the platform uses an ensemble of sub-models. If each sub-model in InsurGPT is independently updated on new training data, and if Roots Automation pushes model updates across its installed base without requiring explicit carrier approval for each version change, the actuary faces a model stability problem: the system producing outputs in the carrier's workflow today may not be the same system producing outputs in six months. ASOP No. 56 does not prohibit dynamic model updates, but it does require the actuary to understand the frequency and magnitude of model changes and to assess whether those changes would materially affect outputs the actuary is relying on. A vendor contract that gives the carrier the right to freeze a specific model version for the duration of an annual pricing cycle addresses this risk directly. A contract without that provision leaves the actuary without a stable governance baseline for rate filings.

The AM Best survey of approximately 150 rated carriers and MGAs, which found that only 18% of carriers cited third-party model risk as a challenge despite 68% using third-party AI solutions, suggests most carrier procurement teams have not yet worked through this contract-terms analysis. That accountability gap is where a market conduct examiner in one of the 12 NAIC pilot states will look first when reviewing AI governance documentation.

Build vs. Buy in 2026: What Changed Since 2023

The build-vs-buy calculus for insurance AI looked materially different in 2023 than it does today, and three structural shifts explain most of the change. In 2023, building a proprietary insurance AI system meant fine-tuning a GPT-class model on internal documents, an engineering effort that a mid-size carrier could reasonably scope at six to twelve months with an internal data science team of modest size. By mid-2026, that same effort includes three additional cost centers that did not exist or were not material three years ago.

The first shift is regulatory documentation burden. NAIC AI evaluation tool requirements now reach into training data and governance documentation that previously had no regulatory visibility. A carrier that built a proprietary model in 2023 typically documented the model's intended use, its validation methodology, and its performance metrics. A carrier building that same model today must also document the training dataset's provenance, any third-party data included, model update cadence, and the governance controls that ensure bias testing remains current. That documentation work, for a proprietary build, falls entirely on the carrier's actuarial and data science teams. A platform vendor like Roots Automation absorbs a portion of this overhead and can spread it across its installed base.

The second shift is retraining cost. Language models degrade on domain-specific tasks when the underlying documents change, whether through new policy form filings, evolving claims patterns, or shifts in medical coding in health lines. A vendor platform can spread the cost of retraining across 115 carrier clients; a carrier building proprietary AI absorbs the full retraining cost alone. For a system trained on hundreds of millions of documents, retraining is not a trivial infrastructure cost, and it recurs as the market evolves. The Guidewire build-vs-buy analysis applies here in a different register: just as carriers evaluating proprietary rating engine construction must account for ongoing maintenance costs rather than just initial build costs, AI platform economics require the same lifecycle view.

The third shift is actuary time for model validation. ASOP No. 56 requires the actuary to evaluate any model whose outputs are used in actuarial work products. For a proprietary build, the actuary has direct access to training data, model architecture, and performance metrics, which makes validation more thorough but also more time-consuming. For a third-party platform, the actuary must work from vendor-provided documentation, reference customer contacts, and whatever audit rights the contract provides. The total actuary time for validating a third-party platform is typically lower than for a proprietary build, but only if the vendor's documentation is detailed enough to support an ASOP No. 56 compliant review. Bevaya's 115-deployment track record improves the odds of that documentation existing; it does not guarantee adequacy.

For carriers below the top 10 by premium, the economic case for building proprietary insurance AI has weakened considerably since 2023. The regulatory documentation burden, retraining cost, and actuarial validation overhead all scale with the carrier's data science investment, and for mid-size carriers that investment cannot match the platform economics a vendor serving 115 deployments achieves. For the top 5 carriers, where the strategic value of data control and model differentiation remains high, the build case persists, but it requires a more explicit accounting of the regulatory documentation burden than most carriers have budgeted. The comparison between Allstate's proprietary ALLIE stack and platform-oriented approaches like State Farm's OpenAI Frontier partnership captures both ends of this spectrum: large carriers are not converging on a single model.

The NAIC Third-Party Vendor Registry and Its Asymmetric Effect on Vendors

The NAIC's proposed Third-Party Model Vendor Registry adds another dimension to the vendor selection analysis. As described in regulatory analysis from Swept AI and the Fenwick law firm's review of the evaluation pilot's vendor implications, the registry framework would require AI vendors serving the insurance market to register their models, provide training data descriptions, and submit to standardized documentation reviews before their platforms could be used by carriers subject to state market conduct examination. The registry is proposed, not finalized, and its timeline for formal adoption through the NAIC's Big Data and Artificial Intelligence Working Group depends on how quickly states adopt the evaluation tool itself.

The registry proposal distinguishes, implicitly, between general-purpose AI vendors and insurance-native vendors in a way that affects their relative documentation burden. A general-purpose platform provider would need to document the insurance-specific customization applied to a base model, plus the base model's training data composition, which foundation model labs do not typically release in detail. A carrier using a version of GPT-4 fine-tuned on its own documents through an enterprise API relationship would face a registry documentation gap that the pure insurance-native vendor sidesteps: the base model's training data is not the carrier's or the fine-tuning vendor's to disclose.

For a platform like Bevaya, whose entire value proposition rests on the training corpus, a formal registry process where that corpus can be reviewed and validated against stated documentation would actually strengthen the marketing claim. If Roots Automation's 300 million proprietary insurance documents can be confirmed through a standardized registry process, the insurance-native credential moves from assertion to verified status. That verification is worth more than any number of press releases. The NAIC's Spring 2026 National Meeting framework for distinguishing agentic AI as a regulatory category creates the governance structure into which a vendor registry would slot; the two initiatives are complementary rather than redundant.

This regulatory dynamic shifts the vendor selection calculus in a specific direction: vendors whose training data governance is most transparent gain a regulatory advantage as state oversight deepens. That is not a prediction that general-purpose models will be excluded from carrier use; it is a prediction that carriers using them will bear higher compliance overhead than carriers using vendors who can produce complete training documentation on demand. The AIG and McGill agentic AI follow-market arrangement is an example of a carrier-level response to this overhead: structured partnerships that route third-party AI through a documented governance layer reduce the compliance gap between insurance-native and general-purpose approaches by adding explicit model documentation at the carrier-partner boundary.

Applying ASOP No. 56 to a Black-Box Ensemble: The Three-Stage Review

Section 3.1 of ASOP No. 56 requires the actuary to have a clear understanding of the model's purpose and the methods and assumptions used in its construction. For a black-box ensemble like InsurGPT, "clear understanding" is a standard that requires explicit documentation from the vendor rather than reliance on product marketing materials. Roots Automation's launch documentation describes the platform's capabilities but does not, in publicly available materials, provide the sub-model architecture details, training dataset composition at the document type level, or the ensemble combination methodology that an ASOP No. 56 review requires.

In practice, the actuary's ASOP No. 56 review of a third-party platform like Bevaya proceeds in three stages. The first stage is vendor due diligence: obtaining from Roots Automation the documentation packages that describe each sub-model's training data, validation benchmarks on held-out insurance datasets, known failure modes, and model update governance. This documentation should be incorporated into the actuarial work file as evidence of the reliance exercise required under ASOP No. 56 Section 3.7. The EXL AI patent portfolio provides a useful benchmark for what thorough vendor model documentation looks like in practice: EXL's insurance infrastructure patents describe model architectures, training methodologies, and domain-specific data pipelines with the specificity that an actuarial due diligence review requires.

The second stage is use-case scoping. ASOP No. 56 does not require identical validation standards for all model uses; it requires validation proportionate to the materiality of the model's outputs on the actuarial work product. A carrier using Bevaya to classify incoming submission documents for routing, where the model's output affects workflow efficiency but not the ultimate underwriting decision, faces a lighter validation burden than a carrier using Bevaya's coverage determination outputs to inform reserve estimates flowing into Schedule P. The actuary must explicitly document which use cases meet the materiality threshold that triggers a more intensive validation review.

The third stage is ongoing monitoring. ASOP No. 56 requires the actuary to have a plan for monitoring a model's continued appropriateness. For a vendor platform with 115 deployments and regular model updates, this monitoring plan must address how the carrier will detect when a Roots Automation model update changes output distributions in ways that affect the actuarial work products depending on those outputs. The most practical mechanism is a holdout sample of documents where ground-truth labels are known and against which the platform's outputs can be tested each time Roots Automation notifies the carrier of a model version change. Without that testing cadence, the actuary's monitoring plan is aspirational rather than operational.
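The holdout check described above can be reduced to a small, repeatable report run on each notified version change. This is a sketch under stated assumptions: the function name, the two tolerance parameters, and their default values are hypothetical placeholders that a carrier would set from its own materiality analysis, not thresholds drawn from ASOP No. 56 or from any vendor contract.

```python
from typing import Dict, Sequence, Union


def version_change_report(truth: Sequence[str],
                          old_outputs: Sequence[str],
                          new_outputs: Sequence[str],
                          max_accuracy_drop: float = 0.02,
                          max_changed_share: float = 0.05) -> Dict[str, Union[float, bool]]:
    """Compare two model versions on the same labeled holdout sample."""
    n = len(truth)
    if n == 0 or len(old_outputs) != n or len(new_outputs) != n:
        raise ValueError("all three sequences must be non-empty and equal length")

    old_accuracy = sum(o == t for o, t in zip(old_outputs, truth)) / n
    new_accuracy = sum(w == t for w, t in zip(new_outputs, truth)) / n

    # Share of holdout documents whose output changed between versions,
    # regardless of whether the change was an improvement.
    changed_share = sum(o != w for o, w in zip(old_outputs, new_outputs)) / n

    needs_review = (
        (old_accuracy - new_accuracy) > max_accuracy_drop
        or changed_share > max_changed_share
    )
    return {
        "old_accuracy": old_accuracy,
        "new_accuracy": new_accuracy,
        "changed_share": changed_share,
        "needs_review": needs_review,
    }
```

Tracking the changed share separately from accuracy matters because an update can leave headline accuracy flat while shifting which documents it gets right, and that shift can still move the inputs to a reserving or pricing work product.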

None of these steps is an obstacle to adopting Bevaya in a carrier's pricing or reserving workflow. They are the documentation chain that makes the adoption defensible to a state market conduct examiner or a peer reviewer of the actuarial opinion. Carriers and actuaries that skip the documentation chain on the assumption that a vendor's deployment track record substitutes for individual validation are misreading ASOP No. 56. The track record is evidence the actuary can reference; it is not a validation waiver.

The Data Moat: 300 Million Documents vs. Open-Source Convergence

Whether 300 million proprietary insurance documents constitutes a defensible competitive advantage depends on a question the insurance AI market has not yet answered: how much does training data composition affect model performance on tasks that vary across carriers, and how quickly is that advantage eroded by open-source model improvement?

The 2024 to 2026 period demonstrated that open-source foundation models close performance gaps with proprietary models faster than most technology analysts predicted. The release of the Llama model family and comparable open architectures gave carriers direct access to large language model base weights that could be fine-tuned on proprietary data without licensing a foundation model from a commercial lab. For carriers with substantial in-house data science capability, the combination of open-source model weights and internally held insurance documents may provide a comparable training signal to what Bevaya's corpus offers, at a lower licensing cost. The Insurity vendor audit documented how this "AI-native" claim plays out in the core system vendor market, where vendor marketing frequently outpaces demonstrable delivery.

The counterargument is scale. A single large carrier may hold 20 to 30 million insurance documents across its policy administration, claims, and underwriting systems. That is a meaningful training dataset, but it represents one carrier's slice of the market, with document types, claim patterns, and coverage language particular to its own book of business. A 300-million-document corpus, assembled from multiple carriers over several years of platform operation, spans a wider range of policy forms, claim types, and actuarial language patterns than any single carrier's data alone. Multi-carrier training captures edge cases that single-carrier data misses, and edge cases are precisely where language models trained on insufficient domain data most often fail on coverage determinations and claims severity assessments.

A further dimension is recency. A training corpus assembled in 2024 on commercial auto documents may not adequately represent coverage disputes emerging in 2026 around autonomous vehicle liability endorsements, telematics-based rating modifications, or new loss types that standard policy forms address ambiguously. For training data to remain a defensible moat, it requires continuous acquisition of new documents as the insurance market evolves. Roots Automation's platform generates new training signal from its 115 active deployments, creating a feedback loop where production use adds to the training corpus in ways a static dataset cannot replicate. That dynamic accumulation, rather than the 300 million initial documents, is the more durable competitive argument. The comparison to Sedgwick Omni's dataset advantage in claims is instructive: the moat is not the initial size but the ongoing accumulation rate that active deployments enable.

Whether open-source model improvement erodes this advantage within two to three years depends on whether the next generation of open-source architectures achieves equivalent multi-carrier breadth through publicly available insurance text rather than proprietary documents. The publicly available insurance text on the internet (regulatory filings, NAIC publications, court decisions on coverage disputes, and academic actuarial literature) is substantial but not equivalent to the internal document types that drive most underwriting and claims AI tasks. Policy language, loss run formats, claims adjuster notes, and underwriting questionnaire responses are not publicly available at scale. That limits the ceiling for open-source models without access to carrier-internal data, at least for the task types where insurance-native training data is most valuable. The data moat is narrower than Roots Automation's marketing implies and wider than open-source advocates acknowledge; the honest answer is task-specific.

The Actuarial Evaluation Checklist for Bevaya and Comparable Platforms

Actuaries assessing whether to rely on InsurGPT outputs in actuarial work should request the following from Roots Automation before incorporating platform outputs into a pricing or reserving work product. First, the training corpus composition by document type: what percentage of the 300 million documents are policy forms versus claims files versus underwriting questionnaires, and how are the proportions distributed across lines of business? An ensemble trained heavily on personal lines documents may not perform equivalently on specialty commercial tasks even if the headline document count is large.

Second, held-out benchmark performance on the carrier's specific task types: if the carrier's primary use case is construction wrap-up submission parsing, the actuary needs benchmark data on that specific document type rather than aggregate platform accuracy statistics. Aggregate accuracy figures can obscure performance gaps on the specific tasks that drive actuarial work product outcomes.
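The way an aggregate figure can hide a task-level gap is easy to show with a per-task breakdown. The sketch below uses invented illustrative records, not vendor or carrier data; the function name, the task labels, and the `__aggregate__` key are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_task(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return accuracy per task type plus the pooled aggregate."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for task, correct in records:
        totals[task] += 1
        hits[task] += int(correct)
    if not totals:
        raise ValueError("records must not be empty")

    result = {task: hits[task] / totals[task] for task in totals}
    # Pooled accuracy is dominated by whichever task has the most documents.
    result["__aggregate__"] = sum(hits.values()) / sum(totals.values())
    return result


# Invented sample: 90 personal auto documents, all parsed correctly, and
# 10 construction wrap-up submissions, of which only 5 are parsed correctly.
sample = (
    [("personal_auto", True)] * 90
    + [("wrap_up", True)] * 5
    + [("wrap_up", False)] * 5
)
report = accuracy_by_task(sample)
```

On this sample the aggregate accuracy is 0.95 while wrap-up accuracy is 0.5. A carrier whose actuarial use case is wrap-up parsing would be relying on the second number, not the first.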

Third, model update frequency and notification procedures: the contract should specify how and when Roots Automation notifies carriers of model version changes, what the carrier's right is to defer an update during an active rate filing cycle, and what rollback rights exist if an update degrades performance on previously validated tasks.

Fourth, the vendor's own ASOP No. 56 positioning documentation: Roots Automation, operating in a market where 68% of carriers use third-party AI but only 18% cite third-party model risk as a challenge, should be able to provide a documentation package specifically designed to support the actuary's Section 3.7 reliance disclosure. Vendors that cannot produce this documentation should be treated as offering supplementary information tools rather than primary model inputs for actuarial work products.

If the vendor cannot produce documentation adequate to support the actuary's reliance disclosure, the correct response is not to decline adoption entirely. It is to restrict the platform's outputs to non-actuarial use cases, use them as supplementary information with independent verification, and require the vendor to develop adequate documentation as a condition of expanding the actuarial use scope. That is a governance posture, not a procurement rejection.

Why This Matters for Actuarial Practice

The Bevaya launch and the NAIC evaluation pilot's documentation requirements converge on a single obligation that actuaries carrying model reliance in their work products now face more explicitly than they did two years ago: document the vendor's training data provenance before signing off on outputs that affect reserve estimates, rate filings, or risk selections. The insurance-native versus general-purpose distinction is not an academic question about model architecture. It is a documentation question about what evidence supports the actuary's reliance claim under ASOP No. 56 and survives a market conduct examiner's review in one of the 12 pilot states.

Three specific implications follow. Vendor selection is now a governance decision with actuarial sign-off requirements, not purely a technology decision owned by the CIO. When a carrier's pricing team adopts Bevaya for submission triage, an actuary who relies on those outputs in a pricing work product inherits a validation obligation that cannot be delegated back to the technology team. Contract terms matter actuarially: the right to receive model update notifications, to freeze a specific version during an annual rate filing cycle, and to access benchmark performance data are governance mechanisms, not legal boilerplate. And the NAIC evaluation tool's 12-state expansion creates a near-term deadline: carriers in pilot states that have not yet documented their third-party AI training data governance should treat the 2027 examination cycle as the point at which that documentation gap becomes a market conduct finding.

Bevaya's launch positions Roots Automation as the clearest current test of whether insurance-native training data, documented at scale, can earn the regulatory and actuarial confidence that general-purpose vendors must earn through a different, and currently more expensive, governance path. Whether that positioning holds depends not on the 300 million number in the press release but on whether the documentation package produced in response to actuarial due diligence requests survives the scrutiny ASOP No. 56 and the NAIC pilot now make unavoidable.
