From reviewing carrier AI disclosures across 40-plus earnings calls and regulatory filings in the past 12 months, AIG's 88% agreement metric stands out as the most specific governance data point any top carrier has voluntarily shared. Most carriers describe AI deployments in qualitative terms: "promising results," "significant efficiency gains," "enhanced decision quality." AIG's CEO Peter Zaffino did something different during the May 1, 2026 earnings call. He put a number on it. That number, and the methodology behind it, deserves close examination because it may preview the metric structure regulators will eventually require.
The insurance industry faces a measurement problem. Carriers deploying AI in claims, underwriting, and fraud detection need governance KPIs that satisfy three audiences simultaneously: regulators who want evidence of consumer protection, boards who want proof of return on investment, and the actuarial teams responsible for validating model outputs under ASOP No. 56. Traditional machine learning metrics like accuracy, precision, recall, and F1 scores speak fluently to data scientists but leave regulators and board members struggling with interpretation. Agreement rates, which measure how often AI outputs match the independent judgment of qualified human professionals, offer a metric that all three audiences can parse without technical translation.
What AIG's 88% Agreement Rate Actually Measures
During AIG's Q1 2026 earnings call on May 1, Zaffino described the evaluation methodology with unusual specificity. A professional claims adjuster reviewed 100 claims, ranking each as fraudulent or legitimate with documented reasoning. Anthropic's Claude model independently assessed the same 100 claims with no access to the adjuster's work. The two sets of determinations were then compared.
Zaffino's exact words: "Claude's determination aligned with the adjusters 88% of the time, a very strong baseline for an out-of-the-box model with no claim-specific tuning." Three elements of this statement carry particular weight for governance purposes.
First, "out-of-the-box" means the model received no fine-tuning on AIG's proprietary claims data. The 88% figure represents Claude's general reasoning capability applied to insurance fraud indicators: timeline inconsistencies, geolocation mismatches, linguistic fingerprints, prior claim patterns, document tampering signals, and coverage gaps. A tuned model deployed in production should, in theory, perform better. That makes 88% a floor rather than a ceiling for what AIG's eventual production system might achieve.
Second, the sample size of 100 claims warrants actuarial scrutiny. For a binary classification task (fraudulent vs. legitimate), 100 observations produce a 95% confidence interval of roughly plus or minus 6.4 percentage points around the observed 88%. The true agreement rate could plausibly fall anywhere between approximately 82% and 94%. Carriers building governance reporting around agreement rates will need substantially larger sample sizes to narrow confidence intervals to levels that satisfy regulatory precision expectations. A sample of 1,000 claims would reduce that interval to plus or minus 2 percentage points, and 10,000 would bring it under 1 point.
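As a rough illustration of how sample size drives the precision of a reported agreement rate, the sketch below computes normal-approximation (Wald) confidence intervals for an observed 88% agreement at the sample sizes discussed above. The 88% point estimate mirrors AIG's disclosure; the choice of the Wald interval (rather than, say, Wilson) is a simplifying assumption.

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Observed agreement rate and candidate evaluation sample sizes.
p_hat = 0.88
for n in (100, 1_000, 10_000):
    lo, hi = wald_ci(p_hat, n)
    print(f"n={n:>6}: 95% CI = {lo:.1%} to {hi:.1%} (± {(hi - lo) / 2:.1%})")

# n=   100: 95% CI = 81.6% to 94.4% (± 6.4%)
# n=  1000: 95% CI = 86.0% to 90.0% (± 2.0%)
# n= 10000: 95% CI = 87.4% to 88.6% (± 0.6%)
```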
Third, the evaluation appears to have been conducted in partnership with Anthropic rather than as a purely internal AIG exercise. This distinction matters for governance documentation. Regulators evaluating AI performance disclosures will ask whether the testing was conducted by an independent party, by the model vendor, or by the carrier's own team. Each configuration carries different credibility weight in an audit context, with independent third-party validation scoring highest under the NAIC's emerging evaluation framework.
Beyond the fraud detection metric, Zaffino disclosed AIG Assist results from Lexington middle market property: a 30% improvement in quoting volume, a 55% reduction in time-to-quote, and an approximately 40% increase in submission binding. These operational metrics complement the agreement rate by showing downstream business impact, but they measure efficiency rather than judgment quality. The 88% figure uniquely addresses the governance question: does the AI reach the same conclusion a qualified human would?
Agreement Rates vs. Traditional AI Performance Metrics
The distinction between agreement rates and conventional model performance metrics is both technical and practical. Understanding that distinction clarifies why agreement rates may gain regulatory traction faster than metrics the data science community considers more rigorous.
Accuracy, precision, recall, and F1 scores all measure model performance against a labeled ground truth dataset. They answer the question: "How well does the model match the correct answer?" This framing assumes a definitive correct answer exists for every prediction. In insurance claims adjudication, fraud detection, and underwriting risk assessment, definitive correct answers frequently do not exist. Two experienced adjusters can legitimately disagree on whether a claim is fraudulent based on the same evidence. The "ground truth" in these domains is itself a human judgment call, not an objective fact.
Agreement rates reframe the question. Instead of asking "Is the model right?", they ask "Does the model reach the same conclusion a qualified professional would?" This framing acknowledges that expert judgment is the operational standard, not mathematical truth. When a claims adjuster and an AI model both classify a claim as legitimate, that concordance produces a governance-relevant signal regardless of whether the claim ultimately proves fraudulent years later.
| Metric | What It Measures | Audience | Governance Utility |
|---|---|---|---|
| Accuracy | Correct predictions / total predictions | Data scientists | Low: misleading with imbalanced classes (fraud rates of 5-10%) |
| Precision | True positives / predicted positives | Data scientists, model ops | Medium: meaningful for false-positive costs but requires technical context |
| Recall | True positives / actual positives | Data scientists, compliance | Medium: captures missed fraud but needs pairing with precision |
| F1 Score | Harmonic mean of precision and recall | Data scientists | Low: composite metric that obscures the tradeoffs boards need to see |
| Raw Agreement Rate | Concordance between AI and expert human | All three audiences | High: intuitive, comparable across functions, audit-friendly |
| Cohen's Kappa | Agreement adjusted for chance concordance | Statisticians, actuaries | High: corrects for base-rate inflation in agreement percentages |
Cohen's Kappa deserves particular attention from actuaries evaluating agreement metrics. Raw agreement percentages can be inflated by base rates. If 90% of claims are legitimate, a model that classifies everything as legitimate achieves roughly 90% raw agreement with any adjuster who also labels most claims legitimate. Kappa corrects for this by measuring agreement beyond what chance alone would produce. The benchmark scale from Landis and Koch, widely cited alongside Cohen's statistic, treats Kappa values between 0.61 and 0.80 as "substantial" agreement and values above 0.81 as "almost perfect" agreement. For AIG's 88% raw agreement, the Kappa value depends on the fraud prevalence in the 100-claim sample, which was not disclosed. If the sample contained 20% fraudulent claims (typical for a flagged-for-review pool), the corresponding Kappa falls roughly between 0.5 and 0.7 depending on how the 12 disagreements split by direction, and lands near 0.62 under an even split, qualifying as substantial but not almost perfect agreement.
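To make the base-rate correction concrete, here is a minimal sketch that computes Cohen's Kappa from a hypothetical 2x2 agreement table consistent with the disclosed figures: 100 claims, 88 agreements, 20% adjuster-flagged fraud prevalence, and an even split of the 12 disagreements. The cell counts are illustrative assumptions, not disclosed data.

```python
def cohens_kappa(both_fraud: int, both_legit: int,
                 ai_only_fraud: int, adjuster_only_fraud: int) -> float:
    """Cohen's Kappa for a 2x2 agreement table (AI vs. adjuster)."""
    n = both_fraud + both_legit + ai_only_fraud + adjuster_only_fraud
    observed = (both_fraud + both_legit) / n

    # Marginal probabilities of a "fraud" call for each rater.
    ai_fraud_rate = (both_fraud + ai_only_fraud) / n
    adjuster_fraud_rate = (both_fraud + adjuster_only_fraud) / n

    # Chance agreement: both call fraud, or both call legitimate.
    expected = (ai_fraud_rate * adjuster_fraud_rate
                + (1 - ai_fraud_rate) * (1 - adjuster_fraud_rate))
    return (observed - expected) / (1 - expected)

# Hypothetical split: 14 claims both call fraudulent, 74 both call legitimate,
# 6 flagged only by the AI, 6 flagged only by the adjuster (88/100 agreement).
kappa = cohens_kappa(both_fraud=14, both_legit=74,
                     ai_only_fraud=6, adjuster_only_fraud=6)
print(f"kappa = {kappa:.3f}")  # ~0.625: "substantial" on the Landis-Koch scale
```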
The medical AI regulatory parallel is instructive. The FDA cleared 295 AI and ML medical devices in 2025 alone, 96% of them via the 510(k) pathway. Clinical validation for these devices typically uses reader studies comparing AI sensitivity and specificity against expert panels. Published concordance data shows AI sensitivity in radiology applications ranging from 84% to 93% versus clinician sensitivity of 76% to 90%. A key observation from JAMA Network Open's analysis of these clearances: "SaMD usually serves as an aid to clinicians; thus, the efficacy relies more on the device-clinician interaction than the achieved metric value of the SaMD alone." Insurance regulators tracking the FDA's approach may adopt similar reasoning, valuing agreement rates precisely because they measure the AI-human interaction rather than the AI in isolation.
Why Agreement Rates Resonate With Regulators
The NAIC's regulatory infrastructure is converging on a framework that implicitly favors agreement-rate metrics over purely statistical model performance measures. Three developments create this convergence.
The first is the four-tier AI risk taxonomy presented at the NAIC Spring 2026 National Meeting. The taxonomy classifies AI systems into unacceptable risk, high risk, medium risk, and low risk categories. NAIC staff asserted that most regulated AI systems in insurance would fall into the high-risk tier, defined as systems with the potential to cause significant harm if they fail or are misused. High-risk classifications trigger documentation requirements including performance metrics, bias testing, and governance controls. The taxonomy does not prescribe which metrics carriers must report, but the emphasis on "potential for harm" naturally points toward metrics that measure whether AI decisions align with the judgment of qualified professionals. A model with a 94% F1 score that disagrees with expert adjusters 30% of the time presents a different risk profile than a model with an 89% F1 score that agrees with adjusters 92% of the time.
The second development is the AI Systems Evaluation Tool pilot, launched March 2, 2026, across 12 states: California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin. The pilot runs through September 2026. Exhibit C of the evaluation tool requires carriers to describe performance metrics for high-risk AI systems. Regulators have broad discretion over which metrics they consider adequate. Iowa Insurance Commissioner Doug Ommen framed the pilot's purpose: "It's important to understand that the pilot itself will be very instructive." Instructive, in this context, means regulators are learning which carrier-reported metrics are actually useful for oversight. Agreement rates, because they translate directly into a regulator-comprehensible statement ("the AI agrees with the human expert X% of the time"), have an inherent advantage over metrics that require statistical literacy to interpret.
The third is the U.S. Treasury's Financial Services AI Risk Management Framework (FS AI RMF), published February 19, 2026. Developed with input from over 100 financial institutions and aligned with the NIST AI Risk Management Framework, the FS AI RMF contains 230 control objectives spanning governance, data management, model development, validation, monitoring, and consumer protection. The framework explicitly requires model validation independence, bias testing, drift detection, and explainability thresholds. While it does not mandate agreement rates specifically, its emphasis on "common control language" across financial institutions creates pressure toward metrics that are standardized and comparable. Agreement rates, defined consistently, are inherently comparable across carriers, lines of business, and AI use cases in a way that bespoke F1 scores computed on different test sets are not.
Colorado's SB 21-169 provides the most concrete state-level precedent. The law requires insurers to inventory every algorithm and external data source used in pricing, test for discriminatory outcomes, and submit annual compliance reports with CRO attestation. Auto and health insurers must comply by July 1, 2026. While the statute does not specify agreement rates, the annual reporting requirement creates an incentive to develop metrics that are both defensible and interpretable to a non-technical audience. Agreement rates fit that requirement precisely.
Hartford's Algorithmic Impact Assessment: The Qualitative Complement
If AIG's 88% agreement rate represents the quantitative frontier of carrier AI governance disclosure, Hartford's Algorithmic Impact Assessment represents the qualitative counterpart. Hartford became the first top-20 carrier to publish a voluntary assessment in February 2026, covering bias audits for ZIP code, age, and property type across production AI models.
The two approaches address different governance surfaces. Agreement rates measure output concordance: do the AI and the human reach the same conclusion? Algorithmic impact assessments measure process quality: does the carrier's AI governance program include adequate controls for bias detection, data quality, model monitoring, and stakeholder accountability? A carrier could report a high agreement rate while running a deficient governance program, or maintain exemplary governance processes while an AI model diverges from expert judgment at unacceptable rates. Neither metric alone is sufficient.
Patterns we have seen in recent carrier filings suggest the industry is converging on a two-layer governance reporting model. The first layer consists of quantitative KPIs, agreement rates chief among them, reported at regular intervals and benchmarked against thresholds. The second layer consists of qualitative governance documentation, following the impact assessment model, updated annually or upon material model changes. The NAIC's evaluation tool effectively tests both layers through its four exhibits: Exhibit A inventories AI usage (quantitative scope), Exhibit B assesses governance risk (qualitative process), Exhibit C probes high-risk system performance (quantitative metrics), and Exhibit D examines data practices (qualitative and quantitative).
Hartford's approach also illustrates the governance documentation challenge that agreement rates partially solve. Publishing an impact assessment requires significant internal coordination across legal, compliance, data science, actuarial, and executive teams. Agreement rates, by contrast, can be computed from a single structured evaluation exercise and reported as a dashboard metric. The combination of periodic agreement-rate monitoring (automated, frequent, quantitative) with annual impact assessments (manual, comprehensive, qualitative) may represent the governance equilibrium that regulators and carriers can sustain operationally.
Measurement Challenges: Selection Bias, Thresholds, and the 12% Disagreement Boundary
Agreement rates are not a governance silver bullet. Several measurement challenges require careful design to prevent the metric from becoming misleading or gameable.
Selection bias in evaluation samples. AIG's 100-claim evaluation was not drawn from a random sample of all claims processed. The claims selected for fraud review are already a filtered subset, likely with higher fraud prevalence than the general population. An 88% agreement rate on flagged claims may not generalize to the full claims universe, where the fraud base rate is much lower. Carriers designing agreement-rate frameworks must specify the sampling methodology and report results stratified by claim complexity, line of business, and risk tier. An aggregate agreement rate that mixes straightforward auto glass claims with complex commercial liability disputes obscures more than it reveals.
Threshold effects and decision boundaries. Most AI models produce probability scores rather than binary classifications. A fraud detection model might assign a 0.72 fraud probability to a claim. Whether that claim is classified as "fraudulent" depends on the threshold the carrier sets. At a 0.50 threshold, the claim is flagged; at a 0.75 threshold, it is not. Agreement rates are sensitive to threshold selection. A carrier could improve its reported agreement rate by adjusting the classification threshold to match the base rate of human adjuster decisions, without improving the underlying model. Governance frameworks must therefore specify fixed threshold protocols or report agreement rates across multiple thresholds to prevent threshold gaming.
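The sketch below illustrates threshold sensitivity using a handful of made-up probability scores and adjuster labels: the same model scores and the same adjuster decisions yield different agreement rates depending solely on where the classification cut is drawn. All scores, labels, and thresholds are illustrative assumptions.

```python
# Illustrative fraud-probability scores from a model and the corresponding
# adjuster determinations (True = fraudulent) for ten claims.
scores = [0.05, 0.12, 0.31, 0.48, 0.52, 0.58, 0.66, 0.72, 0.81, 0.95]
adjuster = [False, False, False, False, True, False, True, True, True, True]

def agreement_at(threshold: float) -> float:
    """Share of claims where the thresholded model matches the adjuster."""
    model = [score >= threshold for score in scores]
    matches = sum(m == a for m, a in zip(model, adjuster))
    return matches / len(scores)

for threshold in (0.50, 0.55, 0.75):
    print(f"threshold {threshold:.2f}: agreement = {agreement_at(threshold):.0%}")

# threshold 0.50: agreement = 90%
# threshold 0.55: agreement = 80%
# threshold 0.75: agreement = 70%
```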
The disagreement boundary. AIG's 12% disagreement rate (the claims where Claude and the adjuster reached different conclusions) raises the most consequential governance question: what happens at the boundary? In 12 of 100 evaluated claims, the AI and the human disagreed. Was the adjuster right and the AI wrong? Was the AI right and the adjuster wrong? Did the disagreements cluster around edge cases where reasonable professionals could legitimately differ? The answers to these questions determine whether 88% agreement is a governance strength or a governance liability.
Our tracking of AI model validation programs suggests the disagreement boundary is where the most actuarially significant information lives. If the AI flags claims as fraudulent that adjusters classify as legitimate, the false-positive rate on fraud detection directly affects claims expense ratios and customer satisfaction. If the AI classifies claims as legitimate that adjusters flag as fraudulent, the false-negative rate affects incurred losses and potentially enables fraud leakage. Carriers reporting agreement rates without disaggregating the direction of disagreement leave regulators and actuaries unable to assess the financial impact of the metric.
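A minimal sketch of how a carrier might disaggregate disagreement direction when reporting an agreement rate; the counts are illustrative assumptions, not AIG's split, which was not disclosed.

```python
from collections import Counter

# Illustrative paired determinations: first letter is the AI call, second the
# adjuster call, with "F" = fraudulent and "L" = legitimate. Counts are assumptions.
pairs = ["FF"] * 14 + ["LL"] * 74 + ["FL"] * 7 + ["LF"] * 5

tally = Counter(pairs)
n = len(pairs)
print(f"agreement rate:         {(tally['FF'] + tally['LL']) / n:.0%}")
print(f"AI flags, human clears: {tally['FL'] / n:.0%}  (drives false-positive expense)")
print(f"human flags, AI clears: {tally['LF'] / n:.0%}  (drives fraud leakage risk)")
```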
Evaluator reliability. Agreement rates are only as meaningful as the human baseline. If the professional adjuster reviewing the 100 claims has idiosyncratic judgment patterns (unusually aggressive or conservative in fraud flagging), the agreement rate measures concordance with that individual's perspective rather than with professional consensus. Inter-rater reliability studies, where multiple adjusters independently review the same claims and their consistency is measured before comparing against AI outputs, would substantially strengthen the metric's governance credibility. Published inter-rater reliability data for insurance claims adjudication is scarce, which creates both a research gap and an opportunity for carriers to differentiate their governance programs.
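One way to establish the human baseline before scoring the AI is to have several adjusters review the same claims independently and compare their determinations. The sketch below computes pairwise raw agreement among three hypothetical adjusters with made-up labels; a production study would likely use a formal multi-rater statistic such as Fleiss' Kappa rather than raw pairwise agreement.

```python
from itertools import combinations

# Hypothetical independent determinations (True = fraudulent) from three
# adjusters reviewing the same ten claims.
adjusters = {
    "adjuster_a": [True, False, False, True, False, False, True, False, False, True],
    "adjuster_b": [True, False, False, False, False, False, True, False, True, True],
    "adjuster_c": [True, False, True, True, False, False, True, False, False, True],
}

def pairwise_agreement(x: list[bool], y: list[bool]) -> float:
    """Share of claims on which two adjusters reach the same determination."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

for (name_1, labels_1), (name_2, labels_2) in combinations(adjusters.items(), 2):
    print(f"{name_1} vs {name_2}: {pairwise_agreement(labels_1, labels_2):.0%}")

# adjuster_a vs adjuster_b: 80%
# adjuster_a vs adjuster_c: 90%
# adjuster_b vs adjuster_c: 70%
```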
Designing Agreement-Rate Frameworks for Pricing and Reserving
The immediate application of agreement rates in fraud detection has a natural extension to other actuarial functions. Pricing and reserving decisions increasingly incorporate AI model outputs, and the same governance question applies: does the AI reach conclusions consistent with qualified actuarial judgment?
For pricing, an agreement-rate framework might measure how often an AI-assisted rate indication falls within an acceptable range of the indication produced by the credentialed actuary. "Acceptable range" could be defined as within 2 percentage points of the actuarial rate indication, or within one standard deviation of historical indication variability for the same class and territory. This approach addresses the ASOP No. 56 requirement that actuaries using models based on algorithms or data must evaluate whether the model is appropriate for the intended purpose. An agreement rate provides a structured, repeatable way to document that evaluation.
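Under the definition sketched above (agreement meaning the AI-assisted indication falls within 2 percentage points of the actuarial indication), a pricing agreement-rate calculation might look like the following. The class names, indication values, and tolerance are hypothetical.

```python
# Hypothetical rate indications (as percentage changes) for several classes:
# the credentialed actuary's indication vs. the AI-assisted indication.
indications = {
    "class_101": {"actuary": 4.5, "ai": 5.1},
    "class_102": {"actuary": -1.0, "ai": 0.4},
    "class_103": {"actuary": 7.2, "ai": 7.0},
    "class_104": {"actuary": 2.8, "ai": 6.3},
}

TOLERANCE_PCT_POINTS = 2.0  # "acceptable range" assumption from the text

agreements = {
    name: abs(vals["ai"] - vals["actuary"]) <= TOLERANCE_PCT_POINTS
    for name, vals in indications.items()
}

for name, agrees in agreements.items():
    print(f"{name}: {'within tolerance' if agrees else 'escalate to actuarial review'}")
print(f"pricing agreement rate: {sum(agreements.values()) / len(agreements):.0%}")
```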
For reserving, agreement rates could measure concordance between AI-assisted case reserve estimates and the reserves set by experienced claims examiners. This application is particularly relevant for carriers deploying AI to accelerate initial reserve setting on new claims, where the AI produces a preliminary estimate that a human examiner may adjust. Tracking the frequency and magnitude of human adjustments to AI-suggested reserves creates a natural agreement-rate metric with direct financial implications for IBNR estimation.
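For the reserving application, the natural raw material is a log of AI-suggested reserves alongside the examiner's final figures. The sketch below computes both the frequency and the banded magnitude of human adjustments, with agreement defined as an adjustment of 10% or less; the claims, dollar amounts, and the 10% band are all illustrative assumptions.

```python
# Hypothetical AI-suggested initial case reserves vs. the examiner's final
# reserves (in dollars) for five new claims.
reserves = [
    {"claim": "CLM-001", "ai": 12_000,  "examiner": 12_000},
    {"claim": "CLM-002", "ai": 45_000,  "examiner": 41_000},
    {"claim": "CLM-003", "ai": 8_500,   "examiner": 8_900},
    {"claim": "CLM-004", "ai": 150_000, "examiner": 210_000},
    {"claim": "CLM-005", "ai": 3_200,   "examiner": 3_200},
]

ADJUSTMENT_BAND = 0.10  # assumed: adjustments of 10% or less count as agreement

adjusted = 0
within_band = 0
for row in reserves:
    relative_change = abs(row["examiner"] - row["ai"]) / row["ai"]
    adjusted += relative_change > 0
    within_band += relative_change <= ADJUSTMENT_BAND

print(f"claims adjusted by examiner: {adjusted}/{len(reserves)}")
print(f"reserve agreement rate (<= {ADJUSTMENT_BAND:.0%} adjustment): "
      f"{within_band / len(reserves):.0%}")
```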
ASOP No. 56 requires actuaries to evaluate the data used, the reasonableness of assumptions, and the appropriateness of the model for its intended use. Agreement-rate monitoring provides evidence for all three requirements. If an AI pricing model consistently agrees with credentialed actuaries at a 92% rate on class-level indications, that concordance supports the model's appropriateness. If the agreement rate drops to 78% for a specific territory or coverage type, that divergence signals a scope limitation requiring actuarial intervention. The metric converts a subjective "is this model appropriate?" assessment into a quantitative monitoring protocol with defined escalation triggers.
The SOA's 2026 research initiative on agentic AI for actuarial workflows explicitly calls for governance blueprints covering model risk management, monitoring, human-in-the-loop controls, documentation, and bias assessment. Agreement rates fit naturally into this blueprint as the primary monitoring metric, with threshold breaches triggering the human-in-the-loop escalation that the SOA framework requires. The CAS AI Primer similarly reinforces the profession's role in ensuring AI outputs are "appropriate, explainable, and aligned with business and regulatory expectations," and agreement rates provide the quantitative backbone for proving alignment.
From Voluntary Disclosure to Required Reporting
AIG's disclosure of the 88% agreement rate was voluntary. No regulation required it. No accounting standard mandated it. Zaffino chose to share it because the metric supported AIG's narrative of responsible AI deployment at scale. That voluntary disclosure, however, creates a precedent that may accelerate the timeline toward required reporting.
The mechanism is competitive signaling. Once one top-10 carrier publishes a specific AI governance metric, peer carriers face implicit pressure to match or exceed that disclosure. Lemonade has already set a transparency precedent by reporting in its 2025 10-K that AI Jim handles first notice of loss 96% of the time without human intervention and that approximately 55% of claims are fully automated end to end. Travelers disclosed handling 1.5 million claims in 2025 with 90% of catastrophe claims closed within 30 days. Each disclosure raises the floor for what investors and regulators expect to see from peers.
The regulatory pathway from voluntary to mandatory runs through the NAIC evaluation tool pilot. If the 12-state pilot demonstrates that agreement-rate metrics (or similar concordance measures) provide regulators with actionable governance insight, the nationwide rollout expected at the Fall 2026 meeting could incorporate agreement rates as a recommended or required reporting element for high-risk AI systems. The U.S. Treasury's FS AI RMF further accelerates convergence by creating cross-sector expectations for comparable AI governance metrics. Insurance-specific implementations of the FS AI RMF will need to translate the framework's 230 control objectives into operational metrics. Agreement rates address multiple objectives simultaneously: model validation (does the model agree with experts?), consumer protection (are AI decisions consistent with professional standards?), and monitoring (is the agreement rate stable over time?).
Grant Thornton's 2026 survey finding that 42% of insurers track no AI metrics at all underscores how far the industry must travel. For the 58% that do track metrics, the challenge shifts from measurement to standardization. An agreement rate computed on 100 claims by one adjuster at one carrier is not directly comparable to an agreement rate computed on 10,000 claims by 50 adjusters at a different carrier. Industry-level standardization, likely emerging through the NAIC pilot process and refined through actuarial standard-setting bodies, will determine whether agreement rates become a credible governance KPI or a loosely defined metric carriers can manipulate to show favorable results.
For actuarial teams specifically, the immediate action item is clear. Carriers deploying AI in any function that touches consumer-facing decisions should begin measuring agreement rates now, before regulatory requirements crystallize. The measurement methodology should be documented, the sample design should be defensible, and the results should be reported with confidence intervals and disaggregated by decision type. Carriers that build this infrastructure voluntarily will shape the standards. Carriers that wait for mandates will be measured against standards they had no role in designing.
Further Reading
- Hartford's Algorithmic Impact Assessment Sets the Carrier Transparency Bar - How Hartford operationalized qualitative AI governance documentation ahead of the NAIC evaluation pilot.
- NAIC Four-Tier AI Risk Taxonomy: What the Compliance Framework Means for Insurers - Mapping each tier of the proposed taxonomy to specific insurance use cases and documentation requirements.
- Carrier AI Projects Fail at the Audit Layer, Not the Tech - The 44/24 governance gap and what independent AI audits actually test in carrier deployments.
- NAIC Flags Agentic AI as Insurance's Next Governance Gap - How autonomous AI agent cycles push the boundaries of existing governance frameworks.
- AI Governance Gap in Actuarial Practice - ASOP 56 compliance requirements and model risk management for AI systems in actuarial workflows.
Sources
- AIG Q1 2026 Earnings Call Transcript (May 1, 2026) - Motley Fool
- AI Advancing Faster Than Expected as AIG Builds Multi-Agentic Solution (2026) - Reinsurance News
- NAIC Spring 2026 National Meeting Highlights: H Committee Update (April 2026) - Mayer Brown
- NAIC Expands AI Systems Evaluation Tool Pilot to 12 States (2026) - Fenwick
- NAIC Spring 2026 Regulatory Update (April 2026) - Sidley Austin
- U.S. Treasury Financial Services AI Risk Management Framework (February 2026) - U.S. Department of the Treasury
- FDA AI/ML Device Clearances: Clinical Validation Analysis (2025) - JAMA Network Open
- Interrater Reliability: The Kappa Statistic - PMC / Annals of Family Medicine
- 2026 AI Impact Survey Report: Insurance Edition (2026) - Grant Thornton
- Agentic AI for Actuarial Workflows Research Initiative (2026) - Society of Actuaries
- CAS AI Primer: A Practical Guide for Actuaries (2026) - Casualty Actuarial Society
- AI Collaboration and Risk Management (2026) - The Hartford
- Lemonade 2025 Annual Report / 10-K - SEC Filings