This is the first of three articles analyzing AIG’s granted AI underwriting patents. See also: Inside AIG’s Agentic AI Underwriting Machine for the full overview of AIG’s AI strategy.

Executive Summary

When AIG CEO Peter Zaffino told analysts on the Q3 2025 earnings call that the company had developed a “patent-pending approach called Auto Extract” for its underwriting AI system, it was a notable disclosure. Insurance carriers rarely discuss patent filings in the context of underwriting technology. But AIG was signaling something specific: the company had not only built an AI-assisted underwriting workflow, it had invented novel technical methods worth protecting as intellectual property.

That patent has now been granted. U.S. Patent 12,437,155, titled “Information extraction system for unstructured documents using independent tabular and textual retrieval augmentation,” was filed on January 24, 2025, and granted on October 7, 2025. It is assigned to American International Group, Inc. and lists two inventors: Lei Zhang, based in New York, and Christopher Allen Cirelli, based in Roswell, Georgia, home to the Atlanta Innovation Hub that AIG opened specifically to scale its generative AI operations.

The patent describes a system that solves a fundamental problem in AI-assisted document processing: when you run OCR on an insurance submission PDF that contains both narrative text and financial tables, the resulting output interleaves tabular data with body text in ways that destroy the semantic meaning of both. If you then feed that jumbled output to a large language model for data extraction, the model cannot reliably distinguish table boundaries, associate column headers with the correct data values, or understand the narrative context around the tables. The result is poor extraction accuracy, hallucinated responses, and wasted computational resources from repeated prompts.

AIG’s patent addresses this by separating tables from text at the point of OCR ingestion using markdown language patterns, creating independent “table chunks” and “text chunks” with different retrieval strategies for each, building a searchable vector embedding index that preserves the semantic meaning of both data types, and then routing the appropriate chunk type to the LLM based on what the extraction prompt is actually requesting. The system populates an “ontological data store” with the extracted information, a term that maps directly to the Palantir Foundry ontology that AIG has described in its public investor communications.

For actuaries and insurance professionals, the significance is not abstract. This patent describes the technical machinery behind AIG’s ability to process 100% of its private and not-for-profit financial lines submissions without adding underwriters, compress review timelines by more than 5x, improve data accuracy from approximately 75% to over 90%, and scale toward 500,000 annual E&S submissions. The patent explicitly names insurance underwriting as its primary use case, specifically citing directors and officers liability insurance, environmental insurance, and the extraction of financial data from broker correspondence, loss runs, applications, and claim histories.

Perhaps most notably, the patent names Claude, the large language model built by Anthropic, as the publicly available LLM that may be used within the system. This aligns precisely with AIG’s public disclosures about its progression from Claude 2.1 through Claude 4.6, as discussed across multiple earnings calls.

Patent Details

Patent Number: U.S. 12,437,155
Title: Information extraction system for unstructured documents using independent tabular and textual retrieval augmentation
Filed: January 24, 2025
Granted: October 7, 2025
Assignee: American International Group, Inc. (New York, NY)
Inventors: Lei Zhang (New York, NY); Christopher Cirelli (Roswell, GA)
Application No.: 18/831,434
Classification: G06F 40/289 (Natural Language); G06F 40/143
Primary Examiner: Thierry L. Pham

The Problem: Why Standard RAG Fails on Insurance Documents

To understand what AIG patented, it helps to understand why existing approaches to AI-powered document extraction struggle with the types of documents that flow through commercial insurance underwriting.

A typical E&S submission package might include a completed application form (often a PDF with both narrative questions and tabular schedules), one or more years of loss run reports (dense financial tables), broker cover letters (narrative text with embedded references to coverage terms and pricing), financial statements (tables with footnotes, sometimes in non-standard units), and supplemental questionnaires specific to the risk class. These documents arrive in varied formats: true PDFs, scanned images, spreadsheets exported to PDF, and occasionally native Word or Excel files. The first step in any automated extraction workflow is to convert them to machine-readable text, typically using optical character recognition.

The problem, as AIG’s patent describes in its background section, is that OCR output from a document containing both text and tables interleaves the two data types in ways that break semantic meaning. A table that appears alongside explanatory text on a PDF page gets converted into a single stream of characters where table cell values are mixed with paragraph text. When this mixed output is fed into a retrieval-augmented generation (RAG) system, several failure modes emerge.

First, the vector embeddings used for semantic search become unreliable. Embedding models are designed to capture the semantic meaning of text passages. When a text chunk contains fragments of a financial table mixed with narrative description, the resulting vector embedding represents neither the table data nor the narrative accurately. The RAG system then fails to retrieve the correct document sections when a prompt asks for specific data.

Second, even when the correct document sections are retrieved, the LLM itself struggles to interpret the mixed content. The patent notes that the LLM may not recognize table boundaries, cannot reliably associate row and column headers with data values, and may either fail to return the requested data or, worse, hallucinate a plausible but incorrect response.

Third, the computational costs multiply. Every failed extraction attempt requires re-prompting the LLM with adjusted parameters or additional document context, increasing token consumption, processing time, and API costs.

For insurance underwriting specifically, these failures are particularly costly. A data accuracy rate of 75%, which is what AIG reported before deploying its AI system, means that one in four extracted data points requires manual correction by an underwriter. Across the 370,000+ submissions AIG currently processes, even a small improvement in extraction accuracy eliminates an enormous amount of manual rework.

The Solution: Independent Tabular and Textual Processing Pipelines

The core invention in Patent 12,437,155 is an architecture that separates the processing of tabular data and document text into independent pipelines from the point of document ingestion all the way through to LLM-based extraction.

Step 1: OCR and Markdown-Based Separation

The system begins by processing incoming documents through an OCR system that returns plain text with tables marked using a markdown language. The markdown uses specific symbols to indicate table boundaries: pipe characters to delimit columns within a row, newline characters to separate rows, and hyphen characters to distinguish header rows from data rows. The patent describes a “markup decoder” component that uses regular expressions to identify these markdown patterns in the OCR output and separate the text into two portions: a first portion containing the tables and a second portion containing the narrative document text.

This separation step is simple but critical. By identifying and extracting tables before any further processing occurs, the system ensures that downstream components never have to contend with the mixed-content problem that plagues standard RAG implementations.
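As a rough sketch of this separation step, assuming the OCR stage emits standard markdown table syntax (pipe-delimited rows, hyphen separator rows), a regex-based splitter might look like the following. The pattern and function names are illustrative, not AIG's actual implementation:

```python
import re

# Lines that begin and end with a pipe are treated as markdown table rows;
# the hyphen separator row (| --- | --- |) matches this pattern as well.
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")

def separate_tables(ocr_text: str):
    """Split markdown-annotated OCR output into table chunks and body text."""
    tables, text_lines, current = [], [], []
    for line in ocr_text.splitlines():
        if TABLE_ROW.match(line):
            current.append(line)          # extend the current table run
        else:
            if current:                   # a table run just ended
                tables.append("\n".join(current))
                current = []
            if line.strip():
                text_lines.append(line)   # narrative text
    if current:                           # document ended inside a table
        tables.append("\n".join(current))
    return tables, "\n".join(text_lines)
```

Each contiguous run of pipe-delimited lines becomes one table portion, and everything else flows into the text portion, so neither downstream pipeline ever sees mixed content.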

Step 2: Independent Chunking

After separation, each data type is processed by its own chunking component. The “table chunker” generates table chunks that may include an entire table or a specified number of rows. Critically, the table chunker can also generate a separate chunk for just the table header row, which typically contains the most semantically meaningful text in a table (column names, property descriptions, units) and is most useful for retrieval purposes.

The “text chunker” generates standard text chunks of configurable length (the patent mentions 500 words, 500 characters, or 1,000 tokens as examples), with optional overlapping to ensure that words near chunk boundaries retain sufficient context for accurate embedding. The patent notes that the optimal chunk length involves a trade-off: longer chunks give the LLM more context but increase computational cost, while shorter chunks may lose semantic meaning in the vector embedding.
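The two chunkers described above can be sketched as follows. The chunk sizes, the choice to repeat the header in each row-group chunk, and the assumption that a markdown table's second line is the hyphen separator row are illustrative details, not specifics from the patent:

```python
def chunk_table(table_md: str, rows_per_chunk: int = 20):
    """Emit the header row as its own chunk, then groups of data rows."""
    lines = table_md.splitlines()
    header, data = lines[0], lines[2:]    # assumes line 1 is the --- separator
    chunks = [header]                     # header-only chunk for retrieval
    for i in range(0, len(data), rows_per_chunk):
        # Repeat the header so the LLM can associate columns with values.
        chunks.append("\n".join([header] + data[i:i + rows_per_chunk]))
    return chunks

def chunk_text(text: str, size: int = 500, overlap: int = 50):
    """Word-based sliding window with overlap, per the patent's examples."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

The overlap parameter directly encodes the trade-off the patent describes: boundary words appear in two adjacent chunks, preserving context at the cost of some duplicated tokens.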

Step 3: Independent Indexing and Search

Both table chunks and text chunks are converted into vector embeddings and stored in a searchable index. However, the patent describes different search strategies for each type. Text chunks are primarily searched using semantic similarity (comparing vector embeddings of the prompt to vector embeddings of the chunks using distance metrics). Table chunks can be searched using semantic similarity of their headers, keyword-based search of row and column headers, or a weighted combination of both.

This dual-search architecture is significant because it recognizes that tabular data and narrative text respond to different retrieval strategies. A prompt asking for “total insured value by location” is better served by a keyword match against table column headers than by semantic similarity against narrative text chunks.

The patent also describes hierarchical retrieval parameters that can be associated with specific prompt templates. If a first, narrow search does not return enough relevant chunks (failing to satisfy a “retrieval criterion”), the system automatically broadens the search using secondary parameters. This ensures that the LLM receives sufficient context even when the document set is sparse or the initial search terms are too specific.
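A minimal sketch of the weighted table search and the hierarchical fallback might look like this. The weights, thresholds, index layout, and scoring functions are assumptions for illustration, not the patent's actual parameters:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(prompt: str, header: str):
    """Fraction of prompt keywords found in the table's header row."""
    p = set(prompt.lower().split())
    h = set(header.lower().replace("|", " ").split())
    return len(p & h) / len(p) if p else 0.0

def hybrid_search(prompt, prompt_vec, table_index, w_sem=0.5, w_kw=0.5,
                  threshold=0.6, fallback_threshold=0.3, min_hits=1):
    """Weighted semantic + keyword search over table chunks, with a
    broader secondary search if the retrieval criterion is not met."""
    def run(th):
        hits = []
        for chunk in table_index:  # each: {"header": str, "vec": [...], "body": str}
            score = (w_sem * cosine(prompt_vec, chunk["vec"])
                     + w_kw * keyword_score(prompt, chunk["header"]))
            if score >= th:
                hits.append((score, chunk))
        return sorted(hits, reverse=True, key=lambda h: h[0])

    hits = run(threshold)
    if len(hits) < min_hits:       # retrieval criterion failed: broaden search
        hits = run(fallback_threshold)
    return hits
```

This captures the two ideas in one place: a single score blends header keyword matching with embedding similarity, and a failed narrow search automatically reruns with looser parameters before anything reaches the LLM.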

Step 4: LLM Extraction and Ontological Data Store Population

When the system needs to extract a specific data element, it generates a prompt from a stored template, retrieves the relevant table chunks or text chunks (or both, depending on the retrieval parameters), and sends the combined prompt and chunks to the LLM. The patent specifies that the LLM “may be a publicly available LLM such as Claude,” confirming the Anthropic connection that AIG has described in its public disclosures.

The LLM’s response is validated by a “response validator” that checks whether the output matches expected parameters (numeric type, acceptable range, expected length, etc.) stored with the prompt template. Valid responses populate a “data model, ontological model, an ontological data store” with the extracted information. This ontological data store language maps directly to the Palantir Foundry ontology that AIG has described in earnings calls and investor presentations, the system that Zaffino has called the “digital twin” of AIG’s business.
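The validation step can be sketched as follows. The spec keys and the currency-cleaning rules are hypothetical, since the patent only says the output is checked against parameters such as numeric type, acceptable range, and expected length:

```python
def validate_response(raw: str, spec: dict):
    """Check an LLM response against the expectations stored with the
    prompt template; return the cleaned value, or None to trigger a
    re-prompt. Spec keys ("type", "range", "max_length") are illustrative."""
    if spec.get("type") == "number":
        try:
            # Strip common currency formatting before parsing.
            value = float(raw.replace(",", "").replace("$", "").strip())
        except ValueError:
            return None                   # not numeric: reject
        lo, hi = spec.get("range", (float("-inf"), float("inf")))
        return value if lo <= value <= hi else None
    if spec.get("max_length") and len(raw) > spec["max_length"]:
        return None                       # implausibly long free-text answer
    return raw.strip()
```

A rejected response would feed back into the re-prompting loop the patent describes, rather than silently populating the ontological data store with a hallucinated value.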

Step 5: Traceability Through Chunk Identification

Each chunk carries metadata including a chunk identifier, a document identifier, and a page identifier. When the LLM extracts data, the system can trace exactly which source document and which page contributed to that extraction. If the response validator detects an error or the underwriter questions a result, the traceability metadata allows the user to view the original source material and understand why the LLM produced a given output.
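That traceability metadata can be sketched as a simple structure, with an illustrative citation format (the patent requires a citation or footnote but does not specify its layout):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    # Per the patent, each chunk carries identifiers that let the system
    # trace an extracted value back to its source document and page.
    chunk_id: str
    document_id: str
    page: int
    content: str

def cite(extracted_value, chunks_used):
    """Attach a footnote-style citation built from the chunks the LLM saw,
    so an underwriter can pull up the original source material."""
    refs = sorted({(c.document_id, c.page) for c in chunks_used})
    sources = "; ".join(f"{doc} p.{page}" for doc, page in refs)
    return f"{extracted_value} [source: {sources}]"
```

Because the identifiers travel with every chunk from ingestion onward, the citation can be generated at extraction time rather than reconstructed after the fact.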

The patent notes that in “regulated industries, it may be necessary to include the reference material (e.g., as a footnote or citation) to show that the system is accurately populating the data elements and/or is unbiased.” This language reflects the regulatory reality that actuaries and underwriting officers face: AI-assisted decisions require demonstrable auditability, particularly in the context of the NAIC’s Model Bulletin on the Use of Artificial Intelligence Systems by Insurers, which requires insurers to maintain governance frameworks for algorithmic decision-making.

Mapping Patent Architecture to AIG’s Production System

What makes this patent analysis particularly valuable for insurance professionals is how precisely its technical architecture maps to what AIG has disclosed publicly about its production AI system. The connections are specific and verifiable.

Auto Extract = The Markup Decoder + Table/Text Chunker. When Zaffino described Auto Extract on the Q3 2025 earnings call as “a capability that uses large language models to pull specific structured information from unstructured text, such as documents in multiple formats, websites and conversations,” he was describing the system architecture laid out in Claims 1 through 11 of this patent. The markdown-based separation of tables from text, the independent chunking, and the prompt-driven extraction are all core components of Auto Extract as publicly described.

Ontological Data Store = Palantir Foundry Ontology. The patent’s repeated references to populating an “ontological data store” with extracted data elements correspond directly to AIG’s public statements about building ontologies using Palantir’s Foundry platform. Zaffino told Insurance Journal in August 2025 that AIG’s intent was to “create a digital twin of our business” representing “all key data, processes, business logic and a map of relationships across businesses and functions.” The patent describes the technical mechanism for populating that digital twin with data extracted from underwriting documents.

Claude as the Named LLM. Both patent documents in this family (12,437,155 and its companion 12,437,154) explicitly state that the LLM “may be a publicly available LLM such as Claude.” AIG’s earnings call disclosures confirm the company’s progression from Claude 2.1 through Claude 4.6, making this one of the rare cases where a granted patent names a specific commercial AI model and the patent holder has separately confirmed using that exact model in production.

Insurance Underwriting as the Named Use Case. The patent does not describe its use case in abstract terms. It specifically identifies “the underwriting process of insurance policies,” names “directors and officers liability insurance and/or environmental insurance” as example use cases, and lists “applications, broker correspondence, financials, summary of claims, historical claims filed under business insurance policies (‘Loss Run’) and historical claim losses” as document types the system processes. This is not a general-purpose document extraction patent that happens to be filed by an insurer; it is an insurance underwriting patent that leverages general-purpose AI technology.

The Atlanta Innovation Hub. Co-inventor Christopher Cirelli is listed with an address in Roswell, Georgia, a suburb of Atlanta. AIG opened its Atlanta Innovation Hub specifically to build out its generative AI capabilities. An AIG newsroom article about the hub described it as supporting “AIG’s plans to responsibly scale the use of GenAI to drive significant benefits across our business, starting with underwriting.” The hub’s work was described as applying LLMs “to hundreds of documents that arrive in various forms as unstructured data, reducing inefficiencies and errors while enabling underwriters to focus on more value-added tasks,” language that closely mirrors the patent’s description of its system.

What the Claims Actually Protect

For readers interested in the intellectual property dimension, the patent’s 20 granted claims cover three core methods and one system embodiment.

Claims 1 through 11 cover a method for extracting information from documents. The essential steps are: receiving OCR output with tables in markdown, separating tables from text using the markdown patterns, forming independent table chunks and text chunks, identifying relevant chunks using search criteria, and storing the LLM’s response. Dependent claims add specifics: regular expression-based table identification (Claim 3), vector embedding indexing (Claim 6), keyword search of table headers (Claim 7), document/page identifier traceability (Claims 8-9), PDF as the input format (Claim 10), and populating a data store with extracted elements (Claim 11).

Claims 12 through 19 cover a method for preparing documents for retrieval augmentation, essentially protecting the ingestion pipeline independently from the extraction pipeline.

Claim 20 covers the system itself as a hardware/software embodiment.

The breadth of these claims suggests that AIG is seeking to protect not just its specific implementation but the general architectural approach of separating tabular and textual data for independent processing within a RAG framework applied to document extraction. Any competitor building a similar system for processing insurance submissions would need to evaluate whether their architecture falls within these claims.

Actuarial and Industry Implications

For pricing actuaries: The system’s ability to extract structured data from financial tables at 90%+ accuracy has direct implications for data quality in pricing models. When loss run data, financial statement figures, and application schedules are extracted more reliably, the downstream pricing models built on that data produce more consistent results. Actuaries who have dealt with data quality issues in commercial lines pricing (which is to say, virtually all pricing actuaries) should recognize the significance of a system that automates the most error-prone step in the data pipeline.

For reserving actuaries: The traceability architecture, which is expanded significantly in the companion patent (12,437,154), creates an audit trail from extracted data back to source documents. This is relevant for reserve analyses that rely on claim data extracted from loss run reports, particularly when regulators or auditors request documentation of data sources.

For actuaries in leadership roles: AIG has now patented core components of its AI underwriting infrastructure. This is a competitive signal. Carriers evaluating whether to build similar systems need to consider not just the technology investment but the IP landscape. AIG’s patents may constrain how competitors architect their own document extraction systems, particularly if those competitors follow a similar approach of separating tabular and textual data within a RAG framework.

For the profession broadly: The explicit mention of “regulated industries” and the need for citation and auditability within the patent text reflects an awareness that AI-assisted underwriting operates within a compliance framework. The system is designed from the ground up to support the kind of model governance and documentation that ASOP No. 56 (Modeling) and the NAIC Model Bulletin on AI require. This stands in contrast to implementations where auditability is bolted on after the fact.

What Comes Next in This Series

This patent covers the foundational architecture: how documents are ingested, separated, chunked, indexed, and retrieved for LLM-based extraction. The companion patents extend this architecture in two important directions.

Patent #12,437,154 goes deeper on the traceability and error control layer, detailing how the system detects LLM hallucinations, maintains version histories across document updates, generates citations for regulatory compliance, and creates audit trails that allow underwriters and compliance teams to verify every extracted data point.

Patent #12,511,320 addresses the specific challenge of processing complex spreadsheets with multiple tables, introducing a chain-of-thought prompting methodology that guides the LLM through a step-by-step reasoning process to identify, separate, and reconstruct individual tables from a single grid-based document, including unit conversion for financial data.

Together, the three patents describe an end-to-end AI underwriting document processing system that is, to our knowledge, the most technically detailed public description of how any major insurer is using large language models in production underwriting.

Sources

  1. U.S. Patent No. 12,437,155, “Information extraction system for unstructured documents using independent tabular and textual retrieval augmentation,” filed Jan. 24, 2025, granted Oct. 7, 2025. Assignee: American International Group, Inc. Inventors: Lei Zhang, Christopher Cirelli. patents.justia.com
  2. AIG Q3 2025 Earnings Conference Call, as reported by Reinsurance News, November 7, 2025. reinsurancene.ws
  3. AIG Q4 2025 Earnings Call Transcript. Yahoo Finance, February 11, 2026. finance.yahoo.com
  4. “AIG CEO Zaffino Highlights Integration of GenAI to Create Digital Twin of Business.” Insurance Journal, August 12, 2025. insurancejournal.com
  5. “AIG Leaders Discuss GenAI and the Atlanta Innovation Hub.” AIG Newsroom. aig.com
  6. “AIG: Turning One Human Underwriter Into Five, ‘Turbocharging’ E&S.” Carrier Management, April 28, 2025. carriermanagement.com
  7. “AIG to Roll Out AIG Assist Across Lexington by End of 2025.” Coverager, December 1, 2025. coverager.com
  8. “Anthropic Unveils Claude’s New Finance-Focused Platform.” Fintech Global, July 17, 2025. fintech.global
  9. “AIG Leans on Generative AI to Speed Underwriting.” CIO Dive, November 6, 2024. ciodive.com
