This is the third and final article analyzing AIG’s granted AI underwriting patents. Read Patent #1: How AIG Separates Tables from Text and Patent #2: The Traceability and Hallucination Detection Layer for the earlier components, and Inside AIG’s Agentic AI Underwriting Machine for the full strategy overview.
Executive Summary
The first two patents in AIG’s AI underwriting portfolio describe how the system separates tabular data from document text for independent processing (Patent #12,437,155) and how it traces every extraction back to its source material for auditability (Patent #12,437,154). Both patents assume that the incoming document is a PDF or similar image-based file where tables are embedded within pages of text.
This third patent addresses a different and arguably harder problem: what happens when the entire document is a spreadsheet?
U.S. Patent 12,511,320, titled “Retrieval augmentation system for unstructured tabular documents,” was filed on June 6, 2025, and granted on December 30, 2025. It is assigned to American International Group, Inc. and lists the same two inventors: Lei Zhang (New York) and Christopher Allen Cirelli (Roswell, Georgia). Filed roughly five months after the earlier pair, it appears to represent a subsequent phase of development, as AIG’s team encountered the specific challenges of spreadsheet processing at scale.
The core problem is deceptively simple. Insurance submissions frequently include spreadsheets (Excel files, Google Sheets exports, or spreadsheets converted to PDF) that contain multiple tables on a single sheet. These tables may be unrelated, formatted differently, positioned arbitrarily on the grid, and accompanied by metadata (units, footnotes, configuration parameters) that appears in non-standard locations. When these spreadsheets are converted to text for LLM processing, the result is a single stream of markdown that merges all tables together, filling the gaps between them with NULL or NaN values. The LLM cannot determine where one table ends and another begins, cannot associate footnotes with the correct table, and cannot distinguish between data rows and parameter settings.
Any actuary or underwriter who has ever received a broker’s custom Excel workbook with financial schedules, loss triangles, location lists, and coverage summaries all on the same sheet will immediately recognize this problem. Traditional rule-based extraction systems cannot handle the variability. Each spreadsheet may have a unique layout created by a different person with different formatting preferences.
AIG’s patent solves this with two innovations. The first is a chain-of-thought prompting methodology: rather than asking the LLM to extract data from the entire spreadsheet at once, the system constructs a multi-step prompt that first asks the LLM to count how many individual tables exist, then identify each table’s boundaries, then extract metadata (headers, units, footnotes) for each table, and finally reconstruct each table independently in a clean markdown format. By chaining these steps together, each building on the previous answer, the LLM produces more accurate and consistent results than it would from a single monolithic extraction prompt.
The second innovation is a hybrid text-and-vision architecture. The patent describes a system that can process documents through either a text-based LLM path or an image-based multi-modal language model (MMLM) path, and can switch between them dynamically. If the text-based path fails for a given document or chunk (due to OCR errors, complex layouts, or handwritten annotations), the system can fall back to the MMLM, which processes the actual page image and can interpret spatial relationships, visual cues, and even handwritten selections on forms. This hybrid approach ensures that the system maintains extraction accuracy across the full range of document types that arrive in commercial insurance submissions.
For an industry where financial data accuracy drives pricing adequacy, reserve estimates, and portfolio construction, the ability to reliably process complex spreadsheets at scale is not an incremental improvement. It is the difference between a system that works on clean, well-formatted documents and one that works on the messy, inconsistent, real-world documents that actually flow through commercial insurance underwriting.
Patent Details
| Patent Number | U.S. 12,511,320 |
|---|---|
| Title | Retrieval augmentation system for unstructured tabular documents |
| Filed | June 6, 2025 |
| Granted | December 30, 2025 |
| Assignee | American International Group, Inc. (New York, NY) |
| Inventors | Lei Zhang (New York, NY); Christopher Allen Cirelli (Roswell, GA) |
| Application No. | 19/231,394 |
| Classification | G06F 16/334; G06F 16/31; G06F 40/143 |
| Primary Examiner | Etienne P. Leroux |
The Problem: Why Spreadsheets Break Standard AI Extraction
The two earlier AIG patents solve the problem of extracting data from PDFs that contain both text and tables on the same page. This third patent addresses a fundamentally different challenge: documents where the entire content is tabular, but the tables themselves are unstructured.
The patent’s background section describes the problem with a clarity that reflects real production experience. Spreadsheets are popular across insurance because their grid layout allows humans to interpret large amounts of data quickly. But the same flexibility that makes spreadsheets useful for humans makes them treacherous for automated systems.
A single spreadsheet may contain multiple independent tables. There is no standard format for how these tables are positioned. Document creators leave arbitrary amounts of whitespace between tables. Tables may not align in either dimension. Column widths, header styles, and data formats can vary from table to table within the same sheet. Metadata like units of measurement may appear as column headers, footnotes below a table, a note in a separate cell, or a small configuration table at the top of the sheet. Financial documents frequently use footnotes to indicate that all values are in thousands or millions, or to specify currency, and these footnotes may be anywhere on the page.
When such a spreadsheet is converted to text (either through OCR if it was exported to PDF, or through direct file parsing for native Excel/Sheets formats), the output is a single stream of markdown that represents every cell in the smallest rectangle encompassing all filled cells. Empty cells between tables are filled with NULL or NaN placeholders. The patent notes several failure modes that result.
The semantic meaning of individual tables is diluted or destroyed in the combined markdown, causing vector embeddings to be unreliable for retrieval. The LLM cannot determine table boundaries and may attempt to interpret the entire sheet as one massive table, associating column headers from one table with data rows from another. The NULL/NaN values between tables overwhelm the LLM and lead to incorrect extractions or hallucinated outputs. Traditional rule-based approaches fail because each spreadsheet may use a completely different layout. OCR systems, even when they can identify individual tables surrounded by text in a PDF, struggle when the entire document is tabular data in a custom layout.
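The flattening failure is easy to reproduce. The sketch below uses a hypothetical two-table sheet (a loss-run table and a small parameter table, separated by blank rows); reading the used range as a single grid, the way a naive converter would, yields one frame riddled with NaN placeholders in which the two tables are indistinguishable.

```python
import pandas as pd

# Hypothetical sheet: a loss-run table (rows 0-2) and an unrelated
# parameter table (rows 5-6), separated by blank rows. A naive converter
# reads the whole used range as ONE frame, padding the gaps with NaN --
# exactly the merged stream the patent describes.
grid = [
    ["Policy Year", "Paid Loss", "Incurred Loss", None],
    ["2022", 1200, 1500, None],
    ["2023", 800, 950, None],
    [None, None, None, None],
    [None, None, None, None],
    [None, None, "Currency", "USD"],
    [None, None, "Values in", "thousands"],
]
df = pd.DataFrame(grid)

# Any markdown rendered from this merged frame carries every one of
# these placeholder cells into the LLM's context.
nan_cells = int(df.isna().sum().sum())
print(f"{nan_cells} NaN cells in the merged grid")  # 15 placeholder cells
```

Note that the “Values in thousands” footnote lands in an arbitrary cell, with nothing structural tying it to the loss-run table above it, which is the metadata-association failure the chain-of-thought prompt is designed to fix.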
In insurance underwriting, these spreadsheets are everywhere. Schedule of values for property insurance (location addresses, building values, construction types, occupancies, all in a single spreadsheet with varying layouts). Loss run summaries with separate tables for each policy year. Financial statement workbooks with income statements, balance sheets, and ratio analyses on the same sheet. Broker-compiled submission packages that aggregate data from multiple sources into a single Excel file.
The Solution: Chain-of-Thought Prompting for Table Identification
The patent’s primary technical innovation is a structured prompting methodology that breaks the complex task of multi-table extraction into a sequence of manageable steps that build on each other.
The Chain-of-Thought Prompt Architecture
Rather than providing the LLM with the entire spreadsheet markdown and asking it to extract specific data points (which fails for the reasons described above), the system first uses the LLM to identify and separate the individual tables before any data extraction occurs.
The patent describes a chain-of-thought prompt with four linked steps.
Step 1: Quantify. The prompt asks the LLM to determine how many individual, independent tables exist in the document. The patent specifies that this request should be precise, indicating that the result should be a single number, to prevent the LLM from interpreting “quantify” in some other way (such as measuring the amount of data in each table). This counting step anchors the entire chain. Every subsequent step can reference this count, and the LLM is encouraged to maintain consistency between its count and its identifications.
Step 2: Identify. The prompt asks the LLM to identify each individual table. The patent notes that this step may explicitly reference the count from Step 1, for example by including language like “please identify the same number of tables as those counted previously.” This cross-referencing between steps is critical to the chain-of-thought approach: by making each step dependent on the previous answer, the system pushes the LLM toward internally consistent results.
Step 3: Extract metadata. For each identified table, the prompt asks the LLM to extract metadata: column headers, row headers, units, footnotes, descriptions, titles, and relationships between tables. The patent notes that this step may direct the LLM to pay particular attention to the area just outside the boundaries of each table, where metadata like footnotes and unit indicators typically appears. The prompt may also instruct the LLM to consider whether small tables near larger tables are actual data tables or configuration parameters that apply to the larger tables.
Step 4: Reconstruct. The prompt asks the LLM to reconstruct each identified table independently using a clean markdown format. The prompt may include a definition of the desired markdown syntax (pipe characters for column delimiters, newline characters for row breaks) or provide an example of the expected output format. Critically, this step suggests using the metadata extracted in Step 3, ensuring that the reconstructed tables include their proper headers, units, and context.
The patent emphasizes that these steps are designed as a chain-of-thought prompt, meaning they are engineered to “encourage a step-by-step reasoning process with steps that are connected to the previous step.” Each step builds on the foundation laid by the earlier steps. The LLM breaks down the complex task of understanding a multi-table spreadsheet into “manageable steps that help ensure complete analysis and accurate results.”
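The four steps can be sketched as a prompt-construction function. The wording below is illustrative, since the patent does not publish AIG’s production prompt text, but it shows the structural features the claims emphasize: a single-number answer in Step 1, an explicit back-reference to that count in Step 2, attention to the area just outside table boundaries in Step 3, and reuse of Step 3’s metadata in Step 4.

```python
# Minimal sketch of the four-step chain-of-thought prompt described in
# the patent. The exact wording is an illustrative assumption.

def build_table_separation_prompt(sheet_markdown: str) -> str:
    steps = [
        # Step 1: Quantify -- demand a single number so the LLM cannot
        # reinterpret "quantify" as measuring the amount of data.
        "Step 1: Count how many individual, independent tables exist in "
        "the document below. Answer with a single number.",
        # Step 2: Identify -- explicitly reference the count from Step 1
        # to push the LLM toward internally consistent results.
        "Step 2: Identify each table and its boundaries. Identify the "
        "same number of tables as you counted in Step 1.",
        # Step 3: Extract metadata -- look just outside each boundary,
        # where footnotes and unit indicators typically sit.
        "Step 3: For each identified table, extract its metadata: column "
        "headers, row headers, units, footnotes, titles, and any "
        "relationships between tables. Pay particular attention to the "
        "area just outside each table's boundaries, and decide whether "
        "small nearby tables are data or configuration parameters.",
        # Step 4: Reconstruct -- clean markdown, using Step 3's metadata.
        "Step 4: Reconstruct each table independently in markdown, using "
        "'|' as the column delimiter and a newline for each row, and "
        "include the headers and units extracted in Step 3.",
    ]
    return "\n\n".join(steps) + "\n\nDocument:\n" + sheet_markdown

prompt = build_table_separation_prompt("| a | b |\n| 1 | 2 |")
```

A single call to the LLM with this chained prompt replaces what would otherwise be one monolithic “extract everything” request.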
Unit Standardization
The patent includes a dedicated “unit standardizer” component that addresses a problem specific to financial documents. Different tables within the same spreadsheet may use different units (dollars vs. thousands of dollars vs. millions). A single table may have units specified in a footnote rather than in column headers. The patent describes a system that uses the metadata extracted by the chain-of-thought prompt (Step 3) to identify units, apply conversion factors, and standardize all values to a preferred unit system before the data enters the retrieval index.
The patent specifically calls out financial documents as a use case: “financial tables may often have a note indicating that all numbers are in thousands of dollars, millions of dollars, or the prevailing currency associated with the table.” In insurance underwriting, where premium volumes, loss amounts, and insured values can span orders of magnitude, a system that misses a “values in thousands” footnote can produce extraction errors of 1,000x, errors that would cascade into pricing models and bind decisions.
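A minimal sketch of what a unit standardizer can look like, assuming the Step 3 metadata includes the footnote text. The scale phrases, regex matching, and base-unit convention here are illustrative assumptions, not details from the patent.

```python
import re

# Illustrative scale factors for the footnote phrases common in
# financial tables ("all values in thousands of dollars", etc.).
SCALE_FACTORS = {"thousands": 1_000, "millions": 1_000_000, "billions": 1_000_000_000}

def detect_scale(footnote: str) -> int:
    """Return the multiplier implied by a unit footnote (1 if none found)."""
    for phrase, factor in SCALE_FACTORS.items():
        if re.search(rf"\b{phrase}\b", footnote, re.IGNORECASE):
            return factor
    return 1

def standardize(values, footnote: str):
    """Rescale raw table values to base units before indexing."""
    factor = detect_scale(footnote)
    return [v * factor for v in values]

# A loss run annotated "All values in thousands of USD":
paid_losses = standardize([1200, 800], "All values in thousands of USD")
# 1200 -> 1_200_000: missing this footnote is exactly the 1,000x error
# described above, cascading into pricing models and bind decisions.
```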
From Reconstructed Tables to Retrieval Index
After the chain-of-thought prompt has identified, separated, and reconstructed individual tables, each table feeds into the same downstream pipeline described in the first patent: individual table chunks are created, vector embeddings are generated from the table headers and metadata, and the chunks are indexed for retrieval. When a subsequent extraction prompt asks for a specific data element, the system retrieves only the relevant table (or table chunk), not the entire multi-table spreadsheet. This targeted retrieval is what makes accurate extraction possible from documents that would otherwise overwhelm the LLM.
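The targeted-retrieval step can be sketched as follows. A bag-of-words cosine similarity stands in for the real embedding model (which the patent does not specify), and the two indexed chunks are hypothetical; the point is that the extraction prompt pulls back only the one relevant table chunk, not the whole multi-table sheet.

```python
from collections import Counter
import math

# Toy embedding: bag-of-words vector over header/metadata text.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index: one entry per reconstructed table, embedded from its headers
# and the metadata extracted in Step 3 (illustrative chunk names).
index = {
    "loss_runs": embed("policy year paid loss incurred loss thousands USD"),
    "locations": embed("location address building value construction occupancy"),
}

def retrieve(query: str) -> str:
    """Return the single closest table chunk for an extraction prompt."""
    q = embed(query)
    return max(index, key=lambda k: cosine(q, index[k]))

best = retrieve("What was the paid loss for policy year 2023?")
print(best)  # the loss-run chunk, not the entire spreadsheet
```

In production this would use the vector store and embedding model described in the first patent; the retrieval logic is the same.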
The Hybrid Architecture: Text Path and Vision Path
The second major innovation in Patent 12,511,320 is a dual-path processing architecture that allows the system to switch between text-based and image-based processing at every stage: ingestion, indexing, and extraction.
Text-Based Path
The text path processes documents through OCR (for image-based files) or direct parsing (for native spreadsheet formats), generates markdown, creates text chunks, and indexes them using vector embeddings. For extraction, text chunks are retrieved and provided to a text-based LLM (the patent again names Claude as an example). This is the computationally cheaper path and works well for documents with clean text, standard layouts, and few visual elements.
Image-Based Path (Multi-Modal Language Model)
The image path processes documents as images, using a multi-modal language model (MMLM) that can interpret both text and visual information. For ingestion, the MMLM can summarize a page image, and that summary is then embedded and indexed. For extraction, the actual page image (converted to PNG or similar format) is provided to the MMLM alongside the extraction prompt.
The MMLM path is significantly more computationally expensive than the text path. But the patent describes specific scenarios where it is necessary or advantageous.
Handwritten content. Insurance forms, questionnaires, and applications may include handwritten responses. The MMLM can interpret handwriting, recognize selections (circled answers, checked boxes, filled bubbles), and distinguish handwritten annotations from printed text. The patent describes this capability in detail, noting that the MMLM can determine whether a respondent marked an answer by circling text, filling in a shape, or placing a check mark.
Complex layouts where OCR fails. When the text-based path produces a low “comprehension score” (a metric that the text embedding model can generate to indicate whether the OCR output has coherent semantic meaning), the system can automatically switch to the MMLM path. This fallback mechanism ensures that complex or poorly formatted documents do not silently fail.
Visual context that text cannot capture. Some information is conveyed through spatial relationships on a page: the position of a table relative to its title, the grouping of related data by visual proximity, the layout of a multi-column document. The MMLM can interpret these spatial relationships in ways that a text-based LLM processing a linear stream of markdown cannot.
Dynamic Path Selection
The patent describes a system that can independently choose the text path or image path for each stage (ingestion and extraction) and for each document or even each page within a document. This granularity is important because many insurance submission packages include a mix of document types: some pages are clean text, others contain complex tables, and still others include handwritten annotations. The system can use the cheap text path for straightforward pages and reserve the expensive MMLM path for the pages that require it.
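The routing decision can be sketched per page. The 0.6 threshold, the page fields, and the handwriting flag are illustrative assumptions; the patent describes the comprehension score and the fallback behavior without publishing concrete values.

```python
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    comprehension_score: float  # coherence of the OCR/markdown output
    has_handwriting: bool = False

COMPREHENSION_THRESHOLD = 0.6  # illustrative cutoff, not from the patent

def select_path(page: Page) -> str:
    """Route a page to the cheap text path or the expensive MMLM path."""
    if page.has_handwriting:
        return "mmlm"  # vision model reads marks, checks, circled answers
    if page.comprehension_score < COMPREHENSION_THRESHOLD:
        return "mmlm"  # text path is silently failing -> fall back
    return "text"      # cheap path for clean, well-formatted pages

# A mixed submission package: clean page, garbled OCR, signed form.
submission = [
    Page(1, 0.95),
    Page(2, 0.31),
    Page(3, 0.88, has_handwriting=True),
]
routes = {p.number: select_path(p) for p in submission}
print(routes)  # {1: 'text', 2: 'mmlm', 3: 'mmlm'}
```

Only page 1 takes the cheap path; the garbled table and the handwritten questionnaire are escalated to the MMLM, which is the per-page granularity the patent emphasizes.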
The patent also describes an upgrade path for existing systems. If a carrier has already ingested documents using the text-based approach and built a text-based retrieval index, it can add MMLM-based extraction without re-ingesting all documents. The system retrieves relevant chunks using the text-based index, then fetches the corresponding page image and provides it to the MMLM for extraction. This allows carriers to adopt the multi-modal capability incrementally without discarding their existing infrastructure.
Mapping to AIG’s Production System
The Chain-of-Thought Prompt = Auto Extract’s Spreadsheet Mode. When Zaffino described Auto Extract on the Q3 2025 earnings call, he emphasized that the system handles “documents in multiple formats.” The chain-of-thought prompting methodology in this patent is how Auto Extract processes the spreadsheet formats that are common in E&S submissions. The four-step reasoning process (count tables, identify tables, extract metadata, reconstruct in markdown) is the technical implementation behind AIG’s ability to ingest broker workbooks that contain financial schedules, loss histories, and property schedules on the same sheet.
The MMLM Path = Processing Scanned and Handwritten Documents. E&S submissions frequently include scanned documents, particularly for older loss run reports, handwritten supplemental questionnaires, and signed applications. The hybrid text/MMLM architecture explains how AIG’s system maintains extraction accuracy across the full spectrum of document quality levels that arrive in the E&S submission flow. The ability to fall back from text to vision processing on a per-page basis is what allows AIG to claim 100% processing coverage for its financial lines submissions.
Unit Standardization = Financial Data Accuracy at Scale. AIG’s reported improvement in data accuracy from 75% to over 90% is partly attributable to the unit standardization capability. In insurance underwriting, a loss run that reports values in thousands and a financial statement that reports in millions must be reconciled before the data can inform a pricing decision. The automated unit conversion described in this patent eliminates a category of errors that have traditionally required manual review.
The Atlanta Innovation Hub Connection. Both inventors are listed in the same locations as the earlier patents (New York and Roswell, GA), confirming that the same team at AIG’s Atlanta Innovation Hub developed all three patents. AIG’s newsroom article about the hub described the team as applying LLMs “to hundreds of documents that arrive in various forms as unstructured data.” The chain-of-thought prompting methodology in this patent represents the specific technique they developed for the most challenging document type in that population.
What the Claims Protect
Patent 12,511,320 contains 19 granted claims covering two methods and one system embodiment.
Claims 1 through 9 cover a method for information extraction using retrieval augmentation. The essential steps are: receiving a document with multiple tables, generating prompts that include requests to identify tables, extract metadata, and reconstruct tables in markdown, generating table chunks from the reconstructed tables, generating vector embeddings for those chunks and storing them in a retrieval index, identifying relevant chunks for an extraction prompt, and storing the LLM’s response. Dependent claims add chain-of-thought prompting structure (Claims 6 and 7), vector embedding distance calculations (Claim 2), keyword search of metadata (Claim 3), table quantification and boundary determination (Claim 4), unit conversion (Claim 5), LLM as the text-based model with markdown input (Claim 8), and MMLM as the extraction model (Claim 9).
Claims 10 through 18 cover a method focused specifically on the ingestion pipeline: receiving the document, generating prompts, creating table chunks, and building the retrieval index. This protects the ingestion methodology independently from the extraction methodology.
Claim 19 covers the system embodiment, specifically protecting the chain-of-thought prompt structure with its four sequential requests: quantify tables, identify tables, extract metadata, and reconstruct using markdown with the metadata from the previous step.
Claim 7 is particularly detailed, protecting a specific four-step chain-of-thought structure where Step 1 quantifies tables, Step 2 identifies tables with explicit reference to the count from Step 1, Step 3 extracts metadata, and Step 4 reconstructs tables using the metadata from Step 3. This level of specificity in the claims suggests that AIG’s legal team viewed the chain-of-thought prompt architecture as a genuinely novel contribution worthy of narrow, defensible protection.
Actuarial and Industry Implications
For pricing actuaries working with submission data. The spreadsheets that flow through commercial insurance underwriting are some of the most varied, inconsistent, and error-prone documents in the actuarial data pipeline. Schedule of values workbooks, financial statement summaries, and loss run compilations all arrive in formats determined by the individual broker, agent, or insured. A system that can reliably parse multi-table spreadsheets, standardize units, and extract structured data from these documents addresses a data quality bottleneck that has persisted for decades. The unit standardization feature alone could eliminate a category of pricing errors that arises when financial data in different scales is inadvertently mixed.
For actuaries evaluating AI readiness. The hybrid text/MMLM architecture described in this patent represents a maturity level that goes beyond what most carriers have implemented. The ability to dynamically select the processing path based on document quality, fall back from text to vision processing when needed, and handle handwritten content alongside machine-generated text suggests that AIG’s system was designed for the full range of real-world document conditions, not just the clean, well-formatted documents that work in demos.
For reserving and claims actuaries. The patent’s claims handling context is less explicit than its underwriting focus, but the technology applies directly. Loss run reports, claim adjuster notes, and settlement documentation often arrive as spreadsheets or scanned documents with the same multi-table, mixed-format characteristics described in the patent. The chain-of-thought approach to table identification and the MMLM’s ability to process handwritten annotations have direct relevance to claims data extraction.
For enterprise risk management. This patent completes the picture of AIG’s AI underwriting infrastructure as a system designed for production resilience. The fallback from text to vision processing, the comprehension scoring that triggers automatic path switching, and the version history tracking (inherited from the companion patents) all reflect an architecture built to handle failure gracefully rather than one that assumes clean inputs. For ERM actuaries evaluating operational risk in AI deployments, the explicit error-handling and fallback mechanisms described in this patent are informative reference points.
For the competitive landscape. With three granted patents covering document extraction (12,437,155), traceability and error control (12,437,154), and multi-table spreadsheet processing (12,511,320), AIG now holds intellectual property protection across the complete lifecycle of AI-assisted document processing for insurance underwriting. The chain-of-thought prompting methodology for spreadsheet table identification is particularly novel and may constrain how competing carriers approach the same problem. Carriers building similar systems will need to evaluate whether their chain-of-thought or multi-step prompting approaches fall within AIG’s claims.
The Complete Patent Portfolio: What AIG Has Built
Viewed together, AIG’s three AI underwriting patents describe a coherent, end-to-end system for processing the full range of documents that arrive in commercial insurance submissions.
Patent #12,437,155 provides the foundational architecture: separating tables from text, creating independent processing pipelines with different retrieval strategies for each, and populating an ontological data store through LLM-based extraction.
Patent #12,437,154 layers on governance: chunk-level traceability, LLM source attribution, hallucination detection, version history, timestamp tracking, and automated citation generation for regulatory compliance.
Patent #12,511,320 extends the system to handle the most challenging document type: complex, multi-table spreadsheets with custom layouts, non-standard metadata, and mixed data formats, using chain-of-thought prompting and a hybrid text/vision processing architecture.
All three patents name the same inventors (Lei Zhang and Christopher Cirelli), the same assignee (American International Group, Inc.), and the same LLM by name (Claude). All three explicitly describe insurance underwriting as the primary use case. And all three populate the same ontological data store, which maps directly to the Palantir Foundry ontology that AIG has described in its public investor communications.
To our knowledge, this is the most technically detailed public documentation of how any major insurer is using large language models in production underwriting. The patents provide a level of architectural specificity that goes far beyond what is available in earnings call transcripts, analyst presentations, or trade press coverage. For actuaries, technologists, and insurance executives trying to understand where AI-assisted underwriting is headed, these patents are primary source material.
Sources
- U.S. Patent No. 12,511,320, “Retrieval augmentation system for unstructured tabular documents,” filed June 6, 2025, granted Dec. 30, 2025. Assignee: American International Group, Inc. Inventors: Lei Zhang, Christopher Allen Cirelli. Justia
- U.S. Patent No. 12,437,155, “Information extraction system for unstructured documents using independent tabular and textual retrieval augmentation,” filed Jan. 24, 2025, granted Oct. 7, 2025. Assignee: American International Group, Inc. Justia
- U.S. Patent No. 12,437,154, “Information extraction system for unstructured documents using retrieval augmentation providing source traceability and error control,” filed Jan. 24, 2025, granted Oct. 7, 2025. Assignee: American International Group, Inc. Justia
- AIG Q3 2025 Earnings Conference Call, Reinsurance News, November 7, 2025. reinsurancene.ws
- AIG Q4 2025 Earnings Call Transcript, Yahoo Finance, February 11, 2026. finance.yahoo.com
- “AIG: Turning One Human Underwriter Into Five, ‘Turbocharging’ E&S.” Carrier Management, April 28, 2025. carriermanagement.com
- “AIG Leaders Discuss GenAI and the Atlanta Innovation Hub.” AIG Newsroom. aig.com
- “AIG to Roll Out AIG Assist Across Lexington by End of 2025.” Coverager, December 1, 2025. coverager.com
- “Anthropic Unveils Claude’s New Finance-Focused Platform.” Fintech Global, July 17, 2025. fintech.global
- “AIG CEO Zaffino Highlights Integration of GenAI to Create Digital Twin of Business.” Insurance Journal, August 12, 2025. insurancejournal.com
Further Reading on actuary.info
- Inside AIG’s Agentic AI Underwriting Machine – full overview of AIG’s three-layer AI stack
- Patent #12,437,155: How AIG Separates Tables from Text – foundational extraction architecture
- Patent #12,437,154: The Traceability and Hallucination Detection Layer – auditability and error control
- Predictive Analytics in Insurance Underwriting 2026 – industry trends and methodology
- The AI Governance Gap in Actuarial Practice – ASOP No. 56 and regulatory frameworks