Insurance companies sit on enormous volumes of unstructured data: medical records attached to bodily injury claims, loss run reports spanning dozens of carriers, attending physician statements running hundreds of pages, regulatory filings in legacy formats, property inspection photographs, and call transcripts from first notice of loss (FNOL). By some industry estimates, unstructured content accounts for roughly 80% of enterprise data at large insurers, and manually processing that content can consume 5 to 10% of total enterprise bandwidth and related costs.

Three of EXL's recently granted patents address this problem directly. Together, they describe a coherent pipeline that moves insurance data from raw unstructured documents through entity extraction and contextual labeling into queryable knowledge graphs. This is not a single invention but rather a layered architecture: US 12,260,342 (Xtrakto.AI) handles the extraction of tables and text from document images; US 12,353,832 (Generic NER) identifies and contextualizes entities within that extracted data; and US 12,482,215 (Knowledge Graph) structures the results into an incrementally updatable, semantically searchable graph.

In this article, we walk through the actual patent claims for each of these three inventions, map them to EXL's disclosed product capabilities, compare the approach to AIG's extraction patents, and identify what each patent protects and what it leaves open for competing implementations.

Patent US 12,260,342: Multimodal Table Extraction and Semantic Search (Xtrakto.AI)

The first patent in EXL's pipeline was granted on March 25, 2025, and is a continuation-in-part of US 11,842,286 (granted December 12, 2023), which covered EXL's foundational machine learning platform for structuring enterprise data. The Xtrakto.AI patent extends that foundation with specific methods for extracting data from image-based documents and performing semantic search across the extracted results.

What the independent claims protect: The patent's method claims (Claim 3) describe a system that receives a query and a document containing unstructured data; performs pre-processing that uses a computer vision model to detect images within the document; generates bounding boxes to encapsulate detected regions; parses secondary images from within those regions; extracts alphanumeric data using a trained machine learning model; identifies "globally applicable items" (document-wide metadata such as patient names or policy numbers) from the images; generates a searchable data structure that stores extracted data relationally to those global items; performs semantic search against that structure; and transmits query responses to a target application.

The system claim (Claim 1) adds specificity: the extraction uses a trained neural network that generates bounding boxes for detected cells in a table, and the system determines coordinate sets within the source document corresponding to query response items so that answers can be traced back to their exact location in the original document.
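The traceability idea in Claim 1 can be illustrated with a small data structure that keeps every extracted value tied to its location. This is a minimal sketch under an assumed (page, box) coordinate convention; `ExtractedCell` and `trace_answer` are invented names for illustration, not EXL's implementation.

```python
# Sketch of coordinate-traceable extraction results. The (x0, y0, x1, y1)
# pixel-box convention is an assumption; the patent only requires that
# coordinate sets link query answers back to the source document.
from dataclasses import dataclass

@dataclass
class ExtractedCell:
    text: str    # alphanumeric data pulled from the cell
    page: int    # page index in the source document
    box: tuple   # (x0, y0, x1, y1) bounding-box coordinates

def trace_answer(cells, answer_text):
    """Return coordinate sets for cells whose text matches the answer,
    so a response can be traced to its exact location in the original."""
    return [(c.page, c.box) for c in cells if c.text == answer_text]

cells = [
    ExtractedCell("John Doe", 1, (40, 100, 210, 120)),
    ExtractedCell("$25", 3, (300, 410, 350, 430)),
]
print(trace_answer(cells, "$25"))  # -> [(3, (300, 410, 350, 430))]
```

A production system would store these coordinates in the searchable data structure itself, so every semantic search hit carries its provenance.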

The CNN-based table detection approach: The patent description reveals that table cell detection uses a convolutional neural network with a ResNet backbone (specifically mentioning ResNet-50 as an option). The fully convolutional component exploits interdependence between table detection and table structure recognition to segment both the table boundary and individual columns. The system generates bounding boxes according to spatial location, then feeds them to either a graph-based engine or a rule-based engine to determine logical cell positions. Text output from OCR is then assigned to predicted cell boxes based on spatial correspondence.

This is a fundamentally different approach from AIG's extraction architecture. AIG's Patent 12,437,155 converts documents to markdown format and separates table content from text content into independent processing pipelines, using chunking and vector embeddings to populate an ontological data store. EXL's approach keeps the visual structure of the document intact, using computer vision to detect cell boundaries within images and then extracting text from those spatially defined regions. AIG's method is text-first (convert to markdown, then process). EXL's method is image-first (detect visual structure, then extract text).

Semantic search on extracted tables: Dependent claims 7 through 9 describe the semantic search layer. After table data is extracted into a searchable structure, the system uses a trained neural network to determine context for a query by relating query terms to values in the searchable data structure, then generates response options based on that context. For insurance applications, this means a user could query "What is the in-network provider cost per office visit?" against an extracted benefits table, and the system would identify relevant columns, determine that "office visit" maps to specific row entries, and return structured response options with coordinate references back to the source document.

Domain-specific ontology support: Claim 14 protects the application of domain-specific ontologies to the extraction process. Claim 15 specifies that these ontologies relate to insurance policy terms, medical information, or medications. This is significant because it means EXL's extraction system is not purely generic. It can leverage insurance-specific vocabularies and medical taxonomies to improve extraction accuracy and resolve ambiguities in source documents. The patent description mentions that a predetermined accuracy threshold (for example, 0.7, 0.8, or 0.9) can be used to identify top medication matches against a reference database, even when source documents contain spelling irregularities. EXL's Xtrakto.AI case studies report 90 to 97% accuracy on medical impairment extraction from attending physician statements, with a capacity to process approximately 30 million pages in four weeks.
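The threshold-based matching the patent describes can be sketched as a fuzzy lookup against a reference list. The similarity measure below (difflib's ratio) and the medication list are stand-ins, not the patented method; only the threshold idea (e.g. 0.8) comes from the patent description.

```python
# Hedged sketch of threshold-based medication matching against a reference
# database, tolerating spelling irregularities in the source document.
from difflib import SequenceMatcher

REFERENCE_MEDS = ["metformin", "lisinopril", "atorvastatin"]  # illustrative

def match_medication(extracted, threshold=0.8):
    """Return reference medications whose similarity to the (possibly
    misspelled) extracted string meets the accuracy threshold."""
    scores = {m: SequenceMatcher(None, extracted.lower(), m).ratio()
              for m in REFERENCE_MEDS}
    return [m for m, s in scores.items() if s >= threshold]

print(match_medication("metfornin"))  # misspelling -> ['metformin']
```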

What this patent does NOT protect: The patent does not claim specific OCR techniques or particular neural network architectures beyond the general CNN/ResNet family. It does not claim methods for processing audio or video data. It does not protect the ontologies themselves, only the method of applying ontologies during extraction. And it does not claim methods for training the extraction models, meaning competitors could build similar CNN-based table extraction systems using different training methodologies without infringing.

Patent US 12,353,832: Generic Contextual Named Entity Recognition (Generic NER)

The second patent in the pipeline was granted on July 8, 2025, and covers a method for identifying entities in unstructured text and then attaching domain-specific context to those entities without requiring domain-specific model retraining. This addresses a practical deployment problem: conventional NER systems require retraining for each new domain, which means large training datasets and significant compute resources every time a system is adapted for a new carrier or a new line of business.

What the independent claims protect: The method claim (Claim 1) describes a system that receives unstructured data, applies an NLP technique to generate a set of labeled entity tokens (each including a value and an automatically determined data type), uses a reverse question-and-answer model to generate a predicted entity key for each token, and then performs entity alignment: the system searches a subscriber-provided ontology and calculates similarity values between the predicted keys and the subscriber's keys to determine domain-specific entity keys.

The chat session claim (Claim 8) extends this to real-time conversational contexts: while a chat session is active, the system performs NER operations on the transcript, generates predicted entity keys, performs semantic matching to determine subscriber-specific keys with confidence scores, and visually emphasizes recognized items in the chat interface or displays the determined keys and confidence scores.

The reverse Q&A technique: This is the most novel element of the patent. Rather than training a model to recognize specific entity types in advance, the system extracts entities generically and then asks questions about them to determine what they are. For example, given a value "1978454970" of type "numeric," the system generates the question "What is value '1978454970' of type 'numeric'?" and causes a Q&A model (the patent mentions ROBERTA as an example) to predict the answer, such as "order id." This reverse Q&A approach means the model can identify entity types it was never explicitly trained on, because it is leveraging the general language understanding capabilities of the Q&A model rather than a fixed entity classification vocabulary.
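The question template from the patent's own example can be sketched directly; the Q&A model itself is stubbed out here, since the real predictor is a trained transformer (the patent names ROBERTA as one option) and the stub's answer is hard-coded for illustration.

```python
# Minimal sketch of the reverse Q&A formulation: build the templated
# question, then ask any question->answer callable to predict the key.
def reverse_question(value, data_type):
    """Build the question the patent describes for a labeled entity token."""
    return f"What is value '{value}' of type '{data_type}'?"

def predict_entity_key(value, data_type, qa_model):
    """qa_model is any callable question -> answer (stub for a trained
    Q&A transformer in a real deployment)."""
    return qa_model(reverse_question(value, data_type))

# Stubbed model: hard-coded answer standing in for the trained model.
stub = lambda q: "order id" if "1978454970" in q else "unknown"
print(predict_entity_key("1978454970", "numeric", stub))  # -> order id
```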

Entity alignment with subscriber ontologies: The second step is where domain specificity enters without model retraining. The predicted entity keys from the Q&A step are matched against a subscriber-provided ontology using similarity techniques. The patent describes three specific methods: Levenshtein distance (with example thresholds of 0.6 to 0.8), Jaro-Winkler distance (with thresholds of 0.9 to 0.95), and Longest Common Subsequence (with thresholds of 0.5 to 0.7). For insurance applications, this means the same NER system could process claims documents for one carrier where the relevant field is called "Claim Number" and another carrier where the same field is called "Case Reference ID," without retraining. The entity alignment step handles the mapping.

The pipeline architecture: The patent describes a BERT-CRF architecture for entity extraction (BERT for contextual embeddings, Conditional Random Fields for sequential labeling), followed by the ROBERTA-based reverse Q&A step, followed by the similarity-based entity alignment. This three-stage pipeline (extract, predict key, align to domain) is designed to be deployable across EXL's entire client base with only the alignment dictionary changing per subscriber.

What this patent does NOT protect: The patent does not claim specific NER model architectures (BERT and ROBERTA are described as examples, not requirements). It does not protect the subscriber ontologies themselves. It does not claim methods for generating training data for the NER models. And it does not claim entity extraction from structured data, only unstructured text. A competitor could build a similar reverse Q&A based NER system using different underlying models (for example, using a different transformer architecture) without infringing the specific claims, though the overall pipeline concept of "extract, reverse Q&A, align" would be harder to design around.

Patent US 12,482,215: Knowledge Retrieval Techniques (Knowledge Graph)

The third patent, granted November 25, 2025, addresses what happens after documents have been processed and entities extracted. Rather than storing results in conventional databases or flat files, this patent describes methods for constructing, querying, and incrementally updating knowledge graphs that capture relationships between concepts across document collections.

What the independent claims protect: The media claim (Claim 1) describes a system that receives a digital document; segments it into content chunks at dynamically identified boundaries using one or more chunking algorithms (content-based on section headings, content-based on named entity recognition, page-based, or fixed-size); generates concepts and relationships from those chunks through proximity analysis on tokens or analysis of resource metadata; generates computer-executable code (the patent specifically mentions Cypher queries for Neo4j graph databases) to create node triples (first concept node, second concept node, relationship node) in a graph structure; executes that code; generates vector embeddings indexed to node identifiers; receives queries; and queries the embeddings to identify relevant nodes.
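The code-generation step can be sketched as emitting Cypher from a (concept, relationship, concept) triple. The `Concept` label and `name` property are invented for illustration; the patent cites Cypher and Neo4j as one example target, not a requirement.

```python
# Hedged sketch of generating computer-executable Cypher from a node
# triple, per the claim's (first concept, second concept, relationship)
# structure. MERGE keeps nodes and edges idempotent on re-execution.
def triple_to_cypher(c1, rel, c2):
    """Emit one MERGE statement creating two concept nodes and the
    relationship between them."""
    rel_type = rel.upper().replace(" ", "_")
    return (f"MERGE (a:Concept {{name: '{c1}'}}) "
            f"MERGE (b:Concept {{name: '{c2}'}}) "
            f"MERGE (a)-[:{rel_type}]->(b)")

print(triple_to_cypher("claimant", "treated by", "provider"))
```

A real pipeline would also escape quotes in concept names and batch statements per transaction; those details are omitted here.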

The context window solution: The patent's detailed description is unusually candid about the technical problem it solves. It explicitly names the "context window problem" that limits LLM processing of large document sets, noting specific context window sizes for known models (Llama 3.1 at 128,000 tokens, Claude 3.5 at 200,000 tokens, GPT-3.5 at 4,000 tokens). Rather than feeding entire documents into LLMs, the system constructs knowledge graphs where relationships are pre-computed and stored as node triples with embeddings. Queries then search the comparatively compact graph representation rather than the full document corpus. This is a practical engineering solution: a set of insurance policy documents, claims files, and regulatory filings that might total millions of tokens can be represented as a knowledge graph small enough to query efficiently.

Incremental update mechanism: Claim 6 protects a critical capability: when source documents are updated, the system hashes existing content chunks, identifies which chunks changed, generates updated node triples only for the changed chunks, and updates the graph structure and associated embeddings accordingly. This avoids full graph regeneration, which the patent describes as "computationally expensive" for conventional knowledge graph systems. For insurance operations where documents are frequently revised (endorsements to policies, supplemental medical records on claims, amended regulatory filings), incremental updates mean the knowledge graph stays current without the cost of rebuilding from scratch.
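The change-detection core of Claim 6 can be sketched with content hashes. Keying chunks by position is an assumption; the patent does not pin down how chunk identity is tracked across document versions (a limitation discussed further below).

```python
# Minimal sketch of hash-based change detection for incremental graph
# updates: only chunks whose hashes differ need new node triples and
# embeddings, avoiding full graph regeneration.
import hashlib

def chunk_hashes(chunks):
    return [hashlib.sha256(c.encode()).hexdigest() for c in chunks]

def changed_chunks(old_chunks, new_chunks):
    """Return indices of chunks whose content changed (or are new)."""
    old_h, new_h = chunk_hashes(old_chunks), chunk_hashes(new_chunks)
    return [i for i, h in enumerate(new_h)
            if i >= len(old_h) or h != old_h[i]]

old = ["Policy declarations.", "Coverage A: $1M.", "Exclusions."]
new = ["Policy declarations.", "Coverage A: $2M.", "Exclusions."]
print(changed_chunks(old, new))  # -> [1]
```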

Multi-modal data integration: Claims 3 through 5 protect pre-processing modules for non-textual data. Computer code can be processed through static or dynamic analysis to generate textual representations. Audio data can be transcribed via speech-to-text processing. Images can be processed via computer vision to generate textual descriptions of their content. In all cases, the output is a textual or alphanumeric representation that can be chunked, concept-extracted, and integrated into the knowledge graph alongside text from conventional documents. For insurance, this means call recordings from FNOL, photographs from property inspections, and even code from actuarial models could theoretically be integrated into a unified knowledge graph.

Node triples and relationship types: The patent description details the types of relationships that can be represented as node triples: hierarchical (car is a subclass of vehicle), causal (exercise causes improvement), spatial (property is located in county), temporal (claim occurred before policy renewal), semantic (deductible is similar to retention), part-whole (policy has endorsements), and attribute-based (insured has coverage limit of $1 million). For insurance document processing, this rich relationship vocabulary means the knowledge graph can capture the kinds of complex, multi-entity relationships that characterize insurance data, such as "claimant was treated by provider for condition diagnosed after incident covered by policy issued by carrier."
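The relationship vocabulary above can be represented as typed node triples. The enum values mirror the categories the patent description lists, but this data model is an assumption for illustration, not EXL's schema.

```python
# Illustrative typed-triple representation of the patent's relationship
# vocabulary (hierarchical, causal, spatial, temporal, semantic,
# part-whole, attribute-based).
from dataclasses import dataclass
from enum import Enum

class RelType(Enum):
    HIERARCHICAL = "subclass of"
    CAUSAL = "causes"
    SPATIAL = "located in"
    TEMPORAL = "occurred before"
    SEMANTIC = "similar to"
    PART_WHOLE = "has part"
    ATTRIBUTE = "has attribute"

@dataclass
class Triple:
    subject: str
    rel: RelType
    obj: str

t = Triple("deductible", RelType.SEMANTIC, "retention")
print(f"{t.subject} {t.rel.value} {t.obj}")  # -> deductible similar to retention
```

Chaining such triples is what lets the graph express multi-entity paths like claimant, provider, diagnosis, and policy linked in sequence.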

What this patent does NOT protect: The patent does not claim specific graph database implementations (Neo4j and Cypher are described as examples). It does not protect specific embedding models (AWS Titan is mentioned as an example). It does not claim the FAISS vector search framework or any particular similarity search algorithm. And it does not protect the chunking algorithms themselves (content-based, page-based, and fixed-size chunking are all well-established techniques). The novelty is in the integrated pipeline: dynamic chunking to concept extraction to node triple generation to embedding indexing to incremental updating, applied to multi-modal document sets.

How the Three Patents Work Together

Viewed individually, each patent protects a distinct capability. Viewed together, they describe a pipeline that could process the full lifecycle of an insurance document:

A bodily injury claim arrives as a scanned PDF containing medical records, treatment summaries in tabular format, and narrative descriptions. The Xtrakto.AI patent (12,260,342) handles the initial extraction: CNN-based cell detection pulls structured data from the medical tables, OCR converts narrative text, and domain-specific ontologies resolve medication names and diagnostic codes with up to 97% accuracy. The Generic NER patent (12,353,832) then processes the extracted text: the BERT-CRF pipeline identifies entities, the reverse Q&A model predicts what those entities represent, and the entity alignment step maps them to the specific carrier's data schema without model retraining. Finally, the Knowledge Graph patent (12,482,215) structures everything into a queryable graph: concepts like "claimant," "provider," "diagnosis," and "treatment" become nodes connected by typed relationships, with vector embeddings enabling semantic search across the full claims file and any related documents.

When supplemental medical records arrive weeks later, the incremental update mechanism identifies what changed and updates only the affected nodes and embeddings. When a claims adjuster queries "What medications was the claimant prescribed after the incident?" the system searches the knowledge graph embeddings rather than re-processing the full document set.

This pipeline is directly comparable to AIG's extraction architecture, but the two approaches solve different problems. AIG's three patents focus exclusively on underwriting submission intake: extracting data from E&S submissions so underwriters can evaluate risks faster. EXL's three patents cover a broader surface area (any unstructured insurance document, not just submissions) and extend further down the analytical chain (through knowledge graphs and semantic search, not just extraction and storage). However, AIG's patents include a traceability and hallucination detection layer (Patent 12,437,154) that EXL's pipeline does not address in these three patents. If a knowledge graph contains an incorrect relationship inferred during concept extraction, EXL's patents do not describe a mechanism for detecting or flagging that error at query time.

What Carriers Should Consider

For carriers evaluating document processing and knowledge management capabilities, EXL's data ingestion patents raise several practical considerations.

The subscriber ontology model described in the Generic NER patent means carriers must provide and maintain domain-specific ontologies for entity alignment. The quality of those ontologies directly affects extraction accuracy. Carriers with well-maintained data dictionaries and standardized field naming conventions will see better results than those with inconsistent legacy schemas.

The knowledge graph incremental update mechanism assumes that source documents can be cleanly versioned and that chunk boundaries remain consistent across versions. In practice, insurance documents are frequently revised in ways that change document structure (pages reordered, sections added or removed), which could complicate the hash-based change detection described in the patent.

The multi-modal integration capability (code analysis, audio transcription, image processing) is described in the patent claims but is not yet prominently featured in EXL's product marketing for Xtrakto.AI, which currently emphasizes document and image processing. Carriers interested in integrating call recordings or inspection photographs into knowledge graphs should verify the maturity of these capabilities before contracting.

Finally, because EXL deploys these patented methods within managed service engagements rather than licensing them as standalone software, carriers should understand the IP implications of the relationship. The extracted data belongs to the carrier, but the methods used to extract it are EXL's patented property. If a carrier wants to build similar capabilities in-house after an EXL engagement ends, the specific pipeline architecture described in these patents would need to be designed around.

Sources

  1. U.S. Patent No. 12,260,342, "Multimodal table extraction and semantic search in a machine learning platform for structuring data in organizations," granted March 25, 2025, assigned to ExlService Holdings, Inc. patents.google.com
  2. U.S. Patent No. 12,353,832, "Generic contextual named entity recognition," granted July 8, 2025, assigned to ExlService Holdings, Inc. patents.google.com
  3. U.S. Patent No. 12,482,215, "Knowledge retrieval techniques," granted November 25, 2025, assigned to ExlService Holdings, Inc. patents.google.com
  4. U.S. Patent No. 11,842,286, "Machine learning platform for structuring data in organizations," granted December 12, 2023, assigned to ExlService Holdings, Inc. patents.google.com
  5. U.S. Patent No. 12,033,408, "Continual text recognition using prompt-guided knowledge distillation," granted July 9, 2024, assigned to ExlService Holdings, Inc. patents.google.com
  6. U.S. Patent No. 12,437,155, "Auto-extracting tabular and textual information from digital documents for populating data retrieval systems," granted February 11, 2025, assigned to American International Group, Inc. patents.google.com
  7. EXL, "Transforming Insurance Operations with EXL XTRAKTO.AI," solution brief. exlservice.com
  8. EXL, "Life insurer improves underwriting efficiency with EXL XTRAKTO.AI," case study. exlservice.com
  9. EXL, "AI Document Extraction & Processing (Intelligent Document Processing)," product page. exlservice.com
  10. EXL, "EXL granted 10 new patents in the last year for AI solutions," GlobeNewsWire, February 9, 2026. globenewswire.com
