AI-Powered Entity Resolution Pipeline for Unstructured Financial Documents
An AI pipeline that links unstructured company presentations to a structured GICS company database — cascading PDF extraction that handles image-only scans, cost-aware sampling that holds language-model spend flat, and a multi-metric matching cascade built to clear an 85% match rate.
Introduction
A financial data and market-intelligence provider had spent years accumulating two assets that were valuable on their own and far more valuable together, if only they could be connected.
The first was an Amazon S3 bucket of company presentations, stored as PDFs and growing all the time: investor decks, corporate overviews, profiles of public and private companies. The second was a structured master database of companies, each record carrying a primary name, an alternative name, a three-level GICS industry classification, and a location.
Apart, each asset left the other underused. The presentations were a deep well of unstructured detail with no index into the company universe the business already tracked. The database was a clean, queryable index with no link to the documents that gave each company its depth. The value sat in joining the two, and that join did not exist: nothing inside a PDF announced, in machine-readable form, which database record it belonged to.
Building that link reliably, across thousands of inconsistent presentations, is the problem this pipeline was built to solve.
The reason it was worth solving lies in what the link unlocks. Once every presentation is tied to a company record, the archive stops being a pile of static files and becomes part of the provider’s searchable intelligence layer. An analyst can move from a company record straight to its supporting documents, and any document can be read against the broader company universe instead of in isolation. That is the real shift: from an archive that merely exists to one the business can ask questions of.
The Challenge
On paper, the task is a join between two datasets. In practice, the two datasets had nothing in common that a join could use. Connecting them meant opening each document, working out which company it described, and locating that company in the database. Every one of those steps resisted automation.
PDFs without text. A large share of the presentations carried no extractable text layer at all. They had been scanned, or exported as flat images, so the obvious first step, reading the text out of the PDF, simply returned nothing. Any approach that assumed clean, selectable text would fail silently on a meaningful fraction of the archive.
Documents too large to process naively. Presentations regularly ran well past 100 pages. Passing a document of that size to a language model is workable once; doing it for every file in a large and growing bucket is slow, and the API cost climbs with every page. Processing the archive in full, the naive way, was never economically viable.
No clean way to match a name. Even with the text in hand, company names rarely lined up. A deck might use a trading name, a former name, a localised name, or an abbreviation, while the database held a legal name and a single alternative. Location and GICS classification could help disambiguate, but there was no shared identifier between the two sources, and no single string-similarity metric that stayed accurate across the full range of mismatches.
Underlying all of it was a firm target: a successful match rate of at least 85%, the threshold at which the linked dataset became dependable enough to build on.
The Solution
We approached the problem as a four-stage pipeline (ingestion, text extraction, data preparation, and data extraction) feeding a final multi-metric matching step. Two design principles shaped it.
The first is a deliberate division of labour between probabilistic and deterministic work. Reading a messy, inconsistent document and pulling a company’s identity out of it is a job for a language model: it absorbs ambiguity, varied phrasing, and missing structure in a way rule-based parsing cannot. Deciding whether two records refer to the same company, and standing behind that decision, is not. That step is deterministic: inspectable matching logic and explicit validation, where the same inputs always produce the same result and every match can be explained.
Language models handle ambiguity and extraction; deterministic matching and validation handle reliability.
The second principle is cost discipline. Every stage does the cheapest thing that can possibly work, escalates only when it has to, and caches the result so the same work is never done twice. A document-processing system that treats compute and API cost as afterthoughts can produce correct results and still be too expensive to run. Here, cost was a design constraint from the first stage to the last.
How the Pipeline Works

Resilient document ingestion. The pipeline begins by mounting the S3 bucket as a local filesystem through s3fs, which lets the rest of the code treat remote objects as ordinary files. Downloads run in parallel across multiple worker threads, so throughput is bound by bandwidth rather than by round-trip latency on one connection at a time. Every file is cached locally the first time it is fetched. From then on, any re-run, whether a code change during development or a re-processing pass after a logic fix, reads from the local cache and skips the network entirely. For an archive of this size, that single decision turns a slow, repeated download into a one-time cost.
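A minimal sketch of the download-once pattern described above, using only the standard library. The `fetch` callable is a hypothetical stand-in for whatever actually reads an object from the bucket (for example, a thin wrapper around an s3fs call); the function names, the `.part` suffix, and the worker count are illustrative assumptions, not the pipeline's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def fetch_cached(key: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return the local path for `key`, calling `fetch` only on a cache miss."""
    local = cache_dir / key
    if local.exists():
        return local  # cache hit: no network access
    local.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first, then rename, so an interrupted
    # download never leaves a truncated file under the final cache name.
    tmp = local.with_name(local.name + ".part")
    tmp.write_bytes(fetch(key))
    tmp.replace(local)
    return local


def fetch_all(
    keys: Iterable[str],
    cache_dir: Path,
    fetch: Callable[[str], bytes],
    workers: int = 8,  # assumed default, tune to available bandwidth
) -> list[Path]:
    """Download many objects concurrently; results keep the order of `keys`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda k: fetch_cached(k, cache_dir, fetch), keys))
```

Threads suit this stage because the work is I/O-bound: while one worker waits on the network, others keep downloading. A second call with the same keys touches only the local disk.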
Cascading text extraction. Because so many PDFs lack a usable text layer, extraction is built as a fallback chain rather than a single step. The pipeline first tries fast, local extraction with pymupdf and pymupdf4llm, converting each document to Markdown, a structure-aware text format that downstream stages can work with cleanly. When that produces nothing usable, the file is escalated to LlamaParse, a heavier parsing service that handles scanned pages, image-based content, and awkward layouts that local extraction cannot. The ordering is deliberate: the fast, free path handles the majority of documents, and only the genuinely difficult files reach the slower, paid tier. The archive is covered in full without paying premium extraction cost for every file.
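The fallback chain can be sketched as an ordered list of extractors tried cheapest first. In this illustration the extractors are plain callables, so the local pymupdf4llm step and the LlamaParse step would each be wrapped in a function that takes a path and returns text. The `is_usable` check and its 200-character threshold are assumptions made for the example; the source does not say how "nothing usable" is detected.

```python
from typing import Callable, Sequence

Extractor = Callable[[str], str]


def is_usable(text: str, min_chars: int = 200) -> bool:
    """Treat output as usable if it holds enough non-whitespace characters."""
    return len("".join(text.split())) >= min_chars


def extract_with_fallback(
    path: str,
    extractors: Sequence[tuple[str, Extractor]],
    min_chars: int = 200,  # hypothetical threshold
) -> tuple[str, str]:
    """Try each (name, extractor) in order; return the first usable result.

    Later extractors are only called when every earlier one failed or
    returned too little text, so the expensive tier sees hard files only.
    """
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            continue  # a failing tier simply escalates to the next one
        if is_usable(text, min_chars):
            return name, text
    raise ValueError(f"no extractor produced usable text for {path}")
```

Returning the name of the tier that succeeded makes it easy to track what share of the archive needed the paid path.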
Data preparation — cost-aware sampling. A 100-page presentation does not need to be read in full to answer the only question that matters at this stage: which company is this? The identifying detail almost always clusters at the front of a deck and again at the back, in title slides, cover pages, and closing contact information. Rather than send entire documents to the model, the pipeline samples around 5,000 tokens from the beginning, 5,000 from the middle, and 5,000 from the end, always cutting on whole-page boundaries so no page is fed in half. The effect is a hard ceiling on cost per document: a 200-page deck and a 20-page deck are processed for effectively the same price, and the archive’s length distribution stops driving the budget.
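One way to implement page-aligned sampling is sketched below. It assumes the document is already split into a list of page strings and approximates token counts by whitespace-separated words; a real tokenizer could be passed in as `count_tokens`. The function names and the rule of always keeping at least one page per window are illustrative choices, not details taken from the pipeline.

```python
from typing import Callable, Iterable, Sequence


def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def sample_pages(
    pages: Sequence[str],
    budget: int = 5000,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[int]:
    """Pick whole pages from the start, end, and middle, about `budget` tokens each.

    Returns sorted page indices. A page is never split: each window stops
    before the page that would push it over budget, but always keeps at
    least one page even if that page alone exceeds the budget.
    """
    n = len(pages)
    picked: set[int] = set()

    def take(order: Iterable[int]) -> None:
        used, taken_any = 0, False
        for i in order:
            if i in picked:
                continue  # already claimed by an earlier window
            cost = count_tokens(pages[i])
            if taken_any and used + cost > budget:
                break
            picked.add(i)
            used += cost
            taken_any = True

    take(range(n))              # head: first pages forward
    take(range(n - 1, -1, -1))  # tail: last pages backward
    take(range(n // 2, n))      # middle: from the midpoint forward
    return sorted(picked)
```

With pages of roughly 1,000 tokens each, a 100-page deck and a 200-page deck both yield 15 sampled pages, which is what caps the per-document cost.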
Data extraction — chunked extraction. The sampled text is split into chunks sized to fit comfortably within the model’s context window, and each chunk is passed to gpt-4o-mini, which extracts a structured set of company attributes: name, location, and the surrounding identifying signals. Splitting before extraction keeps every model call well within its limits and makes the stage resilient to unusually dense documents. The per-chunk results are then merged into a single candidate record, reconciling partial or repeated findings into one coherent description of the company the presentation is about.
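The split-then-merge shape of this stage might look like the sketch below. The model call itself is omitted; `merge_candidates` assumes each chunk has already produced a dictionary of string fields. Splitting on blank lines, the character limit, and majority voting per field are all assumptions for illustration, since the source does not describe the exact chunking or reconciliation rules.

```python
from collections import Counter


def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Group paragraphs into chunks of at most about `max_chars` characters.

    Paragraphs are kept whole; a single oversized paragraph becomes its
    own chunk rather than being cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in text.split("\n\n"):
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def merge_candidates(results: list[dict]) -> dict:
    """Merge per-chunk extractions into one record by majority vote per field.

    Empty or missing values are ignored, so a chunk that found nothing
    cannot overwrite a value found elsewhere. Ties go to the value seen first.
    """
    merged: dict = {}
    for field in sorted({key for result in results for key in result}):
        values = [r[field].strip() for r in results if r.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged
```

For example, if two chunks report the name "Acme Corp" and one reports "ACME", and only one chunk found a country, the merged record keeps "Acme Corp" and that single country.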
Multi-metric matching. With a candidate record assembled, the pipeline matches it against the company database, and this is where the absence of a shared identifier is met head-on. Instead of trusting one similarity measure, the matching stage runs a cascade of them, ordered from simple to complex: token-set and token-sort ratios, Ratcliff–Obershelp, Editex, Refined Soundex, weighted Jaccard, ROUGE-L, partial-string and bag distances, among others. Each metric has different strengths: some forgive word order, others spelling or missing tokens. A name mismatch that defeats one is often caught cleanly by another. Country names are normalised to ISO-3 codes with country_converter so location never fails to match on formatting alone, and GICS classification and location act as tie-breakers when names are genuinely ambiguous. No single metric is treated as authoritative; it is agreement across the cascade that produces a match the business can trust.
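To illustrate the idea of agreement across metrics, here is a small standard-library sketch with three stand-in metrics: a `difflib` sequence ratio (an algorithm in the Ratcliff–Obershelp family), a token-sort ratio, and a token Jaccard score. It is not the pipeline's metric set, and the 0.8 threshold, the two-vote rule, the record field names, and the use of country as the tie-breaker are assumptions. Countries are assumed to be normalised to ISO-3 codes already.

```python
from difflib import SequenceMatcher
from typing import Optional


def _norm(name: str) -> str:
    """Lowercase, drop commas and full stops, collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def sequence_ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Compare names after sorting their tokens, so word order is forgiven."""
    sa = " ".join(sorted(_norm(a).split()))
    sb = " ".join(sorted(_norm(b).split()))
    return SequenceMatcher(None, sa, sb).ratio()


def token_jaccard(a: str, b: str) -> float:
    ta, tb = set(_norm(a).split()), set(_norm(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


METRICS = (sequence_ratio, token_sort_ratio, token_jaccard)


def best_match(
    candidate: dict,
    records: list[dict],
    threshold: float = 0.8,  # hypothetical per-metric cut-off
    min_votes: int = 2,      # hypothetical: metrics that must agree
) -> Optional[dict]:
    """Return the record most metrics agree on, or None if none qualifies."""
    scored = []
    for rec in records:
        names = [n for n in (rec["name"], rec.get("alt_name")) if n]
        # Each metric scores the candidate against the better of the two names.
        scores = [max(m(candidate["name"], n) for n in names) for m in METRICS]
        votes = sum(s >= threshold for s in scores)
        if votes >= min_votes:
            country = candidate.get("country")
            same_country = country is not None and country == rec.get("country")
            scored.append((votes, same_country, sum(scores) / len(scores), rec))
    if not scored:
        return None
    # Rank by votes, then matching country, then mean score.
    return max(scored, key=lambda t: t[:3])[3]
```

A reordered name such as "Holdings Acme Ltd" scores poorly on the plain sequence ratio against "Acme Holdings Ltd" but perfectly on the two token-based metrics, so it still clears a two-vote rule.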
Architecture
The pipeline is built end to end in Python, chosen less for raw speed than for the breadth of its data and document-processing ecosystem. Nearly every component, from S3 access to PDF parsing to string matching, exists as a mature, well-supported library.
Polars handles all of the tabular work, the company database and every intermediate dataset, with a dataframe model that stays fast and memory-efficient as the data grows. joblib runs the heavy stages, which parallelise cleanly across workers, so ingestion and extraction scale with available cores rather than processing one document at a time.
The defining architectural decision, though, is caching at every boundary. Downloaded files, extracted Markdown, and prepared text are each persisted as they are produced. That makes the whole pipeline idempotent: a re-run never repeats work that has already succeeded, only work that has not. During development this turns a multi-hour pass into a fast feedback loop; in production it keeps re-processing a continuously growing archive inexpensive. Every stage is guarded by its own cache check, traced in the detailed flow below.
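Independently of the flow diagram, the cache check that guards a stage can be sketched in a few lines. This illustration stores each stage's output as JSON under a hash of its input key; the directory layout, the JSON format, and the `cached_stage` helper itself are assumptions rather than the pipeline's actual storage scheme.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


def cached_stage(
    cache_dir: Path,
    stage: str,
    key: str,
    compute: Callable[[], Any],
) -> Any:
    """Return the cached result for (stage, key), computing it only on a miss.

    Results must be JSON-serialisable. If `compute` raises, nothing is
    written, so a failed item is retried on the next run while items that
    succeeded are skipped.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / stage / f"{digest}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Wrapping each stage this way gives the idempotent behaviour described above: calling it twice with the same stage and key runs `compute` once.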

Technology Stack
- Language & data — Python as the implementation language; Polars for fast, memory-efficient tabular data; s3fs to mount Amazon S3 as a filesystem.
- PDF extraction — pymupdf and pymupdf4llm for fast local PDF-to-Markdown conversion; LlamaParse as the fallback for scanned and image-based documents.
- AI extraction — gpt-4o-mini for structured extraction of company attributes from document text.
- Parallelism — joblib for running the heavy stages across multiple workers.
- Normalisation — country_converter for resolving country names to consistent ISO-3 codes.
- Matching — fuzzywuzzy, python-Levenshtein, and name_matching for the cascade of string-similarity metrics.
Outcomes
In test runs on representative document batches, the pipeline extracted company identity and produced matches in line with the project’s 85% target, holding up against real, messy documents rather than curated examples. The more durable outcomes are in the engineering:
- Every PDF in the archive becomes processable; image-only and scanned files no longer fall silently out of the dataset.
- Processing cost scales with the number of documents, not their combined page count, so the budget stays predictable as the archive grows.
- Ambiguous, mismatched company names resolve into confident links, because no single metric has to be right on its own.
- A continuously growing archive can be re-processed at near-zero marginal cost, thanks to caching at every stage.
- Each stage (ingestion, extraction, matching) is independent and swappable, so the pipeline adapts instead of being rebuilt.
Together these turn an opaque pile of PDFs into the linked, queryable data a searchable intelligence layer is built from.
Closing
The hard part of most data problems is not storing documents; it is connecting them to what an organisation already knows. This pipeline is a disciplined, repeatable answer to that problem: cost-aware where it processes, rigorous where it matches, and general enough to carry from one document archive to the next. It is a capability we bring to any engagement where unstructured documents and structured data finally need to meet.