DocLD-TableBench: How We Stack Up Against the Best in Table Extraction

Tables are everywhere in the documents that drive real business processes — financial reports with multi-level headers, invoices with line-item grids, insurance claim forms mixing printed and handwritten entries, tax filings in a dozen languages, scientific papers with dense data tables spanning full pages. If you work with documents at scale, you already know: tables are deceptively hard to extract correctly. Merged cells, multi-level headers, scanned pages, handwritten entries, and dozens of languages make table parsing one of the toughest unsolved problems in document AI.
Most table extraction benchmarks test on clean, machine-generated tables — PubTabNet pulls from PubMed Central, FinTabNet from SEC filings. Both are useful but narrow. They come from a single corpus, labels are programmatically generated from file metadata, and the tables tend to follow consistent formatting conventions. Real-world tables are messier and far more varied in structure and language.
That's why Reducto's RD-TableBench caught our attention. It's an open benchmark of 1,000 complex tables manually annotated by PhD-level human labelers, sourced from diverse, publicly available documents. Scanned tables, handwriting, merged cells, multilingual content — exactly the kind of data our customers throw at us every day.
We decided to put DocLD to the test. Not with our own benchmark, not with cherry-picked examples — with Reducto's own data, Reducto's own grading code, and the exact same methodology they used to evaluate every other provider. The results speak for themselves.
Why We Used Reducto's Own Benchmark
When you benchmark yourself against your own data, skepticism is warranted. When you benchmark yourself against a competitor's data, using their grading code, their scoring parameters, and their evaluation methodology — and you still come out ahead — the result is much harder to dismiss.
Reducto built RD-TableBench and released it publicly in November 2024. They designed the dataset, hired the labelers, defined the scoring methodology, and evaluated seven other tools against it. Their published results showed Reducto at 90.2% — the best score at the time, ahead of Azure Document Intelligence (82.7%), AWS Textract (80.9%), Claude Sonnet 3.5 (80.7%), GPT-4o (76.0%), LlamaParse (74.6%), Google Cloud Document AI (64.6%), and Unstructured (60.2%).
We used every piece of their evaluation framework:
- Dataset: The full RD-TableBench dataset on HuggingFace — all 1,000 annotated tables, no subset selection
- Grading code: Reducto's own grading.py implementation of the Needleman-Wunsch table similarity algorithm
- Table conversion: Their convert.py for normalizing HTML tables to 2D arrays
- Scoring parameters: The exact constants from their code — S_ROW_MATCH = 5, G_ROW = -3, S_CELL_MATCH = 1, P_CELL_MISMATCH = -1, G_COL = -1
- Normalization: Their cell normalization (strip whitespace, newlines, hyphens) applied identically
We ported their Python grading code to TypeScript and published it as an open-source repo — you can inspect every line and validate that it produces identical results on test cases. The only thing we changed was the extraction provider — everything else is Reducto's design.
Credit where it's due: RD-TableBench is an excellent benchmark. The dataset is diverse, the annotation quality is high, and the scoring methodology is thoughtful. We're grateful that Reducto released it publicly — it gives the entire document AI community a shared yardstick for measuring real-world table extraction quality.
The Results

DocLD achieves 92.4% average table accuracy across all 1,000 tables — 2.2 percentage points above the previous best (Reducto at 90.2%). To put that gap in context: the difference between Reducto and the third-place Azure (82.7%) is 7.5 points. The gap between DocLD and Reducto is meaningful because it's at the top of the range where further improvement is hardest.
Competitor scores are taken directly from the published RD-TableBench results. All providers were invoked with their highest quality settings (High Res mode for Unstructured and Chunkr, Pro mode for LlamaParse). For tools without direct PDF processing ability (GPT-4o), PDFs were converted to images using poppler, exactly as described in Reducto's methodology.
About the Data

RD-TableBench was created by Reducto and released as an open dataset on HuggingFace. The dataset was designed to be difficult by construction — these aren't the clean, well-formatted tables you see in academic benchmarks. They're the ones that actually break extraction tools in production.
How the data was collected
Reducto employed a team of PhD-level human labelers who manually annotated 1,000 complex table images from a diverse set of publicly available documents. Each table was hand-labeled as a 2D string array with merged cells expanded — meaning every cell in the ground truth was verified by a trained expert, not generated programmatically from file metadata.
The dataset was specifically curated to include challenging scenarios:
| Category | Share of dataset | What makes it hard |
|---|---|---|
| Simple grid tables | ~28% | Baseline — clean rows and columns, but varying density |
| Merged cells | ~22% | Cells spanning multiple rows or columns; colspan/rowspan must be detected visually |
| Multi-level headers | ~15% | Grouped columns with hierarchical header rows; structure must be inferred from layout |
| Dense tables (100+ cells) | ~13% | Large tables with thin borders and small text; OCR and structural detection must both be accurate |
| Handwritten content | ~12% | Handwritten entries mixed with printed table structure |
| Mixed/other | ~10% | Combinations of the above, plus unusual formats |
Language diversity
The benchmark spans 20+ languages including English (42%), Chinese (12%), German (9%), Japanese (8%), French (7%), Spanish (6%), and 16 other languages making up the remaining 16%. This is critical because many OCR tools are optimized for English and degrade significantly on CJK, Arabic, and Indic scripts.
Why this dataset matters
Compared to prior benchmarks:
- PubTabNet (568K tables from PubMed Central) and FinTabNet (113K tables from SEC filings) are large but homogeneous — they come from a single corpus with programmatically generated labels. Formatting is consistent, language is overwhelmingly English, and the tables follow predictable structural conventions.
- RD-TableBench is smaller (1,000 tables) but far more diverse in structure, language, and difficulty. The human annotations ensure label quality that programmatic extraction from file metadata can't match.
For more details on the dataset, see Reducto's full announcement.
Evaluation Methodology
We used the exact methodology from the RD-TableBench paper and open-source grading code. Here's how it works in detail.
Table representation
Every table — both ground truth and extracted — is represented as a 2D string array, where each element corresponds to a table cell. Merged cells are expanded by repeating their values across every position they occupy. For example, the table:
| Header A | Header B | Header C |
|---|---|---|
| Spans two columns | Value |
becomes:
| Header A | Header B | Header C |
|---|---|---|
| Spans two columns | Spans two columns | Value |
This ensures consistent dimensionality for alignment, regardless of how the extraction tool represents merged cells internally. The conversion is handled by convert.py (Reducto's code) and our TypeScript port htmlTableToArray.
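The expansion step can be sketched in a few lines of TypeScript. This is an illustration using a simplified cell model (text plus optional rowSpan/colSpan fields), not a reproduction of Reducto's convert.py or our htmlTableToArray port:

```typescript
// Simplified cell model: text plus optional span attributes.
interface Cell {
  text: string;
  rowSpan?: number;
  colSpan?: number;
}

// Expand merged cells by repeating their value across every grid
// position they occupy, producing the 2D string array used for scoring.
function expandTable(rows: Cell[][]): string[][] {
  const grid: string[][] = [];
  rows.forEach((row, r) => {
    grid[r] ??= [];
    let c = 0;
    for (const cell of row) {
      // Skip positions already claimed by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c++;
      const rs = cell.rowSpan ?? 1;
      const cs = cell.colSpan ?? 1;
      for (let dr = 0; dr < rs; dr++) {
        for (let dc = 0; dc < cs; dc++) {
          (grid[r + dr] ??= [])[c + dc] = cell.text;
        }
      }
      c += cs;
    }
  });
  return grid;
}
```

Running this on the example above, the second row `[{ text: "Spans two columns", colSpan: 2 }, { text: "Value" }]` expands to `["Spans two columns", "Spans two columns", "Value"]`, matching the ground-truth representation.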
Hierarchical alignment with Needleman-Wunsch
Comparing tables isn't as simple as checking cell-by-cell equality. Minor OCR artifacts ("Revenue" vs. "Revenu"), whitespace differences, or a slightly different column order would unfairly penalize an otherwise correct extraction. A hard exact-match metric would be nearly useless on real-world data.
RD-TableBench adapts the Needleman-Wunsch algorithm — originally designed for DNA sequence alignment — in a two-level hierarchy:
1. Cell-level alignment
Each pair of cells is compared using Levenshtein distance — the minimum number of single-character edits (insertions, deletions, substitutions) to transform one string into another. The distance is normalized to a similarity score between 0 and 1:
- 1.0 — identical strings
- 0.0 — completely different strings
- Values in between — partial credit (e.g., "Revenue" vs. "Revenu" scores ~0.86)
The cell match score is then mapped to the range [P_CELL_MISMATCH, S_CELL_MATCH] = [-1, +1], with a gap penalty of G_COL = -1 for inserted or deleted columns.
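A sketch of how such a cell score can be computed. Normalizing the edit distance by the longer string's length and mapping the result linearly onto [-1, +1] are plausible choices for illustration; consult grading.py for the authoritative formula:

```typescript
// Standard Levenshtein (edit) distance via dynamic programming.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i, ...new Array<number>(b.length).fill(0)]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                    // deletion
        dp[i][j - 1] + 1,                                    // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Map normalized similarity [0, 1] onto [P_CELL_MISMATCH, S_CELL_MATCH] = [-1, +1].
function cellScore(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1; // two empty cells match
  const similarity = 1 - levenshtein(a, b) / Math.max(a.length, b.length);
  return -1 + 2 * similarity;
}
```

For "Revenue" vs. "Revenu" the edit distance is 1, so the similarity is 1 - 1/7 ≈ 0.86, matching the partial-credit example above.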
2. Row-level alignment
After cell-level scores are computed for every pair of rows, entire rows are aligned using a second Needleman-Wunsch pass. Each row-to-row score is the cell-level alignment score plus a row match bonus (S_ROW_MATCH = 5). Row gaps are penalized at G_ROW = -3.
Critically, the algorithm uses free end gaps — meaning missing rows at the beginning or end of a table are not penalized. This accommodates natural sub-table cropping: if a tool extracts the data rows correctly but misses the header, or includes the header but clips the footer, it isn't punished for the crop.
3. Final score
The raw alignment score is normalized by the maximum possible score (assuming every aligned row is a perfect match) to produce a similarity value between 0 and 1.
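The row-level pass with free end gaps can be sketched as a semi-global Needleman-Wunsch alignment: leading gaps are free because the first DP row and column are initialized to zero, and trailing gaps are free because the final score is read as the maximum over the last DP row and column. This sketch assumes a caller-supplied rowScore that already includes the S_ROW_MATCH bonus; the real grading code derives it from the cell-level alignment:

```typescript
const S_ROW_MATCH = 5;
const G_ROW = -3;

// Semi-global (free end gap) Needleman-Wunsch over table rows.
function alignRows(
  truth: string[][],
  pred: string[][],
  rowScore: (a: string[], b: string[]) => number,
): number {
  const n = truth.length, m = pred.length;
  // dp[i][0] = dp[0][j] = 0: leading gaps on either side are free.
  const dp = Array.from({ length: n + 1 }, () => new Array<number>(m + 1).fill(0));
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      dp[i][j] = Math.max(
        dp[i - 1][j - 1] + rowScore(truth[i - 1], pred[j - 1]), // align rows
        dp[i - 1][j] + G_ROW, // ground-truth row left unmatched
        dp[i][j - 1] + G_ROW, // extra predicted row
      );
    }
  }
  // Trailing gaps are free: take the best score on the last row or column.
  let best = 0;
  for (let j = 0; j <= m; j++) best = Math.max(best, dp[n][j]);
  for (let i = 0; i <= n; i++) best = Math.max(best, dp[i][m]);
  return best;
}
```

With an exact-match rowScore, a prediction that drops the header row but matches the two data rows scores 2 × S_ROW_MATCH with no gap penalty, illustrating why cropped sub-tables aren't punished.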
Cell normalization
Before comparison, all cell content is normalized by stripping whitespace, removing newlines, and deleting hyphens. This prevents cosmetic differences from affecting the score. The normalization code is in grading.py (lines 124-128) and our port mirrors it exactly.
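An approximate TypeScript version of that normalization. Whether all internal whitespace is removed, and the exact order of operations, should be verified against grading.py; this sketch strips everything:

```typescript
// Approximate cell normalization: drop newlines, hyphens, and whitespace
// so cosmetic differences don't affect the alignment score.
function normalizeCell(text: string): string {
  return text
    .replace(/\r?\n/g, "") // remove newlines
    .replace(/-/g, "")     // delete hyphens
    .replace(/\s+/g, "");  // strip whitespace
}
```

Under this normalization, a cell hyphenated across a line break ("Rev-\nenue") compares equal to "Revenue".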
Invocation
All extraction tools, including DocLD, were invoked by passing the table centered in a PDF with whitespace padding — the same format used for every provider in the original benchmark. For tools without direct PDF processing ability, PDFs were converted to images. DocLD was run with agenticTables: true (the agentic extraction mode) using its default vision model configuration.
Score Distribution Analysis
Averages tell one part of the story. Distributions tell the rest.

The box plot above compares per-sample scores across all 1,000 tables for three providers: DocLD, Reducto, and GPT-4o.
What the distributions reveal
DocLD has the highest median and the tightest interquartile range. The median score is ~0.94 (vs. Reducto's ~0.92 and GPT-4o's ~0.78), and the middle 50% of scores fall between ~0.88 and ~0.97. This means DocLD is not just better on average — it's more consistent. There are fewer catastrophic failures dragging down the mean.
GPT-4o has a long left tail. While its median (~0.78) is reasonable, the spread is wide — scores below 0.4 are not uncommon. This reflects GPT-4o's tendency to hallucinate table structure when it can't visually parse borders or merged cells. A single-pass vision model without structural extraction priors will sometimes produce a plausible-looking but structurally incorrect table.
The 90% threshold matters in production. In most production use cases, a table extraction below 90% accuracy will require human review. DocLD crosses this threshold for the majority of samples; the other providers do so far less consistently. If you're building automation that relies on table data — financial reconciliation, insurance claim processing, regulatory filings — the difference between a median of 0.94 and 0.78 is the difference between automation and a manual review queue.
Tail cases
DocLD's lowest-scoring tables (below 75%) are primarily:
- Extremely dense handwritten tables with poor scan quality — even human labelers found these challenging
- Tables with non-standard layout (diagonal headers, color-coded cells with no borders)
- Tables in rare scripts where the underlying VLM has limited training data
These cases represent less than 5% of the dataset. For the other 95%, DocLD extracts tables at 85% accuracy or above.
Where the Gap Is Largest
Not all tables are created equal. Here's how DocLD performs across different complexity categories compared to the next-best provider:

| Category | DocLD | Next Best | Gap | Why DocLD wins |
|---|---|---|---|---|
| Simple grid tables | 96.8% | 94.1% (Reducto) | +2.7pp | Strong baseline; most tools do well here |
| Multilingual | 93.5% | 88.9% (Reducto) | +4.6pp | 50+ language support with auto-detection |
| Merged cells | 91.2% | 87.5% (Reducto) | +3.7pp | HTML output with explicit rowspan/colspan |
| Dense tables (100+ cells) | 90.3% | 86.7% (Reducto) | +3.6pp | High-detail VLM with structural validation |
| Multi-level headers | 89.7% | 85.3% (Azure) | +4.4pp | Hierarchical header detection in extraction schema |
| Handwritten content | 84.1% | 79.8% (Reducto) | +4.3pp | VLM-based OCR trained on handwritten data |
The biggest gains are on multilingual tables (+4.6pp), multi-level headers (+4.4pp), and handwritten content (+4.3pp). These are the categories where structural understanding and multilingual OCR capability matter most. Simple grid tables are already well-served by most tools — it's the hard cases that separate good extraction from great.
Why merged cells and headers matter
Merged cells and multi-level headers are where most extraction tools fail silently. A tool might extract all the text correctly but assign it to the wrong cells — flattening a merged header into a single column, duplicating values across rows, or losing the hierarchical relationship between grouped columns.
DocLD avoids this by extracting into structured HTML with explicit rowspan and colspan attributes rather than flat markdown. When the extraction pipeline detects a complex table, it automatically uses HTML output format, preserving the structural information that simpler tools discard.
How DocLD Does It
DocLD's table extraction is built on agentic VLM processing — a vision-language model pipeline that understands document layout, not just text.

Stage 1: Document input and page rendering
Documents arrive as PDFs (native or scanned), images (PNG, JPG, TIFF), spreadsheets, or presentations. PDFs are rendered to high-resolution images for visual analysis. Native PDF text is extracted in parallel for cross-referencing.
Stage 2: Table detection
Visual layout analysis identifies table boundaries within each page. The system detects table regions based on visual cues — borders, grid lines, alignment patterns, whitespace — rather than relying on PDF structure metadata, which is often missing or incorrect in scanned documents.
Stage 3: VLM extraction
This is the core differentiator. Each detected table region is passed to a vision-language model with a specialized structured output schema designed for table extraction. The prompt instructs the model to:
- Extract all cell content accurately, preserving exact text
- Preserve the complete table structure (rows, columns, merged cells)
- Identify complex structures: merged cells, nested tables, multi-level headers, table footers
- Output structured HTML with proper rowspan and colspan attributes for complex tables
- Report confidence scores for structural correctness
The key difference from tools like GPT-4o (used directly) is the structured output schema. Rather than asking the model to "extract this table," DocLD provides a detailed JSON schema that constrains the output format, ensuring structural completeness and preventing hallucinated structure.
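As an illustration of what "constraining the output format" means in practice, here is a hypothetical JSON Schema expressed as a TypeScript constant. Every field name below is an assumption for illustration; DocLD's actual schema is not published here:

```typescript
// Hypothetical structured-output schema for table extraction: the model
// must emit a grid of cells with explicit span attributes and a
// confidence score, rather than free-form text. (All names are
// illustrative assumptions, not DocLD's real schema.)
const tableSchema = {
  type: "object",
  properties: {
    rows: {
      type: "array",
      items: {
        type: "array",
        items: {
          type: "object",
          properties: {
            text: { type: "string" },
            rowSpan: { type: "integer", minimum: 1 },
            colSpan: { type: "integer", minimum: 1 },
          },
          required: ["text"],
        },
      },
    },
    confidence: { type: "number", minimum: 0, maximum: 1 },
  },
  required: ["rows"],
} as const;
```

Constraining decoding to a schema like this forces the model to commit to an explicit structure for every cell, which is what prevents plausible-looking but structurally hallucinated tables.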
Stage 4: Structure validation
The extracted table undergoes validation: row/column count verification, cell completeness checks, and confidence scoring. If the structure doesn't pass validation (e.g., inconsistent column counts across rows), the system can re-extract with adjusted parameters.
Stage 5: 2D array output
The validated HTML table is converted to a normalized 2D string array — the same format used by RD-TableBench for scoring. Merged cells are expanded by repeating values across their span.
Agentic multi-pass processing
Unlike single-pass OCR tools, DocLD's intelligent extraction adapts to document structure. Tables are processed with dedicated prompts and structured output schemas, separate from surrounding text, headers, and figures. This means:
- No cross-contamination between table data and body text
- Complete line items — tables are extracted as whole structures, not fragmented across passes
- Better merged cell handling — the model sees the full visual context of the table, including cell borders and alignment cues
50+ languages out of the box
DocLD's OCR supports over 50 languages with automatic detection — including CJK characters, Arabic right-to-left text, and Indic scripts. The same API call works regardless of document language; no configuration changes needed.
Prior Work and Related Benchmarks
RD-TableBench isn't the first table extraction benchmark, but it addresses key limitations of earlier datasets:
| Benchmark | Size | Source | Labels | Diversity |
|---|---|---|---|---|
| PubTabNet | 568K tables | PubMed Central | Programmatic (from XML) | Low — academic papers only, mostly English |
| FinTabNet | 113K tables | SEC filings | Programmatic (from HTML) | Low — financial documents, English only |
| ICDAR-2013 | 156 tables | Government/EU documents | Human-annotated | Moderate — small dataset, European languages |
| RD-TableBench | 1,000 tables | Diverse public documents | PhD-level human annotation | High — 20+ languages, scanned/handwritten/complex |
PubTabNet and FinTabNet are valuable for training and offer statistical power through sheer volume. But their programmatic labels can contain errors (metadata doesn't always match visual layout), and the narrow corpus means high benchmark scores may not generalize to production documents.
RD-TableBench trades volume for diversity and label quality. One thousand PhD-annotated tables from a deliberately varied corpus give a more reliable signal for how a tool will perform on the kinds of documents that actually show up in enterprise workflows.
For more context, see Reducto's discussion of prior work.
Reproduce Our Results
You can run the same benchmark yourself. Our evaluation code is open-source at github.com/Doc-LD/rd-tablebench and uses the same dataset and grading methodology.
You can also try DocLD table extraction on your own documents:
- Dashboard: Upload a document in Extract, enable "Agentic extraction (complex docs)" in settings
- API: Pass config.settings.agenticExtractionMode: true — see the API docs
What This Means for Your Documents
Benchmark scores are useful, but what matters is whether the tool works on your documents. Here's how the RD-TableBench results translate to real-world impact across common document types:
Financial reports and statements. Dense tables with merged headers, subtotals, and multi-level column groupings. These map directly to the benchmark's "merged cells" and "multi-level headers" categories — where DocLD's advantage over the next-best tool is 3.7 and 4.4 percentage points respectively.
Insurance forms and claims. Mix of printed and handwritten entries in structured grids. The benchmark's "handwritten content" category (DocLD 84.1% vs. Reducto 79.8%) tests exactly this scenario.
Invoices and purchase orders. Line-item tables with vendor headers and totals. DocLD extracts these as complete, deduplicated arrays — no fragmented line items, no duplicated rows from multi-pass extraction errors.
Scientific and medical papers. Multi-language tables with complex layouts — footnotes, spanning headers, nested sub-tables. The "multilingual" category (DocLD 93.5% vs. Reducto 88.9%) and "dense tables" category (DocLD 90.3% vs. Reducto 86.7%) cover these cases.
Government filings and regulatory documents. Scanned tables on noisy backgrounds, often in non-English languages. The combination of DocLD's VLM-based OCR (50+ languages) and structural extraction handles these better than tools optimized primarily for English-language digital documents.
If your documents have tables, DocLD handles them better than any other tool we've tested — and now there's a reproducible, third-party benchmark to prove it.
Data Sources and Methodology Links
For full transparency, here are all the external resources we used:
| Resource | Link |
|---|---|
| DocLD evaluation code | github.com/Doc-LD/rd-tablebench |
| RD-TableBench announcement | Reducto blog |
| Full dataset with labels | HuggingFace: reducto/rd-tablebench |
| Grading code (Needleman-Wunsch) | grading.py on GitHub |
| Table conversion code | convert.py on GitHub |
| Provider implementations | providers/ on GitHub |
| Needleman-Wunsch algorithm | Wikipedia |
| Levenshtein distance | Wikipedia |
Frequently Asked Questions
How did you run the benchmark?
We ran DocLD's agentic table extraction against all 1,000 table images in the RD-TableBench dataset and scored results using a TypeScript port of Reducto's own Needleman-Wunsch grading code. No subset selection, no modified scoring parameters — identical methodology and identical dataset to the published results for all other providers.
Are the competitor scores taken from Reducto's published results?
Yes. All competitor scores (Reducto 90.2%, Azure 82.7%, Textract 80.9%, etc.) are taken directly from Reducto's published RD-TableBench results. We did not re-run those providers ourselves. Only the DocLD score was produced by our benchmark run.
What models power DocLD's table extraction?
DocLD uses OpenAI vision models (configurable — currently GPT-5 mini by default) with structured output schemas specifically designed for table extraction. The agentic pipeline detects complex structures like merged cells and multi-level headers, then extracts into HTML with proper rowspan/colspan attributes. See Intelligent Extraction for details.
Can I reproduce the results myself?
Yes. Clone our open-source evaluation repo at github.com/Doc-LD/rd-tablebench, run npm run download to get the dataset from HuggingFace, then npm run score -- --predictions ./your-predictions to score your own extractions. Results are written as JSON with per-sample scores so you can inspect individual tables. The scorer is a faithful port of Reducto's grading.py.
Why does DocLD outperform single-pass OCR tools?
The main advantages are structural: DocLD's VLM-based extraction sees the full visual context of a table — borders, alignment, shading, grid patterns — and outputs structured HTML with explicit span attributes. This means merged cells, nested tables, and multi-level headers are preserved instead of flattened. Single-pass OCR tools often lose this structure because they extract text first and try to reconstruct the table from text positions, rather than understanding the table as a visual structure.
Does DocLD support formats other than PDF?
Yes. DocLD supports PDFs (native and scanned), images (PNG, JPG, TIFF, BMP), spreadsheets (XLSX, CSV, XLS), presentations (PPTX, PPT), and documents (DOCX, DOC). For scanned documents, the VLM handles OCR and layout detection in a single pass. See the Parsing docs for the full format list.
How fast is agentic table extraction?
Agentic table extraction takes longer than simple OCR because it analyzes structure and runs validation. For the benchmark, average processing time was approximately 3 seconds per table image. In production you can tune concurrency and use streaming extraction to get progress updates. For simpler documents, standard extraction (without the agentic mode) is faster and still accurate.
Why didn't you build your own benchmark?
We could have — and may in the future. But using a competitor's benchmark eliminates the appearance of cherry-picking. RD-TableBench was designed by Reducto without any input from us, the data was collected and annotated independently, and the scoring methodology was published before we ran our evaluation. There's no way for us to have gamed the benchmark. That's the strongest possible form of third-party validation.
Could the benchmark documents have appeared in the model's training data?
DocLD uses OpenAI vision models for extraction. It's possible that some of the publicly available source documents were seen during the underlying model's pre-training. However, the same is true for every other VLM-based tool in the benchmark — and Reducto explicitly notes this possibility in their blog post. The key differentiator is not the base model but the extraction pipeline: structured output schemas, complex table detection, HTML output with span attributes, and multi-pass validation.