Evaluating OCR-to-Markdown Systems Is Fundamentally Broken (and Why That’s Hard to Fix)

January 19, 2026

Evaluating OCR systems that convert PDFs or document images into Markdown is far more complex than it appears. Unlike plain-text OCR, OCR-to-Markdown requires models to recover content, layout, reading order, and representation choices simultaneously. Today’s benchmarks attempt to score this with a mix of string matching, heuristic alignment, and format-specific rules, but in practice these approaches routinely misclassify correct outputs as failures.

This post outlines why OCR-to-Markdown evaluation is inherently underspecified, examines common evaluation techniques and their failure modes, highlights concrete issues observed in two widely used benchmarks, and explains why LLM-as-judge is currently the most practical way to evaluate these systems, despite its imperfections.

Why OCR-to-Markdown Is Hard to Evaluate

At its core, OCR-to-Markdown does not have a single correct output.

Multiple outputs can be equally valid:

Multi-column layouts can be linearized in different reading orders.
Equations can be represented using LaTeX, Unicode, HTML, or hybrids.
Headers, footers, watermarks, and marginal text may or may not be considered “content” depending on task intent.
Spacing, punctuation, and Unicode normalization often differ without affecting meaning.

From a human or downstream-system perspective, these outputs are equivalent. From a benchmark’s perspective, they often are not.
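A minimal sketch of the last point, using only Python's standard library. The `canonical` helper is a hypothetical name introduced here for illustration, and the two strings are constructed examples: one uses precomposed accented characters, the other uses combining accents and a doubled space. They render the same but are not equal as raw strings.

```python
import unicodedata


def canonical(text: str) -> str:
    """Collapse two meaning-preserving differences: Unicode form and whitespace runs."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


# Same visible text, different code points and spacing.
composed = "Universit\u00e9 de Marne-la-Vall\u00e9e"
decomposed = "Universite\u0301  de Marne-la-Valle\u0301e"

print(composed == decomposed)                        # False
print(canonical(composed) == canonical(decomposed))  # True
```

A benchmark that compares raw strings treats these two outputs as different, even though a reader could not tell them apart.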

Common Evaluation Techniques and Their Limitations

1. String-Based Metrics (Edit Distance, Exact Match)

Most OCR-to-Markdown benchmarks rely on normalized string comparison or edit distance.

Limitations

Markdown is treated as a flat character sequence, ignoring structure.
Minor formatting differences produce large penalties.
Structurally incorrect outputs can score well if text overlaps.
Scores correlate poorly with human judgment.

These metrics reward formatting compliance rather than correctness.
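The second and third limitations can be illustrated with a small sketch. It assumes a plain Levenshtein distance normalized by the longer string's length, which is one common choice and not necessarily the formula any particular benchmark uses. The table strings are made-up examples.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 means identical strings; lower means more character edits."""
    longest = max(len(a), len(b)) or 1
    return 1 - edit_distance(a, b) / longest


reference = "| Alice | 30 |\n| Bob | 25 |"
compact = "|Alice|30|\n|Bob|25|"          # same table, no cell padding
swapped = "| Alice | 25 |\n| Bob | 30 |"  # ages attached to the wrong rows

print(f"{similarity(reference, compact):.2f}")  # 0.70
print(f"{similarity(reference, swapped):.2f}")  # 0.85
```

The table with the wrong values in each row scores higher than the correct table with different padding, which is the opposite of how a human reviewer would rank them.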

2. Order-Sensitive Block Matching

Some benchmarks segment documents into blocks and score ordering and proximity.

Limitations

Valid alternative reading orders (e.g., multi-column documents) are penalized.
Small footer or marginal text can break strict ordering constraints.
Matching heuristics degrade rapidly as layout complexity increases.

Correct content is often marked wrong due to ordering assumptions.
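As a rough illustration, consider a two-column page with two blocks per column. The sketch below uses a simple pairwise-order agreement score, chosen here for clarity and not taken from any specific benchmark. It assumes both sequences contain exactly the same block labels, and the labels `L1`, `R1`, and so on are hypothetical.

```python
from itertools import combinations


def pairwise_order_agreement(reference, predicted):
    """Fraction of block pairs that keep the reference's relative order."""
    position = {block: i for i, block in enumerate(predicted)}
    pairs = list(combinations(reference, 2))
    agree = sum(position[a] < position[b] for a, b in pairs)
    return agree / len(pairs)


# Ground truth reads down the left column, then down the right column.
column_major = ["L1", "L2", "R1", "R2"]
# The model reads across rows instead, which is also a defensible order.
row_major = ["L1", "R1", "L2", "R2"]

same_content = sorted(column_major) == sorted(row_major)
order_score = pairwise_order_agreement(column_major, row_major)

print(same_content)           # True
print(round(order_score, 2))  # 0.83
```

Every block is recovered, yet the order-sensitive score drops below 1.0, and the penalty grows as the number of blocks per column increases.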

3. Equation Matching via LaTeX Normalization

Math-heavy benchmarks often expect equations to be rendered as complete LaTeX.

Limitations

Unicode or partially rendered equations are penalized.
Equivalent LaTeX expressions using different macros fail to match.
Mixed LaTeX/Markdown/HTML representations are not handled.
Rendering-correct equations still fail string-level checks.

This conflates representation choice with mathematical correctness.

4. Format-Specific Assumptions

Benchmarks implicitly encode a preferred output style.

Limitations

HTML tags in model output cause matching failures.
Unicode symbols (e.g., km²) are penalized against LaTeX equivalents.
Spacing and punctuation inconsistencies in ground truth amplify errors.

Models aligned to benchmark formatting outperform more general OCR systems.

Issues Observed in Existing Benchmarks

Benchmark A: olmOCRBench

Manual inspection reveals that several subsets embed implicit content omission rules:

Headers, footers, and watermarks that are visibly present in documents are explicitly marked as absent in ground truth.
Models trained to extract all visible text are penalized for being correct.
These subsets effectively evaluate selective suppression, not OCR quality.

Additionally:

Math-heavy subsets fail when equations are not fully normalized LaTeX.
Correct predictions are penalized due to representation differences.

As a result, scores strongly depend on whether a model’s output philosophy matches the benchmark’s hidden assumptions.

Instance 1

In this example, Nanonets-OCR2 correctly predicts the watermark on the right side of the page image, but the ground-truth annotation penalizes the model for predicting it.

{
"pdf": "headers_footers/ef5e1f5960b9f865c8257f9ce4ff152a13a2559c_page_26.pdf",
"page": 1,
"id": "ef5e1f5960b9f865c8257f9ce4ff152a13a2559c_page_26.pdf_manual_01",
"type": "absent",
"text": "Document t\u00e9l\u00e9charg\u00e9 depuis www.cairn.info - Universit\u00e9 de Marne-la-Vall\u00e9e - - 193.50.159.70 - 20/03/2014 09h07. \u00a9 S.A.C.", "case_sensitive": false, "max_diffs": 3, "checked": "verified", "first_n": null, "last_n": null, "url": ""}

Type absent means that the text should not be present in the prediction data.
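To make the consequence concrete, here is a hedged sketch of how an absent-type check could behave. It is not the benchmark's actual implementation. It assumes that `max_diffs` is the number of character edits tolerated when searching for `text` in the prediction and that `case_sensitive` controls lowercasing. The helper names and the page strings are hypothetical placeholders.

```python
def min_substring_distance(needle: str, haystack: str) -> int:
    """Smallest edit distance between needle and any substring of haystack."""
    prev = [0] * (len(haystack) + 1)  # a match may start anywhere at no cost
    for i, nc in enumerate(needle, 1):
        cur = [i]
        for j, hc in enumerate(haystack, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (nc != hc)))
        prev = cur
    return min(prev)  # a match may end anywhere


def absent_test_passes(record: dict, prediction: str) -> bool:
    """Assumed semantics: pass only if the text is NOT found within max_diffs edits."""
    text = record["text"]
    if not record.get("case_sensitive", True):
        text, prediction = text.lower(), prediction.lower()
    return min_substring_distance(text, prediction) > record.get("max_diffs", 0)


record = {"type": "absent", "text": "www.aa.org", "max_diffs": 0, "case_sensitive": False}

# Placeholder strings standing in for two model outputs on the same page.
body_only = "Body text of the page."
full_page = body_only + "\nAlcoholics Anonymous\u00ae www.aa.org"

print(absent_test_passes(record, body_only))  # True
print(absent_test_passes(record, full_page))  # False
```

Under these assumptions, the output that transcribes everything visible on the page fails the test, while the output that drops the footer passes.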

Instance 2

The benchmark also does not count text that is present in the document footer.

For example, in this document, Alcoholics Anonymous® and www.aa.org should not be present according to the ground truth, which is incorrect.

{
	"pdf": "headers_footers/3754542bf828b42b268defe21db8526945928834_page_4.pdf",
	"page": 1,
	"id": "3754542bf828b42b268defe21db8526945928834_page_4_header_00",
	"type": "absent",
	"max_diffs": 0,
	"checked": "verified",
	"url": "",
	"text": "Alcoholics Anonymous\u00ae",
	"case_sensitive": false, "first_n": null, "last_n": null
}
{
	"pdf": "headers_footers/3754542bf828b42b268defe21db8526945928834_page_4.pdf",
	"page": 1,
	"id": "3754542bf828b42b268defe21db8526945928834_page_4_header_01",
	"type": "absent",
	"max_diffs": 0,
	"checked": "verified",
	"url": "",
	"text": "www.aa.org",
	"case_sensitive": false, "first_n": null, "last_n": null
}

Benchmark B: OmniDocBench

OmniDocBench exhibits similar issues, but more broadly:

Equation evaluation relies on strict LaTeX string equivalence.
Semantically identical equations fail due to macro, spacing, or symbol variations.
Numerous ground-truth annotation errors were observed (missing tokens, malformed math, incorrect spacing).
Unicode normalization and spacing differences systematically reduce scores.
Prediction selection heuristics can fail even when the correct answer is fully present.

In many cases, low scores reflect benchmark artifacts, not model errors.

Instance 1

In this example, Nanonets-OCR2-3B predicts 5 g silica + 3 g Al$_2$O$_3$, but the ground truth expects $ 5g \mathrm{ s i l i c a}+3g \mathrm{ A l}_{2} \mathrm{O_{3}} $. This flags the model prediction as incorrect, even though both are correct.

The full ground truth and prediction for this test case are shared below:

'pred': 'The collected eluant was concentrated by rotary evaporator to 1 ml. The extracts were finally passed through a final column filled with 5 g silica + 3 g Al$_2$O$_3$ to remove any co-extractive compounds that may cause instrumental interferences durin the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n-hexane, concentrated to 1 ml to which 1 μg/ml of internal standard was added.'
'gt': 'The collected eluant was concentrated by rotary evaporator to 1 ml .The extracts were finally passed through a final column filled with $ 5g \mathrm{ s i l i c a}+3g \mathrm{ A l}_{2} \mathrm{O_{3}} $ to remove any co-extractive compounds that may cause instrumental interferences during the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n - hexane, concentrated to 1 ml to which $ \mu\mathrm{g / ml} $ of internal standard was added.'
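The formula mismatch in this test case can be reproduced with a short sketch. The `loose_key` helper is a hypothetical, deliberately crude heuristic written for this one example: it drops math delimiters, a few font macros, braces, and whitespace. It is not a general LaTeX equivalence check and would mangle constructs such as fractions.

```python
import re

# Font-style macros that change appearance but not meaning (illustrative subset).
FONT_MACROS = re.compile(r"\\(?:mathrm|text|mathit|mathbf)(?![A-Za-z])")


def loose_key(expr: str) -> str:
    """Strip font macros, then math delimiters, braces, and all whitespace."""
    expr = FONT_MACROS.sub("", expr)
    return re.sub(r"[${}\s]", "", expr)


pred = r"5 g silica + 3 g Al$_2$O$_3$"
gt = r"$ 5g \mathrm{ s i l i c a}+3g \mathrm{ A l}_{2} \mathrm{O_{3}} $"

print(pred == gt)                        # False
print(loose_key(pred))                   # 5gsilica+3gAl_2O_3
print(loose_key(pred) == loose_key(gt))  # True
```

Once representation choices are set aside, the two strings carry the same content, which is why a strict string comparison here measures formatting and not recognition quality.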

Instance 2

We found significantly more incorrect annotations in OmniDocBench.

In the ground-truth annotation, the 1 is missing from 1 μg/ml in the final sentence.

'text': 'The collected eluant was concentrated by rotary evaporator to 1 ml .The extracts were finally passed through a final column filled with $ 5g \mathrm{ s i l i c a}+3g \mathrm{ A l}_{2} \mathrm{O_{3}} $ to remove any co-extractive compounds that may cause instrumental interferences during the analysis. The extract was eluted with 120 ml of DCM:n-hexane (1:1), the first 18 ml of eluent was discarded and the rest were collected, which contains the analytes of interest. The extract was exchanged into n - hexane, concentrated to 1 ml to which $ \mu\mathrm{g / ml} $ of internal standard was added.'
