Support ActualText in marked content for text extraction by 55728 · Pull Request #587 · yob/pdf-reader

55728 · 2026-04-03T17:27:21Z

Problem

PDFs generated by some tools (e.g. Chrome/Gotenberg) with OpenType font feature
substitutions such as font-feature-settings: 'tnum' (tabular numerals) produce
ToUnicode CMaps that map substituted glyph IDs to U+0000. This causes
page.text to return \u0000 instead of the actual characters.

The content stream does contain the correct text via ActualText attributes in
marked content spans (BDC/EMC operators), but pdf-reader currently ignores
marked content.

Root Cause Analysis

When tnum is enabled, the OpenType GSUB table substitutes proportional numeral
glyphs (e.g. GID 132 = '0') with tabular numeral glyphs (e.g. GID 142). The PDF
generator writes a ToUnicode CMap that maps these new GIDs to U+0000:

6 beginbfchar
<008E> <0000>   ← GID 142, should be '0'
<008F> <0000>   ← GID 143, should be '1'
...

The substituted GIDs also don't appear in the embedded font's cmap or post tables,
leaving no font-level fallback path.
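
To make the failure mode concrete, here is a minimal sketch of a ToUnicode-style lookup using only the two bfchar entries quoted above. The `TO_UNICODE` table and `decode_glyphs` helper are hypothetical names for illustration and are not part of pdf-reader.

```ruby
# Hypothetical, simplified stand-in for a ToUnicode CMap lookup.
# Only the two bfchar entries quoted in the description are included.
TO_UNICODE = {
  0x008E => 0x0000, # GID 142, should have been U+0030 ('0')
  0x008F => 0x0000  # GID 143, should have been U+0031 ('1')
}.freeze

# Map each glyph code to a UTF-8 string through the table.
# Codes missing from the table fall back to U+FFFD, an arbitrary
# choice made for this sketch.
def decode_glyphs(codes, table = TO_UNICODE)
  codes.map { |code| [table.fetch(code, 0xFFFD)].pack("U") }.join
end

extracted = decode_glyphs([0x008E, 0x008F])
```

With this table, both glyphs decode to NUL characters, which matches the \u0000 output described in the Problem section.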

However, the content stream wraps each glyph in a marked content span with the
correct text:

/Span<</ActualText (2) >> BDC
<0090> Tj
EMC

Solution

Implement support for ActualText in marked content spans per
PDF 32000-1:2008 §14.9.4:

- Handle begin_marked_content_with_pl (BDC) and end_marked_content (EMC) callbacks in PageTextReceiver
- When ActualText is present in a BDC span, use it instead of the ToUnicode-derived text
- Handle both PDFDocEncoding and UTF-16BE (BOM-prefixed) ActualText values
- Within a span, emit ActualText once and skip duplicate glyph text output while still advancing glyph displacement for correct positioning
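
The callback flow above can be sketched roughly as follows. `ActualTextSketch` is a hypothetical, stripped-down class, not PageTextReceiver itself. It assumes the BDC property list arrives as a Hash keyed by `:ActualText`, and that `show_text` receives the already-decoded (ToUnicode-derived) string. Glyph positioning is left out entirely.

```ruby
# Hypothetical receiver illustrating the span bookkeeping only.
class ActualTextSketch
  attr_reader :output

  def initialize
    @output = "".dup
    @spans = [] # one entry per open marked-content span
  end

  # BDC: remember the span's ActualText (nil when absent or when the
  # properties operand is not an inline dictionary).
  def begin_marked_content_with_pl(_tag, properties)
    text = properties.is_a?(Hash) ? properties[:ActualText] : nil
    @spans.push(text: text, emitted: false)
  end

  # EMC: close the innermost span.
  def end_marked_content
    @spans.pop
  end

  # Tj: prefer the innermost enclosing ActualText, emitted once per span.
  def show_text(decoded)
    span = @spans.reverse.find { |s| s[:text] }
    if span.nil?
      @output << decoded
    elsif !span[:emitted]
      @output << span[:text]
      span[:emitted] = true
    end
  end
end

receiver = ActualTextSketch.new
receiver.begin_marked_content_with_pl(:Span, { ActualText: "2" })
receiver.show_text("\u0000") # bad ToUnicode result, replaced by "2"
receiver.end_marked_content
receiver.show_text("x")      # outside any span, passed through
```

Keeping a stack rather than a single flag is a choice made here so that nested spans do not lose the outer ActualText; whether the merged code handles nesting this way is not stated in this thread.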

Test

Added font_features_on.pdf and an integration spec that verifies
"21.09.2023" is correctly extracted from a PDF using tnum font features.

Note

While investigating this issue, I also looked into the font-level fallback paths (cmap reverse lookup, post table glyph names). In the attached PDF, the substituted GIDs (142–145, 151) are absent from both the embedded font's cmap and post tables, confirming that ActualText is the only reliable source of truth for GSUB-substituted glyphs.

This investigation built on my experience with CFF font subsetting in ttfunk (prawnpdf/ttfunk#116), where I fixed related issues with .notdef handling and charset encoding in CID-keyed fonts.

Fixes #523

yob · 2026-04-04T11:58:07Z

Thanks, this is an interesting contribution. CI shows that the test suite is green on ruby 2.6+ and jruby 9.3+, but this project aims to work on ruby 2.1+ and jruby 9.1.

Could you adjust the code to work on the older versions as well?

55728 · 2026-04-04T12:37:10Z

Good catch, thanks! The issue was an endless range ([2..]) which requires Ruby 2.6+. Fixed in the latest push — replaced with [2..-1] for Ruby 2.1+ compatibility.
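
For reference, a small sketch of the slice in question, using a made-up BOM-prefixed byte string as input:

```ruby
# UTF-16BE BOM (FE FF) followed by the two bytes for "2".
bytes = "\xFE\xFF\x00\x32".b

# Explicit end index: parses on Ruby 2.1+.
without_bom = bytes[2..-1]

# The endless form bytes[2..] selects the same bytes, but it is a
# syntax error before Ruby 2.6, so it is only mentioned in this comment.
```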

yob · 2026-04-04T12:42:47Z

It's curious that the PR description focuses on a very specific issue that can cause a font in a source PDF to have bad ToUnicode CMaps. There are probably thousands of causes that could lead to bad CMaps; is this fix relevant to all of them (when marked text is available as an alternative)? Does that make this a general-purpose solution, meaning pdf-reader will prefer marked text for text extraction when available?

55728 · 2026-04-04T12:47:46Z

Yes, exactly — the implementation is general purpose. When ActualText is present in a BDC/EMC marked content span, it takes precedence over the ToUnicode-derived text regardless of the cause of the bad CMap.

Per PDF 32000-1:2008 §14.9.4, ActualText is defined as the authoritative replacement text for the enclosed content, so preferring it over ToUnicode is the spec-correct behavior.

The PR description focused on the tnum case because that was the specific scenario reported in #523, but the fix applies to any PDF where ActualText is used to provide correct text for glyphs that can't be resolved through font-level mappings.

yob

I'm keen to accept this, supporting marked content would be a nice improvement for users extracting text.

Please rename the spec file to and adjust the integration spec to make it clear this is about support for recognised marked content, not "font features"

yob · 2026-04-05T00:09:50Z

lib/pdf/reader/page_text_receiver.rb

+              if text.b.start_with?("\xFE\xFF".b)
+                utf8_chars = text.b[2..-1].force_encoding('UTF-16BE').encode('UTF-8')
+              else
+                utf8_chars = text.dup.force_encoding('UTF-8')
+              end


If you rebase this branch on main, there is an EncodingUtils.string_to_utf8 method that can do this encoding dance for you.
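
For readers without the diff context, the branch above can be read as a standalone helper along these lines. `actual_text_to_utf8` is a hypothetical name; it is not EncodingUtils.string_to_utf8, whose signature is not shown in this thread. As an assumption of this sketch, the non-BOM branch treats the bytes as ISO-8859-1, which is exact for printable ASCII but only approximates PDFDocEncoding for other bytes.

```ruby
UTF16BE_BOM = "\xFE\xFF".b.freeze

# Convert a raw PDF text string to UTF-8.
def actual_text_to_utf8(raw)
  bytes = raw.b # binary copy, safe to re-tag even if raw is frozen
  if bytes.start_with?(UTF16BE_BOM)
    # Strip the two BOM bytes, then transcode from UTF-16BE.
    bytes[2..-1].force_encoding("UTF-16BE").encode("UTF-8")
  else
    # ASSUMPTION: ISO-8859-1 stands in for PDFDocEncoding here.
    bytes.force_encoding("ISO-8859-1").encode("UTF-8")
  end
end

from_utf16 = actual_text_to_utf8("\xFE\xFF\x00\x32")
from_ascii = actual_text_to_utf8("21.09.2023")
```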

55728 · 2026-04-05T01:52:40Z

Rebased on main and refactored to use the new EncodingUtils.string_to_utf8. Thanks for extracting that, it's a much cleaner approach. Also renamed the spec file and context to focus on marked content rather than font features, and squashed into a single commit.

yob · 2026-04-05T05:43:05Z

lib/pdf/reader/page_text_receiver.rb

+            else
+              glyph_width = @state.current_font.glyph_width_in_text_space(glyph_code)
+              @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
+              next
+            end


Can you describe the purpose of the else block? What happens if we remove it and let the control flow continue as normal, just with the value of utf8_chars now replaced by the marked content ActualText?

55728 · 2026-04-05

The else block handles subsequent glyphs within the same BDC/EMC span after ActualText has already been emitted. Since ActualText replaces all glyph text in the span, these glyphs should contribute only their displacement (positioning), not text.

You're right that removing the else block and letting control flow continue works — setting utf8_chars to an empty string achieves the same result more simply. Updated.
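
A rough sketch of the simplified control flow, under stated assumptions: `Glyph` and `render_span` are hypothetical names, widths are in arbitrary text-space units, and the real receiver tracks far more state than a single x offset.

```ruby
# Hypothetical glyph record: ToUnicode-derived text plus advance width.
Glyph = Struct.new(:text, :width)

# Walk the glyphs of one span. When actual_text is given, the first
# glyph carries it and later glyphs carry an empty string, but every
# glyph still advances the x position.
def render_span(glyphs, actual_text = nil)
  x = 0.0
  runs = []
  emitted = false
  glyphs.each do |glyph|
    text = glyph.text
    if actual_text
      text = emitted ? "" : actual_text
      emitted = true
    end
    runs << [x, text] unless text.empty?
    x += glyph.width # displacement is applied unconditionally
  end
  [runs, x]
end

glyphs = [Glyph.new("\u0000", 0.5), Glyph.new("\u0000", 0.5)]
with_actual_text = render_span(glyphs, "21")
without_actual_text = render_span(glyphs)
```

Both calls end at the same x position; only the emitted text differs, which is why no separate early-exit branch is needed.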

When ActualText is present in a BDC/EMC marked content span, use it instead of ToUnicode-derived text for text extraction. This handles PDFs where font feature substitutions (e.g. tnum) produce incorrect ToUnicode CMap entries. Uses EncodingUtils.string_to_utf8 for ActualText encoding conversion per PDF 32000-1:2008 §14.9.4. Fixes yob#523

yob · 2026-04-06T01:28:10Z

I think there's a reasonable chance we'll hit some edge cases with this, but the test suite is green so I'll merge and if edge cases emerge they can be fixed.

55728 · 2026-04-06T11:33:07Z

Thanks for merging! Happy to follow up on any edge cases that come
up — feel free to ping me.

yob reviewed Apr 4, 2026

yob reviewed Apr 5, 2026

55728 force-pushed the support-actual-text-in-marked-content branch 2 times, most recently from c1d987d to 3343b1b Compare April 5, 2026 01:51

55728 requested a review from yob April 5, 2026 01:52

yob reviewed Apr 5, 2026

55728 force-pushed the support-actual-text-in-marked-content branch from 3343b1b to c36bb9a Compare April 5, 2026 06:18

55728 requested a review from yob April 5, 2026 06:18

yob merged commit 78285f5 into yob:main Apr 6, 2026
1 check passed

55728 deleted the support-actual-text-in-marked-content branch April 6, 2026 11:32
