
Support ActualText in marked content for text extraction #587

Merged
yob merged 1 commit into yob:main from 55728:support-actual-text-in-marked-content
Apr 6, 2026


55728 (Contributor) commented Apr 3, 2026

Problem

PDFs generated by some tools (e.g. Chrome/Gotenberg) with OpenType font feature
substitutions such as font-feature-settings: 'tnum' (tabular numerals) produce
ToUnicode CMaps that map substituted glyph IDs to U+0000. This causes
page.text to return \u0000 instead of the actual characters.

The content stream does contain the correct text via ActualText attributes in
marked content spans (BDC/EMC operators), but pdf-reader currently ignores
marked content.

Root Cause Analysis

When tnum is enabled, the OpenType GSUB table substitutes proportional numeral
glyphs (e.g. GID 132 = '0') with tabular numeral glyphs (e.g. GID 142). The PDF
generator writes a ToUnicode CMap that maps these new GIDs to U+0000:

6 beginbfchar
<008E> <0000>   ← GID 142, should be '0'
<008F> <0000>   ← GID 143, should be '1'
...

The substituted GIDs also don't appear in the embedded font's cmap or post tables,
leaving no font-level fallback path.
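
To illustrate the failure mode described above, here is a minimal sketch that scans the bfchar block of a ToUnicode CMap for source codes mapped to U+0000. The helper name null_mapped_codes and the sample CMap text are hypothetical and written only for illustration; this is not code from pdf-reader or from this PR.

```ruby
# Hypothetical helper: return the source codes (as integers) that a
# bfchar block maps to U+0000.
def null_mapped_codes(cmap_text)
  codes = []
  in_block = false
  cmap_text.each_line do |line|
    if line =~ /beginbfchar/
      in_block = true
    elsif line =~ /endbfchar/
      in_block = false
    elsif in_block && (m = line.match(/<(\h+)>\s*<(\h+)>/))
      # m[1] is the source code, m[2] the Unicode destination (both hex)
      codes << m[1].to_i(16) if m[2].to_i(16).zero?
    end
  end
  codes
end

# Illustrative input: two null mappings like those quoted above, plus one
# ordinary mapping that should be ignored.
sample = <<-CMAP
3 beginbfchar
<008E> <0000>
<008F> <0000>
<0041> <0041>
endbfchar
CMAP

p null_mapped_codes(sample) # prints [142, 143]
```

A check like this only detects the problem; it cannot recover the intended characters, which is why another source of text is needed.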

However, the content stream wraps each glyph in a marked content span with the
correct text:

/Span<</ActualText (2) >> BDC
<0090> Tj
EMC

Solution

Implement support for ActualText in marked content spans per
PDF 32000-1:2008 §14.9.4:

  • Handle begin_marked_content_with_pl (BDC) and end_marked_content (EMC)
    callbacks in PageTextReceiver
  • When ActualText is present in a BDC span, use it instead of the
    ToUnicode-derived text
  • Handle both PDFDocEncoding and UTF-16BE (BOM-prefixed) ActualText values
  • Within a span, emit ActualText once and skip duplicate glyph text output
    while still advancing glyph displacement for correct positioning
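
The span tracking in the bullets above can be sketched roughly as follows. The class name ActualTextTracker and its current_actual_text method are hypothetical. The callback names begin_marked_content_with_pl and end_marked_content come from the description above, while begin_marked_content (for the BMC operator) is assumed here to be the matching callback name. This is a standalone illustration, not the PR's change to PageTextReceiver.

```ruby
# Hypothetical standalone sketch: track open marked content sequences and
# remember any ActualText they carry.
class ActualTextTracker
  def initialize
    @spans = [] # one entry per open marked content sequence (nil if no ActualText)
  end

  # BDC: properties is either an inline dictionary or a name that refers to
  # the page's resource dictionary. Only inline dictionaries are handled here.
  def begin_marked_content_with_pl(_tag, properties)
    text = properties.is_a?(Hash) ? properties[:ActualText] : nil
    @spans.push(text)
  end

  # BMC: no properties, but it still pairs with EMC, so keep the stack balanced.
  def begin_marked_content(_tag)
    @spans.push(nil)
  end

  # EMC: close the innermost open sequence.
  def end_marked_content
    @spans.pop
  end

  # ActualText of the innermost enclosing span that has one, or nil.
  def current_actual_text
    @spans.compact.last
  end
end

tracker = ActualTextTracker.new
tracker.begin_marked_content_with_pl(:Span, { :ActualText => "2" })
p tracker.current_actual_text # prints "2"
tracker.end_marked_content
p tracker.current_actual_text # prints nil
```

A stack is used because marked content sequences can nest; a single instance variable would lose the outer span's text when an inner sequence closes.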

Test

Added font_features_on.pdf and an integration spec that verifies
"21.09.2023" is correctly extracted from a PDF using tnum font features.

Note

While investigating this issue, I also looked into the font-level fallback paths (cmap reverse lookup, post table glyph names). In the attached PDF, the substituted GIDs (142–145, 151) are absent from both the embedded font's cmap and post tables, confirming that ActualText is the only reliable source of truth for GSUB-substituted glyphs.

This investigation built on my experience with CFF font subsetting in ttfunk (prawnpdf/ttfunk#116), where I fixed related issues with .notdef handling and charset encoding in CID-keyed fonts.

Fixes #523

yob (Owner) commented Apr 4, 2026

Thanks, this is an interesting contribution. CI shows that the test suite is green on ruby 2.6+ and jruby 9.3+, but this project aims to work on ruby 2.1+ and jruby 9.1.

Could you adjust the code to work on the older versions as well?

55728 (Contributor Author) commented Apr 4, 2026

Good catch, thanks! The issue was an endless range ([2..]) which requires Ruby 2.6+. Fixed in the latest push — replaced with [2..-1] for Ruby 2.1+ compatibility.
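
For readers unfamiliar with the difference, a small sketch of the portable form follows. The helper name strip_bom is hypothetical and is not taken from the PR.

```ruby
# Hypothetical helper: drop a two-byte prefix using a range that parses on
# Ruby 2.1+. The endless form bytes[2..] is a syntax error before Ruby 2.6.
def strip_bom(bytes)
  bytes[2..-1]
end

data = "\xFE\xFF".b + "ab"
p strip_bom(data) # prints "ab"
```

Both forms return the same substring on Ruby 2.6 and later, so the older spelling costs nothing.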

yob (Owner) commented Apr 4, 2026 via email

55728 (Contributor Author) commented Apr 4, 2026

Yes, exactly — the implementation is general purpose. When ActualText is present in a BDC/EMC marked content span, it takes precedence over the ToUnicode-derived text regardless of the cause of the bad CMap.

Per PDF 32000-1:2008 §14.9.4, ActualText is defined as the authoritative replacement text for the enclosed content, so preferring it over ToUnicode is the spec-correct behavior.

The PR description focused on the tnum case because that was the specific scenario reported in #523, but the fix applies to any PDF where ActualText is used to provide correct text for glyphs that can't be resolved through font-level mappings.

yob (Owner) left a comment

I'm keen to accept this; supporting marked content would be a nice improvement for users extracting text.

Please rename the spec file to and adjust the integration spec to make it clear this is about support for recognised marked content, not "font features"

Comment on lines +174 to +178
if text.b.start_with?("\xFE\xFF".b)
  utf8_chars = text.b[2..-1].force_encoding('UTF-16BE').encode('UTF-8')
else
  utf8_chars = text.dup.force_encoding('UTF-8')
end
yob (Owner) commented:

If you rebase this branch on main, there is an EncodingUtils.string_to_utf8 method that can do this encoding dance for you.
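
As a rough illustration of the "encoding dance" in question, the sketch below decodes a PDF text string that is either UTF-16BE with a byte order mark or PDFDocEncoding. The function name decode_actual_text is hypothetical. This is not EncodingUtils.string_to_utf8, whose signature is not shown in this thread. It also approximates PDFDocEncoding as ISO-8859-1, which is an assumption that holds for ASCII but not for every byte value.

```ruby
UTF16BE_BOM = "\xFE\xFF".b

# Hypothetical helper: convert a raw PDF text string to UTF-8.
def decode_actual_text(raw)
  bytes = raw.b # binary copy, so the caller's string is left untouched
  if bytes.start_with?(UTF16BE_BOM)
    # BOM-prefixed: the remainder is UTF-16BE
    bytes[2..-1].force_encoding("UTF-16BE").encode("UTF-8")
  else
    # Assumption: treat PDFDocEncoding as ISO-8859-1. The two differ for
    # some bytes (for example 0x18-0x1F and 0x80-0x9F), so a complete
    # implementation would need a proper mapping table.
    bytes.force_encoding("ISO-8859-1").encode("UTF-8")
  end
end

p decode_actual_text("\xFE\xFF\x00\x32".b) # prints "2"
p decode_actual_text("2")                  # prints "2"
```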

55728 force-pushed the support-actual-text-in-marked-content branch 2 times, most recently from c1d987d to 3343b1b, April 5, 2026 01:51

55728 (Contributor Author) commented Apr 5, 2026

Rebased on main and refactored to use the new EncodingUtils.string_to_utf8. Thanks for extracting that; it's a much cleaner approach. Also renamed the spec file and context to focus on marked content rather than font features, and squashed into a single commit.

55728 requested a review from yob, April 5, 2026 01:52

Comment on lines +176 to +180
else
  glyph_width = @state.current_font.glyph_width_in_text_space(glyph_code)
  @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
  next
end
yob (Owner) commented:

Can you describe the purpose of the else block? What happens if we remove it and let the control flow continue as normal, just with the value of utf8_chars now replaced by the marked content ActualText?

55728 (Contributor Author) commented:

The else block handles subsequent glyphs within the same BDC/EMC span after ActualText has already been emitted. Since ActualText replaces all glyph text in the span, these glyphs should contribute only their displacement (positioning), not text.

You're right that removing the else block and letting control flow continue works — setting utf8_chars to an empty string achieves the same result more simply. Updated.
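
A minimal sketch of that simplified control flow, under stated assumptions: glyphs are modelled as hypothetical [text, width] pairs, and extract_span is an illustrative name rather than a method from pdf-reader. The first glyph in a span emits the ActualText, later glyphs emit an empty string, and every glyph still advances the position.

```ruby
# Hypothetical model of one marked content span.
# glyphs: array of [tounicode_text, width] pairs
# actual_text: the span's ActualText, or nil when the span has none
def extract_span(glyphs, actual_text)
  parts = []
  x = 0.0
  emitted = false
  glyphs.each do |glyph_text, width|
    chars = glyph_text
    if actual_text
      # Emit ActualText once; later glyphs contribute an empty string.
      chars = emitted ? "" : actual_text
      emitted = true
    end
    parts << chars
    x += width # displacement advances for every glyph, emitted or not
  end
  [parts.join, x]
end

p extract_span([["\u0000", 0.5], ["\u0000", 0.5]], "21") # prints ["21", 1.0]
```

Because the empty string flows through the same path as any other glyph text, no early exit from the loop is needed and the displacement logic lives in one place.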

When ActualText is present in a BDC/EMC marked content span, use it instead of ToUnicode-derived text for text extraction.
This handles PDFs where font feature substitutions (e.g. tnum) produce incorrect ToUnicode CMap entries.

Uses EncodingUtils.string_to_utf8 for ActualText encoding conversion per PDF 32000-1:2008 §14.9.4.

Fixes yob#523
55728 force-pushed the support-actual-text-in-marked-content branch from 3343b1b to c36bb9a, April 5, 2026 06:18
55728 requested a review from yob, April 5, 2026 06:18
yob merged commit 78285f5 into yob:main, Apr 6, 2026
1 check passed
yob (Owner) commented Apr 6, 2026

I think there's a reasonable chance we'll hit some edge cases with this, but the test suite is green so I'll merge and if edge cases emerge they can be fixed.

55728 deleted the support-actual-text-in-marked-content branch, April 6, 2026 11:32

55728 (Contributor Author) commented Apr 6, 2026

Thanks for merging! Happy to follow up on any edge cases that come up; feel free to ping me.


Development

Successfully merging this pull request may close these issues.

Numerals read as \u0000 when using font feature settings
