Support ActualText in marked content for text extraction #587
Conversation
|
Thanks, this is an interesting contribution. CI shows that the test suite is green on ruby 2.6+ and jruby 9.3+, but this project aims to work on ruby 2.1+ and jruby 9.1. Could you adjust the code to work on the older versions as well? |
|
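Good catch, thanks! The issue was an endless range ([2..]), which requires Ruby 2.6+. Fixed in the latest push: replaced with [2..-1] for Ruby 2.1+ compatibility. |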
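For anyone reading along, here is a minimal sketch of the compatibility difference. The sample bytes are made up for illustration and are not taken from the PR's test PDF. The explicit -1 endpoint slices to the end of the string on old and new Rubies alike, whereas the endless form does not parse before Ruby 2.6.

```ruby
# Illustrative bytes only: a UTF-16BE byte order mark followed by "21".
bom_prefixed = "\xFE\xFF\x00\x32\x00\x31".b

# Portable form: drops the two BOM bytes and keeps everything after them.
payload = bom_prefixed[2..-1]

# Ruby 2.6+ only; left as a comment so this file still parses on older Rubies:
#   payload = bom_prefixed[2..]
```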
|
It’s curious that the PR description focuses on a very specific issue that
can cause a font in a source PDF to have bad ToUnicode CMaps.
There are probably thousands of causes that could lead to bad CMaps; is this
fix relevant to all of them (when marked text is available as an
alternative)?
Does that make this a general-purpose solution, meaning pdf-reader will
prefer marked text for text extraction when available?
…On Sat, 4 Apr 2026 at 23:37, Kenta Ishizaki ***@***.***> wrote:
Good catch, thanks! The issue was an endless range ([2..]) which requires
Ruby 2.6+. Fixed in the latest push — replaced with [2..-1] for Ruby 2.1+
compatibility.
|
|
Yes, exactly — the implementation is general purpose. When ActualText is present in a BDC/EMC marked content span, it takes precedence over the ToUnicode-derived text regardless of the cause of the bad CMap. Per PDF 32000-1:2008 §14.9.4, ActualText is defined as the authoritative replacement text for the enclosed content, so preferring it over ToUnicode is the spec-correct behavior. The PR description focused on the tnum case because that was the specific scenario reported in #523, but the fix applies to any PDF where ActualText is used to provide correct text for glyphs that can't be resolved through font-level mappings. |
yob
left a comment
I'm keen to accept this, supporting marked content would be a nice improvement for users extracting text.
Please rename the spec file to and adjust the integration spec to make it clear this is about support for recognised marked content, not "font features"
lib/pdf/reader/page_text_receiver.rb
Outdated
if text.b.start_with?("\xFE\xFF".b)
  utf8_chars = text.b[2..-1].force_encoding('UTF-16BE').encode('UTF-8')
else
  utf8_chars = text.dup.force_encoding('UTF-8')
end
If you rebase this branch on main, there is an EncodingUtils.string_to_utf8 method that can do this encoding dance for you.
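As a rough illustration of the "encoding dance" in question, the sketch below wraps the logic from the reviewed snippet in a stand-alone method. The name actual_text_to_utf8 is hypothetical; this is not pdf-reader's EncodingUtils.string_to_utf8, whose exact behaviour is not shown in this thread. It assumes, as the snippet does, that a string without a UTF-16BE byte order mark can be treated as UTF-8.

```ruby
# Hypothetical helper for illustration only; NOT pdf-reader's
# EncodingUtils.string_to_utf8.
def actual_text_to_utf8(text)
  bytes = text.b # binary copy, so the BOM check ignores the declared encoding
  if bytes.start_with?("\xFE\xFF".b)
    # UTF-16BE with byte order mark: drop the 2 BOM bytes, then transcode.
    bytes[2..-1].force_encoding("UTF-16BE").encode("UTF-8")
  else
    # Assumption carried over from the reviewed snippet: treat as UTF-8.
    text.dup.force_encoding("UTF-8")
  end
end

date_digits = actual_text_to_utf8("\xFE\xFF\x00\x32\x00\x31".b)
plain_ascii = actual_text_to_utf8("abc")
```

Working on a binary copy keeps the byte order mark comparison independent of whatever encoding the parser attached to the string.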
c1d987d to
3343b1b
Compare
|
Rebased on main and refactored to use the new EncodingUtils.string_to_utf8 — thanks for extracting that, it's |
else
  glyph_width = @state.current_font.glyph_width_in_text_space(glyph_code)
  @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
  next
end
Can you describe the purpose of the else block? What happens if we remove it and let the control flow continue as normal, just with the value of utf8_chars now replaced by the marked content ActualText?
The else block handles subsequent glyphs within the same BDC/EMC span after ActualText has already been emitted. Since ActualText replaces all glyph text in the span, these glyphs should contribute only their displacement (positioning), not text.
You're right that removing the else block and letting control flow continue works — setting utf8_chars to an empty string achieves the same result more simply. Updated.
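To make the simplified control flow concrete, here is a small stand-alone sketch. Glyph and extract_run are hypothetical names invented for this example, not the PageTextReceiver code. The first glyph in a span carries the whole ActualText, later glyphs contribute an empty string, and every glyph still advances the position.

```ruby
# Hypothetical stand-ins for illustration; not pdf-reader internals.
Glyph = Struct.new(:text, :width)

# Returns [[x_position, text], ...] plus the final x position.
# actual_text is nil when the glyphs are not inside an ActualText span.
def extract_run(glyphs, actual_text = nil)
  emitted = false
  x = 0
  runs = []
  glyphs.each do |glyph|
    chars =
      if actual_text.nil?
        glyph.text        # no span: use the ToUnicode-derived text
      elsif emitted
        ""                # span text already emitted: displacement only
      else
        emitted = true
        actual_text       # first glyph in the span carries the ActualText
      end
    runs << [x, chars] unless chars.empty?
    x += glyph.width      # displacement always advances
  end
  [runs, x]
end

bad_glyphs = [Glyph.new("\u0000", 5), Glyph.new("\u0000", 5)]
with_span, end_x = extract_run(bad_glyphs, "21")
without_span, _ = extract_run(bad_glyphs)
```

No separate branch with an early next is needed, because an empty string simply flows through the same path as ordinary text.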
When ActualText is present in a BDC/EMC marked content span, use it instead of ToUnicode-derived text for text extraction. This handles PDFs where font feature substitutions (e.g. tnum) produce incorrect ToUnicode CMap entries. Uses EncodingUtils.string_to_utf8 for ActualText encoding conversion per PDF 32000-1:2008 §14.9.4. Fixes yob#523
3343b1b to
c36bb9a
Compare
|
I think there's a reasonable chance we'll hit some edge cases with this, but the test suite is green so I'll merge and if edge cases emerge they can be fixed. |
|
Thanks for merging! Happy to follow up on any edge cases that come |
Problem
PDFs generated by some tools (e.g. Chrome/Gotenberg) with OpenType font feature
substitutions such as font-feature-settings: 'tnum' (tabular numerals) produce
ToUnicode CMaps that map substituted glyph IDs to U+0000. This causes page.text
to return \u0000 instead of the actual characters.
The content stream does contain the correct text via ActualText attributes in
marked content spans (BDC/EMC operators), but pdf-reader currently ignores
marked content.
Root Cause Analysis
When tnum is enabled, the OpenType GSUB table substitutes proportional numeral
glyphs (e.g. GID 132 = '0') with tabular numeral glyphs (e.g. GID 142). The PDF
generator writes a ToUnicode CMap that maps these new GIDs to U+0000:
The substituted GIDs also don't appear in the embedded font's cmap or post tables,
leaving no font-level fallback path.
However, the content stream wraps each glyph in a marked content span with the
correct text:
Solution
Implement support for ActualText in marked content spans per
PDF 32000-1:2008 §14.9.4:
- Add begin_marked_content_with_pl (BDC) and end_marked_content (EMC)
  callbacks in PageTextReceiver
- When ActualText is present in a BDC span, use it instead of the
  ToUnicode-derived text while still advancing glyph displacement for
  correct positioning
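The two points above can be sketched with a stand-alone receiver that does not load pdf-reader. The class name, the span stack, and the simplified show_text signature are assumptions made for illustration, as is the properties hash keyed by :ActualText. The actual change lives in PageTextReceiver and also handles glyph positioning, which is left out here.

```ruby
# Illustrative receiver only; the class and its internals are hypothetical.
class ActualTextSketchReceiver
  attr_reader :text

  def initialize
    @text = "".dup
    @spans = [] # one entry per open marked content span
  end

  # BMC: a span without a property list still needs a matching EMC.
  def begin_marked_content(_tag)
    @spans.push({ actual: nil, emitted: false })
  end

  # BDC: remember the span's ActualText, if it has one.
  def begin_marked_content_with_pl(_tag, properties)
    actual = properties.is_a?(Hash) ? properties[:ActualText] : nil
    @spans.push({ actual: actual, emitted: false })
  end

  # EMC: close the innermost span.
  def end_marked_content
    @spans.pop
  end

  # Simplified: receives text already decoded via the font's ToUnicode map.
  def show_text(decoded)
    span = @spans.reverse.find { |s| s[:actual] }
    if span.nil?
      @text << decoded        # outside any ActualText span
    elsif !span[:emitted]
      @text << span[:actual]  # emit the replacement text once per span
      span[:emitted] = true
    end
  end
end

receiver = ActualTextSketchReceiver.new
receiver.begin_marked_content_with_pl(:Span, { ActualText: "2" })
receiver.show_text("\u0000") # what a bad ToUnicode CMap would yield
receiver.end_marked_content
receiver.show_text(".")
```

A stack is used so that nested spans close in the right order; whether nesting matters for the PDFs in question is not established in this thread.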
Test
Added font_features_on.pdf and an integration spec that verifies "21.09.2023" is correctly extracted from a PDF using tnum font features.
Note
While investigating this issue, I also looked into the font-level fallback paths (cmap reverse lookup, post table glyph names). In the attached PDF, the substituted GIDs (142–145, 151) are absent from both the embedded font's cmap and post tables, confirming that ActualText is the only reliable source of truth for GSUB-substituted glyphs.
This investigation built on my experience with CFF font subsetting in ttfunk (prawnpdf/ttfunk#116), where I fixed related issues with .notdef handling and charset encoding in CID-keyed fonts.
Fixes #523