FEATURE: Optional strip_trailing_invisibles config flag #39

Merged: gschlager merged 1 commit into main from drop-invisibles-at-block-boundaries on May 11, 2026
@gschlager commented May 11, 2026

Summary

Outlook/Word HTML exports frequently leave zero-width soft-break hints and nbsp spacers at the end of paragraph content. Without trimming, these survive into the generated Markdown as invisible artifacts: they make no visible difference in the final rendered HTML, but they confuse anyone who later edits the stored Markdown or tries to diff/grep it.

Adds a `strip_trailing_invisibles` config flag (default `false`, opt-in). When enabled, `cleanup_markdown` strips any trailing run of the characters listed below from the end of every line.

Markbridge.configure { |c| c.strip_trailing_invisibles = true }

Set members:

| Code point | Name |
|---|---|
| U+00A0 | NO-BREAK SPACE |
| U+200B | ZERO WIDTH SPACE |
| U+200C | ZERO WIDTH NON-JOINER |
| U+200D | ZERO WIDTH JOINER |
| U+2060 | WORD JOINER |
| U+FEFF | ZERO WIDTH NO-BREAK SPACE / BOM |
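
For readers who want to see the rule in isolation, here is a minimal sketch of a per-line strip over that set. The constant and method names are hypothetical and chosen for illustration; the actual code inside `cleanup_markdown` may be structured differently.

```ruby
# Hypothetical names; not necessarily how Markbridge implements it.
# Character class covers exactly the six code points in the table above.
# In Ruby, `$` anchors at the end of every line (and at end of string),
# so one gsub handles all lines of the document.
TRAILING_INVISIBLES = /[\u00A0\u200B\u200C\u200D\u2060\uFEFF]+$/

def strip_trailing_invisibles(markdown)
  markdown.gsub(TRAILING_INVISIBLES, "")
end

strip_trailing_invisibles("Hello\u200B\u00A0\nWorld\uFEFF")
# => "Hello\nWorld"
```

ASCII space and tab are deliberately absent from the character class, so a line ending in two plain spaces is left alone.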

Design choices

  • Default off. Matches the existing escape_hard_line_breaks: false precedent ("preserve more semantics by default"). No behaviour change for current consumers. High-throughput migration paths pay nothing extra; consumers who care about clean stored output opt in.
  • ASCII space and tab excluded. Markdown's "two trailing spaces → hard line break" rule requires ASCII U+0020s at line ends; touching those would break real semantics.
  • Leading invisibles untouched. A leading nbsp may be deliberate indentation; cleanup_markdown is not the place to second-guess author intent.
  • Mid-content invisibles untouched. ZWSP between characters in long unbreakable strings or in CJK is meaningful as a soft-break hint.
  • Universal when enabled. cleanup_markdown runs at the end of every conversion (html_to_markdown, bbcode_to_markdown, …), so the rule applies regardless of source language. Custom block tags registered by downstream consumers inherit the behaviour with no opt-in required.
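
The choices above can be exercised with a small stand-in. `SketchConfig` and `cleanup_sketch` are hypothetical and only mimic the flag gating described here; they are not Markbridge's real configuration object or cleanup method.

```ruby
# Hypothetical stand-ins for the config object and the cleanup step.
SketchConfig = Struct.new(:strip_trailing_invisibles, keyword_init: true)
INVISIBLE_RUN_AT_EOL = /[\u00A0\u200B\u200C\u200D\u2060\uFEFF]+$/

def cleanup_sketch(markdown, config)
  # Flag off: return the input untouched, so the default path does no extra work.
  return markdown unless config.strip_trailing_invisibles

  markdown.gsub(INVISIBLE_RUN_AT_EOL, "")
end

off = SketchConfig.new(strip_trailing_invisibles: false)
on  = SketchConfig.new(strip_trailing_invisibles: true)

cleanup_sketch("text\u200B", off)           # unchanged: default keeps the trailing ZWSP
cleanup_sketch("text\u200B", on)            # => "text"
cleanup_sketch("line one  \nline two", on)  # unchanged: ASCII hard-break spaces kept
cleanup_sketch("\u00A0indented", on)        # unchanged: leading nbsp kept
cleanup_sketch("long\u200Bword", on)        # unchanged: mid-content ZWSP kept
```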

The Outlook spacer paragraph case (<p>&nbsp;</p> between two paragraphs) drops out indirectly when enabled: rstripping the nbsp leaves an empty line, which the existing gsub(/\n{3,}/, "\n\n") collapse plus the trailing .strip swallow.
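
A step-by-step sketch of that collapse, assuming the converter emits the spacer paragraph as a line containing only U+00A0 (the input string is an illustrative assumption, not captured converter output):

```ruby
# Assumed intermediate Markdown for: <p>First</p><p>&nbsp;</p><p>Second</p>
markdown = "First paragraph\n\n\u00A0\n\nSecond paragraph\n"

# Step 1: strip trailing invisibles per line; the spacer line becomes empty.
step1 = markdown.gsub(/[\u00A0\u200B\u200C\u200D\u2060\uFEFF]+$/, "")
# => "First paragraph\n\n\n\nSecond paragraph\n"

# Step 2: the existing blank-line collapse reduces 3+ newlines to 2.
step2 = step1.gsub(/\n{3,}/, "\n\n")
# => "First paragraph\n\nSecond paragraph\n"

# Step 3: the trailing strip removes leftover whitespace at the edges.
result = step2.strip
# => "First paragraph\n\nSecond paragraph"
```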

Performance

Mixed block + inline HTML benchmark (ruby 3.4.8 +YJIT, 5 runs avg):

| Configuration | i/s |
|---|---|
| main / flag off | ~3.27k |
| flag on | ~3.13k |

Enabling the flag costs ~4-5% throughput, from the extra gsub over the final rendered output. That is within the typical ±5-10% run-to-run variance on this benchmark.

Test plan

  • bundle exec rspec — 3092 examples, 0 failures
  • bin/lint — no offenses
  • bin/mutant run on Markbridge.cleanup_markdown + Markbridge::Configuration* — 100% mutation coverage
  • Default behaviour: trailing ZWSP preserved (byte-level verified)
  • Flag on: real-world Outlook spacer + trailing ZWSP cases drop out
  • Flag on: trailing ASCII spaces preserved (Markdown hard-break syntax intact)
  • Flag on: leading invisibles preserved, mid-content invisibles preserved

@gschlager changed the title from "FIX: Strip trailing invisible characters per line in cleanup_markdown" to "FEATURE: Optional strip_trailing_invisibles config flag" on May 11, 2026
@gschlager force-pushed the drop-invisibles-at-block-boundaries branch 2 times, most recently from 8b2c59e to 47dad4c, on May 11, 2026 16:12
Outlook/Word HTML exports frequently leave zero-width soft-break hints
and nbsp spacers at the end of paragraph content. Without trimming,
these survive into the rendered Markdown as invisible artifacts —
visually identical in the final rendered HTML, but confusing for
anyone editing the stored Markdown later or trying to diff/grep it.

Add a `strip_trailing_invisibles` config flag (default false) that,
when enabled, has `cleanup_markdown` rstrip the invisible set at every
line end. Members: NBSP (U+00A0), ZWSP (U+200B), ZWNJ (U+200C),
ZWJ (U+200D), WJ (U+2060), ZWNBSP/BOM (U+FEFF). Excludes ASCII space
and tab so Markdown's two-trailing-spaces hard-break syntax keeps
working.

```ruby
Markbridge.configure { |c| c.strip_trailing_invisibles = true }
```

Default-off matches the existing `escape_hard_line_breaks` precedent
("preserve more semantics by default") and means no behavior change
for current consumers. High-throughput migration paths pay nothing
extra; consumers who care about clean stored output opt in for the
~4-5% slowdown.

The Outlook spacer paragraph case (`<p>&nbsp;</p>` between two
paragraphs) drops out indirectly when enabled: rstripping the nbsp
leaves an empty line, which the existing `gsub(/\n{3,}/, "\n\n")`
collapse plus the trailing `.strip` swallow.

Benchmark, mixed block + inline HTML (ruby 3.4.8 +YJIT, 5 runs avg):
- main / flag off:  ~3.27k i/s
- flag on:          ~3.13k i/s  (-4.5%)

Within typical run-to-run variance on this benchmark.
@gschlager force-pushed the drop-invisibles-at-block-boundaries branch from 47dad4c to 0fcd1bf on May 11, 2026 16:14
@gschlager merged commit 3571212 into main on May 11, 2026
7 checks passed
@gschlager deleted the drop-invisibles-at-block-boundaries branch on May 11, 2026 16:15