FEATURE: Optional strip_trailing_invisibles config flag #39
Merged
Outlook/Word HTML exports frequently leave zero-width soft-break hints
and nbsp spacers at the end of paragraph content. Without trimming,
these survive into the rendered Markdown as invisible artifacts —
visually identical in the final rendered HTML, but confusing for
anyone editing the stored Markdown later or trying to diff/grep it.
Add a `strip_trailing_invisibles` config flag (default false) that,
when enabled, has `cleanup_markdown` rstrip the invisible set at every
line end. Members: NBSP (U+00A0), ZWSP (U+200B), ZWNJ (U+200C),
ZWJ (U+200D), WJ (U+2060), ZWNBSP/BOM (U+FEFF). Excludes ASCII space
and tab so Markdown's two-trailing-spaces hard-break syntax keeps
working.
```ruby
Markbridge.configure { |c| c.strip_trailing_invisibles = true }
```
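As a rough illustration of the rule described above, the per-line trim can be sketched with a single anchored regex. The constant and method names here are hypothetical and are not taken from Markbridge's source; only the character set comes from this commit message.

```ruby
# Hypothetical names; Markbridge's real constant and method may differ.
# Members: NBSP, ZWSP, ZWNJ, ZWJ, WJ, ZWNBSP/BOM.
TRAILING_INVISIBLES = /[\u00A0\u200B\u200C\u200D\u2060\uFEFF]+$/

# Remove invisible characters at the end of each line. ASCII space and
# tab are not in the set, so two-trailing-spaces hard breaks survive.
def strip_trailing_invisibles(markdown)
  markdown.gsub(TRAILING_INVISIBLES, "")
end

dirty = "Hello\u00A0\u200B\nhard break  \nmid\u200Bword\n"
clean = strip_trailing_invisibles(dirty)
puts clean.inspect
```

In Ruby, `$` anchors at every line end, so one `gsub` covers the whole document. Invisibles in the middle of a line are left alone.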
Default-off matches the existing `escape_hard_line_breaks` precedent
("preserve more semantics by default") and means no behavior change
for current consumers. High-throughput migration paths pay nothing
extra; consumers who care about clean stored output opt in for the
~4-5% slowdown.
The Outlook spacer paragraph case (`<p>&nbsp;</p>` between two
paragraphs) drops out indirectly when enabled: rstripping the nbsp
leaves an empty line, which the existing `gsub(/\n{3,}/, "\n\n")`
collapse plus the trailing `.strip` then swallow.
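A minimal sketch of why the spacer paragraph disappears, assuming it reaches cleanup as a line holding only an nbsp. `cleanup_sketch` and its keyword argument are hypothetical stand-ins for `cleanup_markdown`; only the `gsub(/\n{3,}/, "\n\n")` collapse and the trailing `.strip` are taken from the description above.

```ruby
# Hypothetical stand-in for cleanup_markdown; not the library's code.
INVISIBLES_AT_EOL = /[\u00A0\u200B\u200C\u200D\u2060\uFEFF]+$/

def cleanup_sketch(markdown, strip_invisibles: false)
  out = markdown
  # Optional pass: turns an nbsp-only line into an empty line.
  out = out.gsub(INVISIBLES_AT_EOL, "") if strip_invisibles
  # Existing passes: collapse runs of 3+ newlines, then trim the ends.
  out.gsub(/\n{3,}/, "\n\n").strip
end

# Two paragraphs separated by an nbsp-only spacer line.
spacer = "First\n\n\u00A0\n\nSecond\n"
with_flag    = cleanup_sketch(spacer, strip_invisibles: true)
without_flag = cleanup_sketch(spacer)
```

With the flag on, the spacer line becomes empty, the four consecutive newlines collapse to two, and the paragraphs end up adjacent. With the flag off, the nbsp line keeps the newline runs short enough that nothing collapses.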
Benchmark, mixed block + inline HTML (ruby 3.4.8 +YJIT, 5 runs avg):
- main / flag off: ~3.27k i/s
- flag on: ~3.13k i/s (-4.5%)
This is within the typical ±5-10% run-to-run variance on this benchmark.
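For anyone who wants a rough local measurement, the sketch below times only the extra `gsub` pass on a synthetic string using a monotonic clock. It is a hypothetical micro-benchmark, not the harness that produced the figures above, so its numbers will not match them.

```ruby
# Hypothetical micro-benchmark; the figures above come from a full
# Markbridge conversion, not from this isolated gsub.
EOL_INVISIBLES = /[\u00A0\u200B\u200C\u200D\u2060\uFEFF]+$/
SAMPLE = ("Paragraph text\u00A0\u200B\n\n" * 200).freeze

# Count how many times the block runs per second over a short window.
def iterations_per_second(window = 0.1)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  started = clock.call
  count = 0
  while clock.call - started < window
    yield
    count += 1
  end
  count / (clock.call - started)
end

flag_off = iterations_per_second { SAMPLE.gsub(/\n{3,}/, "\n\n").strip }
flag_on  = iterations_per_second do
  SAMPLE.gsub(EOL_INVISIBLES, "").gsub(/\n{3,}/, "\n\n").strip
end
delta = (flag_on - flag_off) / flag_off * 100
puts format("flag off: %.0f i/s, flag on: %.0f i/s (%+.1f%%)", flag_off, flag_on, delta)
```

Because the sample isolates the cleanup step, the relative cost it reports will be much larger than the end-to-end cost, where parsing and rendering dominate.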
Summary
Outlook/Word HTML exports frequently leave zero-width soft-break hints and nbsp spacers at the end of paragraph content. Without trimming, these survive into the rendered Markdown as invisible artifacts — visually identical in the final rendered HTML, but confusing for anyone editing the stored Markdown later or trying to diff/grep it.
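Adds a `strip_trailing_invisibles` config flag (default false, opt-in) that has `cleanup_markdown` rstrip the invisibles set at every line end when enabled. Set members:
- NBSP (U+00A0)
- ZWSP (U+200B)
- ZWNJ (U+200C)
- ZWJ (U+200D)
- WJ (U+2060)
- ZWNBSP/BOM (U+FEFF)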
Design choices
- Default-off matches the `escape_hard_line_breaks: false` precedent ("preserve more semantics by default"). No behaviour change for current consumers. High-throughput migration paths pay nothing extra; consumers who care about clean stored output opt in.
- ASCII space and tab are excluded so Markdown's two-trailing-spaces hard-break syntax keeps working; `cleanup_markdown` is not the place to second-guess author intent.
- `cleanup_markdown` runs at the end of every conversion (`html_to_markdown`, `bbcode_to_markdown`, …), so the rule applies regardless of source language. Custom block tags registered by downstream consumers inherit the behaviour with no opt-in required.
The Outlook spacer paragraph case (`<p>&nbsp;</p>` between two paragraphs) drops out indirectly when enabled: rstripping the nbsp leaves an empty line, which the existing `gsub(/\n{3,}/, "\n\n")` collapse plus the trailing `.strip` then swallow.
Performance
Mixed block + inline HTML benchmark (ruby 3.4.8 +YJIT, 5 runs avg):
- main / flag off: ~3.27k i/s
- flag on: ~3.13k i/s (-4.5%)
Flag on costs ~4-5% from the extra `gsub` on the final rendered output. This is within the typical ±5-10% run-to-run variance on this benchmark.
Test plan
- `bundle exec rspec`: 3092 examples, 0 failures
- `bin/lint`: no offenses
- `bin/mutant run` on `Markbridge.cleanup_markdown` + `Markbridge::Configuration*`: 100% mutation coverage