Skip to content

fix(readable): install a StringDecoder when setEncoding is called#5086

Closed
SAY-5 wants to merge 1 commit intonodejs:mainfrom
SAY-5:fix/readable-set-encoding-utf8-5002
Closed

fix(readable): install a StringDecoder when setEncoding is called#5086
SAY-5 wants to merge 1 commit intonodejs:mainfrom
SAY-5:fix/readable-set-encoding-utf8-5002

Conversation

@SAY-5
Copy link
Copy Markdown
Contributor

@SAY-5 SAY-5 commented Apr 21, 2026

What kind of change does this PR introduce?

bug fix

What is the current behavior?

BodyReadable.setEncoding only wrote _readableState.encoding and skipped the Node-standard step of installing a StringDecoder. With an encoding set but no decoder, the Readable machinery falls through to Buffer.prototype.toString(encoding) on each individual chunk, which corrupts any multi-byte character (UTF-8 CJK, emoji, any 3+ byte sequence) that straddles a chunk boundary. The partial bytes become U+FFFD, and the next chunk's leading bytes do too.

Fixes #5002.

What is the new behavior?

Match Node's Readable.prototype.setEncoding: build a StringDecoder from the supplied encoding and wire it into _readableState. The decoder buffers unfinished byte sequences between chunks, so callers iterating for await (const chunk of body) with setEncoding('utf8') get lossless text.

Other information

Added test/readable-set-encoding-utf8.js: pushes the 3-byte sequence for split across two chunks and asserts no U+FFFD in the decoded output. Pre-fix the test fails with "\uFFFD\uFFFD"; post-fix the decoded string is exactly "中".

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.13%. Comparing base (2a6f9c7) to head (6db976f).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5086   +/-   ##
=======================================
  Coverage   93.13%   93.13%           
=======================================
  Files         110      110           
  Lines       35816    35824    +8     
=======================================
+ Hits        33356    33364    +8     
  Misses       2460     2460           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check why was this done in this way? Also, why are we overriding setEncoding now? Shouldn't removing the override do the same?

@SAY-5
Copy link
Copy Markdown
Contributor Author

SAY-5 commented Apr 22, 2026

Good point @mcollina — you're right that removing the override is the cleaner fix. node:stream's Readable.prototype.setEncoding already installs a StringDecoder and buffers partial byte sequences between chunks; the pre-existing override only set _readableState.encoding and nothing else, which is exactly what caused #5002. Force-pushed to just delete the override (and the now-unused node:string_decoder import). Regression test still passes.

Per @mcollina's review: the original BodyReadable.setEncoding override
only wrote `_readableState.encoding`, which made the base Readable
fall back to Buffer.prototype.toString(encoding) per chunk -- the
exact failure mode that produced U+FFFD on multi-byte UTF-8 chunks
straddling boundaries (nodejs#5002).

node:stream's Readable.prototype.setEncoding already installs a
StringDecoder and buffers partial byte sequences between chunks.
Removing the override is the minimal fix and inherits the correct
behaviour.

Dropped the now-unused `node:string_decoder` import. Regression test
test/readable-set-encoding-utf8.js still passes.
@SAY-5 SAY-5 closed this Apr 23, 2026
@SAY-5 SAY-5 force-pushed the fix/readable-set-encoding-utf8-5002 branch from d881946 to c7f1904 Compare April 23, 2026 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

setEncoding('utf8') on response body corrupts multi-byte UTF-8 characters at chunk boundaries

3 participants