fix(readable): install a StringDecoder when setEncoding is called#5086
fix(readable): install a StringDecoder when setEncoding is called#5086SAY-5 wants to merge 1 commit intonodejs:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5086 +/- ##
=======================================
Coverage 93.13% 93.13%
=======================================
Files 110 110
Lines 35816 35824 +8
=======================================
+ Hits 33356 33364 +8
Misses 2460 2460 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
mcollina
left a comment
There was a problem hiding this comment.
Can you check why was this done in this way? Also, why are we overriding setEncoding now? Shouldn't removing the override do the same?
|
Good point @mcollina — you're right that removing the override is the cleaner fix. |
Per @mcollina's review: the original BodyReadable.setEncoding override only wrote `_readableState.encoding`, which made the base Readable fall back to Buffer.prototype.toString(encoding) per chunk -- the exact failure mode that produced U+FFFD on multi-byte UTF-8 chunks straddling boundaries (nodejs#5002). node:stream's Readable.prototype.setEncoding already installs a StringDecoder and buffers partial byte sequences between chunks. Removing the override is the minimal fix and inherits the correct behaviour. Dropped the now-unused `node:string_decoder` import. Regression test test/readable-set-encoding-utf8.js still passes.
d881946 to
c7f1904
Compare
What kind of change does this PR introduce?
bug fix
What is the current behavior?
BodyReadable.setEncodingonly wrote_readableState.encodingand skipped the Node-standard step of installing aStringDecoder. With an encoding set but no decoder, theReadablemachinery falls through toBuffer.prototype.toString(encoding)on each individual chunk, which corrupts any multi-byte character (UTF-8 CJK, emoji, any 3+ byte sequence) that straddles a chunk boundary. The partial bytes become U+FFFD, and the next chunk's leading bytes do too.Fixes #5002.
What is the new behavior?
Match Node's
Readable.prototype.setEncoding: build aStringDecoderfrom the supplied encoding and wire it into_readableState. The decoder buffers unfinished byte sequences between chunks, so callers iteratingfor await (const chunk of body)withsetEncoding('utf8')get lossless text.Other information
Added
test/readable-set-encoding-utf8.js: pushes the 3-byte sequence for中split across two chunks and asserts noU+FFFDin the decoded output. Pre-fix the test fails with"\uFFFD\uFFFD"; post-fix the decoded string is exactly"中".