fix(ascend):update README#673
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds UTF-8-safe streaming output for Qwen Ascend: the tokenizer initializes with a UTF-8 locale, examples include a takeValidUtf8Prefix utility, the QA generation path gains a prefill warmup and buffered token decoding that prints only complete UTF-8 chunks, and README news/table entries are updated. ChangesUTF-8 Safe Streaming and Tokenizer Configuration
Sequence DiagramsequenceDiagram
participant Gen as Token Generator
participant Buf as UTF-8 Buffer
participant Val as UTF-8 Validator
participant Out as Output
loop Generation Loop
Gen->>Buf: Decode token → append to pending_text
Buf->>Val: Check for valid UTF-8 prefix
Val->>Buf: Return valid prefix + remainder
Buf->>Out: Print valid UTF-8 chunk
Buf->>Buf: Keep remainder for next iteration
end
Out->>Out: Print final newline
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 3❌ Failed checks (3 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/qwen_ascend/main.cpp`:
- Around line 19-35: The helper takeValidUtf8Prefix currently returns empty when
an invalid byte is at the buffer start and never consumes it; update
takeValidUtf8Prefix to detect irrecoverable leading bytes (e.g., a start octet
that would require >4 continuation bytes or otherwise cannot form a valid UTF‑8
sequence) and consume/discard one invalid byte so progress continues, while
still returning empty when nothing valid precedes it; also ensure the final
flush path uses takeValidUtf8Prefix on pending_text (instead of printing
pending_text directly) so only validated UTF‑8 is emitted. Reference: function
takeValidUtf8Prefix and the final tail-flush that prints pending_text.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e278d7c-fbd2-4317-b44b-90eb3e3d5fd7
📒 Files selected for processing (3)
README.mdexamples/qwen_ascend/main.cppmllm/models/qwen_ascend/tokenization_qwen_ascend.hpp
| std::string takeValidUtf8Prefix(std::string& pending_text) { | ||
| auto invalid = utf8::find_invalid(pending_text.begin(), pending_text.end()); | ||
| if (invalid == pending_text.begin()) { | ||
| return {}; | ||
| } | ||
|
|
||
| if (invalid == pending_text.end()) { | ||
| std::string ready_text; | ||
| ready_text.swap(pending_text); | ||
| return ready_text; | ||
| } | ||
|
|
||
| auto ready_bytes = static_cast<size_t>(std::distance(pending_text.begin(), invalid)); | ||
| auto ready_text = pending_text.substr(0, ready_bytes); | ||
| pending_text.erase(0, ready_bytes); | ||
| return ready_text; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
In utf8cpp, what does utf8::find_invalid(begin, end) return for (1) an incomplete multi-byte sequence at buffer start and (2) an irrecoverably invalid leading byte?
💡 Result:
For utf8::find_invalid(begin, end), the function returns an iterator pointing to the first invalid octet; if none are found it returns end. It does this by scanning from start and calling utf8::internal::validate_next(result, end) and immediately returning result when the returned error code is not UTF8_OK. [1] (1) Incomplete multi-byte sequence at buffer start: It returns begin (i.e., the iterator to the first byte of the incomplete sequence). validate_next is invoked starting from result==start, and any non-UTF8_OK (including INCOMPLETE_SEQUENCE) causes find_invalid to return the current iterator (the start of that invalid sequence). [1][2] (2) Irrecoverably invalid leading byte: It returns an iterator pointing to that invalid leading byte (the first byte whose validate_next call fails with a non-UTF8_OK error such as INVALID_LEAD). Since find_invalid returns the iterator value at which validate_next fails, it is the position of the bad leading octet. [1][2]
Citations:
- 1: https://github.com/nemtrif/utfcpp/blob/master/source/utf8/core.h
- 2: https://github.com/ledger/utfcpp/blob/master/source/utf8/checked.h
🏁 Script executed:
# Check if the file exists and get its size
wc -l examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 101
🏁 Script executed:
# View the takeValidUtf8Prefix helper (lines 19-35)
sed -n '19,35p' examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 604
🏁 Script executed:
# View the tail flush logic (lines 250-260 for context around 252-254)
sed -n '250,260p' examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 451
🏁 Script executed:
# Check how pending_text is used in the loop - search for the streaming context
sed -n '240,270p' examples/qwen_ascend/main.cppRepository: UbiquitousLearning/mllm
Length of output: 981
Handle irrecoverable leading bytes to avoid stalled output and invalid tail flush.
When utf8::find_invalid detects an invalid byte at the buffer start, the helper returns empty without consuming it. An irrecoverably invalid leading byte (e.g., a malformed UTF-8 start octet) will therefore block progress indefinitely—the next iteration finds the same byte, returns empty again, and loops without advancing. Additionally, the final flush at line 252-254 directly prints pending_text without validation, allowing invalid UTF-8 to reach output even though the streaming loop uses the validated takeValidUtf8Prefix.
Discard irrecoverable leading bytes that exceed the maximum UTF-8 sequence length (4 bytes), and apply takeValidUtf8Prefix to the tail flush to ensure output is valid UTF-8.
Proposed fix
std::string takeValidUtf8Prefix(std::string& pending_text) {
+ if (pending_text.empty()) {
+ return {};
+ }
auto invalid = utf8::find_invalid(pending_text.begin(), pending_text.end());
if (invalid == pending_text.begin()) {
- return {};
+ // Keep short prefixes that may become valid with future bytes.
+ // UTF-8 max code point width is 4 bytes; longer invalid-at-begin likely means malformed lead.
+ if (pending_text.size() <= 4) {
+ return {};
+ }
+ pending_text.erase(0, 1);
+ return {};
}
@@
- if (!pending_text.empty()) {
- fmt::print("{}", pending_text);
- }
+ auto tail = takeValidUtf8Prefix(pending_text);
+ if (!tail.empty()) {
+ fmt::print("{}", tail);
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/qwen_ascend/main.cpp` around lines 19 - 35, The helper
takeValidUtf8Prefix currently returns empty when an invalid byte is at the
buffer start and never consumes it; update takeValidUtf8Prefix to detect
irrecoverable leading bytes (e.g., a start octet that would require >4
continuation bytes or otherwise cannot form a valid UTF‑8 sequence) and
consume/discard one invalid byte so progress continues, while still returning
empty when nothing valid precedes it; also ensure the final flush path uses
takeValidUtf8Prefix on pending_text (instead of printing pending_text directly)
so only validated UTF‑8 is emitted. Reference: function takeValidUtf8Prefix and
the final tail-flush that prints pending_text.
Please check Guidelines for Contributing.
Summary by CodeRabbit
New Features
Improvements