Fix XGrammar bitmask initialization and add null check for gen_config in generate method #4349
Pull request overview
This pull request fixes a bug where grammar constraints from structured output requests incorrectly persist and apply to subsequent non-structured output requests when ModelRequest instances are reused. The fix introduces a clearGrammar() method to explicitly reset the grammar state after each inference session completes.
Changes:
- Added a `clearGrammar()` C++ method to the `ModelRequest` class for resetting grammar state
- Added a Python binding `clear_grammar()` to expose the cleanup functionality
- Called `clear_grammar()` in the `finally` block of `async_stream_infer()` to ensure cleanup
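The try/finally cleanup pattern described above can be sketched as follows. This is a minimal Python stand-in, not the actual TurboMind implementation: the real `ModelRequest` is a C++ class exposed through pybind11, and every name here except `clear_grammar` is an assumption for illustration.

```python
class ModelRequest:
    """Stand-in for the TurboMind ModelRequest binding (illustrative only)."""

    def __init__(self):
        self.grammar = None  # mirrors the grammar_ shared_ptr in the C++ class

    def set_grammar(self, grammar):
        self.grammar = grammar

    def clear_grammar(self):
        # Reset grammar state so a reused request does not carry constraints
        # over into the next, possibly unconstrained, inference session.
        self.grammar = None


def run_inference(request, grammar=None):
    """Sketch of the async_stream_infer() cleanup: clear in a finally block."""
    if grammar is not None:
        request.set_grammar(grammar)
    try:
        return "generated text"  # placeholder for the actual decode loop
    finally:
        request.clear_grammar()  # runs even if decoding raises or is cancelled
```

Putting the reset in `finally` rather than after the decode loop is what makes the fix robust: a cancelled or failed session still leaves the request in a clean state.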
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/turbomind/engine/model_request.h` | Declares the new `clearGrammar()` method |
| `src/turbomind/engine/model_request.cc` | Implements `clearGrammar()` to reset the `grammar_` shared_ptr |
| `src/turbomind/python/bind.cpp` | Adds Python binding for `clear_grammar` with appropriate GIL handling |
| `lmdeploy/turbomind/turbomind.py` | Calls `clear_grammar()` in the `finally` block after inference completes |
Motivation
This PR addresses two critical issues in LMDeploy's guided generation and batch inference functionality:
XGrammar Bitmask Initialization Bug: The guided decoding mechanism initializes token bitmasks to control which tokens are allowed during constrained generation. Previously, the bitmask was initialized with zeros, which incorrectly blocked all tokens instead of allowing all tokens. This caused generation failures when certain sequences in a batch didn't have grammar constraints applied.
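To see why the initial fill value matters, here is a toy model of the token bitmask (not XGrammar's actual API): bit *i* of the mask marks token *i* as allowed, packed 32 tokens per `int32` word. A zero-filled mask therefore blocks every token, while `-1` (all bits set in two's complement) allows every token by default:

```python
def make_bitmask(vocab_size, fill):
    """One 32-bit word per 32 tokens; `fill` is the per-word init value."""
    n_words = (vocab_size + 31) // 32
    return [fill & 0xFFFFFFFF for _ in range(n_words)]


def is_allowed(mask, token_id):
    """A token is allowed iff its bit is set in the mask."""
    return (mask[token_id // 32] >> (token_id % 32)) & 1 == 1


vocab = 100
blocked = make_bitmask(vocab, 0)   # buggy init: every bit is 0, all tokens blocked
allowed = make_bitmask(vocab, -1)  # fixed init: -1 sets all 32 bits per word

assert not any(is_allowed(blocked, t) for t in range(vocab))
assert all(is_allowed(allowed, t) for t in range(vocab))
```

With the `-1` default, sequences in a batch that have no grammar attached sample from the full vocabulary, and only sequences with an active grammar get their masks narrowed.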
Silent Failure on None gen_config: The `batch_infer` method accepts a list of `GenerationConfig` objects, but when `None` was passed for individual items, the code would silently fail or hang because of an unguarded attribute access on `gen_config.max_new_tokens`.

Modifications
Guided Decoding Bitmask Fix (`src/turbomind/generation/guided_decoding.cc`): Changed the bitmask initialization value from `0` to `-1` (all bits set to 1 in two's complement representation for `int32_t`), so all tokens are allowed by default before grammar constraints are applied.

Null GenerationConfig Guard (`lmdeploy/serve/core/async_engine.py`): Fall back to a default `GenerationConfig()` when `gen_config` is `None`, so per-item `None` entries in a batch no longer cause silent failures.

Regression Test (`tests/test_lmdeploy/test_grammar.py`): Added `test_mix_guided_matrix` to verify that batch inference works correctly when mixing guided and unguided generation in the same batch.