server: honour per-request reasoning_budget_tokens in chat completions #23116
Open
bernardladenthin wants to merge 4 commits into
The reasoning-budget block in oaicompat_chat_params_parse read only the server-level default (opt.reasoning_budget, typically -1) and the Anthropic-style alias thinking_budget_tokens, but never the canonical reasoning_budget_tokens field from the request body. Because the key was then written into llama_params before the generic body-copy loop ran, the copy loop found the key already present and silently skipped the caller-supplied value. Any per-request override (e.g. 0 to suppress thinking entirely) was therefore discarded.

Fix: read reasoning_budget_tokens from the request body first, so the value that reaches the sampling layer is the one the caller intended.

Add a unit test in test-chat.cpp that exercises this path via oaicompat_chat_params_parse with a Qwen3 template (which the autoparser detects as a thinking-capable model) and asserts the returned llama_params carries reasoning_budget_tokens == 0.
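The precedence problem described in this commit message can be sketched in isolation. This is a minimal C++ model, not the actual llama.cpp code: Params, copy_missing, parse_buggy, and parse_fixed are hypothetical names, and a std::map of integers stands in for both the JSON request body and llama_params.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a JSON object that holds integer fields only.
using Params = std::map<std::string, int>;

static const std::string KEY = "reasoning_budget_tokens";

// Models the generic body-copy loop: a body key is copied only if the
// output does not already contain it.
static void copy_missing(const Params & body, Params & out) {
    for (const auto & kv : body) {
        if (out.find(kv.first) == out.end()) {
            out[kv.first] = kv.second;
        }
    }
}

// Old behaviour: the server default is written first, so the copy loop
// sees the key as present and skips the caller's value.
Params parse_buggy(const Params & body, int server_default) {
    Params out;
    out[KEY] = server_default;
    copy_missing(body, out);
    return out;
}

// Fixed behaviour: the request body is consulted first and the server
// default is used only when the key is absent.
Params parse_fixed(const Params & body, int server_default) {
    Params out;
    const auto it = body.find(KEY);
    out[KEY] = (it != body.end()) ? it->second : server_default;
    copy_missing(body, out);
    return out;
}
```

With a body that sets reasoning_budget_tokens to 0 and a server default of -1, parse_buggy ends up with -1 while parse_fixed ends up with 0. Other body keys are copied unchanged in both variants.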
Hi @bernardladenthin, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
pwilkin
approved these changes
Jun 10, 2026
pwilkin
left a comment
Member
Yeah, this seems good to me and it's a minimal change. Could you also add support for runtime reasoning_budget_message while we're at it?
The reasoning-budget block in oaicompat_chat_params_parse wrote reasoning_budget_message into llama_params straight from the server-level default (opt.reasoning_budget_message) and never read the canonical reasoning_budget_message field from the request body. Because the key was written before the generic body-copy loop ran, that loop found the key already present and silently skipped the caller-supplied value. Any per-request override of the message injected before the end tag when the budget is exhausted was therefore discarded, even though server-task.cpp already reads reasoning_budget_message from that data. This mirrors the reasoning_budget_tokens bug fixed in the previous commit.

Fix: read reasoning_budget_message from the request body first, falling back to the server default, so the value that reaches the sampling layer is the one the caller intended.

While here, collapse the adjacent reasoning_budget_tokens override to a single json_value() call; json_value already falls back to the default on a missing/null/wrong-type key, so the explicit body.contains() guard was redundant. No behavioral change.

Add a unit test in test-chat.cpp that exercises this path via oaicompat_chat_params_parse with a Qwen3 template (which the autoparser detects as a thinking-capable model) and asserts the returned llama_params carries the per-request reasoning_budget_message rather than the server default.
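The fallback behaviour that makes the body.contains() guard redundant can be illustrated with a toy helper. This is a sketch under stated assumptions, not the real json_value from llama.cpp: Value, Body, and value_or_default are hypothetical names, and a std::variant models a JSON value that is null, an integer, or a string.

```cpp
#include <map>
#include <string>
#include <variant>

// Toy JSON value (assumption): null, integer, or string.
using Value = std::variant<std::monostate, int, std::string>;
using Body  = std::map<std::string, Value>;

// Returns the value stored under `key` when it exists and has type T.
// Falls back to `fallback` when the key is missing, null, or holds a
// different type, which is the behaviour the commit message attributes
// to json_value().
template <typename T>
T value_or_default(const Body & body, const std::string & key, const T & fallback) {
    const auto it = body.find(key);
    if (it == body.end()) {
        return fallback;                 // missing key
    }
    if (const T * p = std::get_if<T>(&it->second)) {
        return *p;                       // present with the expected type
    }
    return fallback;                     // null or wrong type
}
```

Because the helper already covers the missing-key case, a single call per field is enough to apply "request body first, server default otherwise" for both reasoning_budget_tokens and reasoning_budget_message.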
Author
Done, please check and wait for CI. Thanks.
Author
I'd appreciate your guidance here as well: #22393
pwilkin
approved these changes
Jun 13, 2026
Overview
I got a user request to investigate and fix an issue. It's done in java-llama.cpp but depends on a fix inside llama.cpp:
bernardladenthin/java-llama.cpp#140
Add support for a per-request reasoning_budget_tokens parameter in chat completions API calls, allowing clients to override the server's default reasoning budget on a per-request basis.
The implementation reads the reasoning_budget_tokens value from the request body before the parameter copy loop, ensuring the per-request value takes precedence over the server configuration. This follows the existing pattern used for thinking_budget_tokens.
Changes
- oaicompat_chat_params_parse(): check for and apply the per-request reasoning_budget_tokens override before the parameter copy loop
- test_reasoning_budget_tokens_per_request(): test case to verify that per-request values override server defaults
Test Plan
Added unit test test_reasoning_budget_tokens_per_request() that checks reasoning_budget_tokens=0 overrides the server default of -1.
Additional information
This change enables fine-grained control over reasoning budget allocation per API request, which is useful for applications that need to dynamically adjust thinking/reasoning behavior based on specific use cases or user preferences.
Requirements