server: honour per-request reasoning_budget_tokens in chat completions #23116

Open
bernardladenthin wants to merge 4 commits into ggml-org:master from bernardladenthin:fix-reasoning-budget-tokens

Conversation

@bernardladenthin

Overview

I received a user request to investigate and fix an issue. It is done in java-llama.cpp but depends on a fix inside llama.cpp:

bernardladenthin/java-llama.cpp#140

Add support for per-request reasoning_budget_tokens parameter in chat completions API calls, allowing clients to override the server's default reasoning budget on a per-request basis.

The implementation reads the reasoning_budget_tokens value from the request body before the parameter copy loop, ensuring the per-request value takes precedence over the server configuration. This follows the existing pattern used for thinking_budget_tokens.

Changes

  • server-common.cpp: Modified oaicompat_chat_params_parse() to check for and apply per-request reasoning_budget_tokens override before the parameter copy loop
  • test-chat.cpp: Added test_reasoning_budget_tokens_per_request() test case to verify that per-request values override server defaults

Test Plan

Added unit test test_reasoning_budget_tokens_per_request() that:

  • Uses the Qwen3 template with reasoning markers
  • Verifies that a per-request reasoning_budget_tokens=0 overrides the server default of -1
  • Confirms the parameter is correctly passed through to the sampling layer

Additional information

This change enables fine-grained control over reasoning budget allocation per API request, which is useful for applications that need to dynamically adjust thinking/reasoning behavior based on specific use cases or user preferences.

Requirements

The reasoning-budget block in oaicompat_chat_params_parse read only the
server-level default (opt.reasoning_budget, typically -1) and the
Anthropic-style alias thinking_budget_tokens, but never the canonical
reasoning_budget_tokens field from the request body.  Because the key
was then written into llama_params before the generic body-copy loop
ran, the copy loop found the key already present and silently skipped
the caller-supplied value.  Any per-request override (e.g. 0 to
suppress thinking entirely) was therefore discarded.

Fix: read reasoning_budget_tokens from the request body first, so the
value that reaches the sampling layer is the one the caller intended.
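The skip-if-present behaviour described above can be illustrated with a small model. This is only a sketch, not the llama.cpp code: `Params` is a plain `std::map` standing in for the JSON request body and `llama_params`, and `copy_missing_keys`, `build_params_buggy` and `build_params_fixed` are hypothetical names introduced for illustration. It assumes, as the description states, that the generic copy loop leaves keys that are already present untouched.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the JSON request body and llama_params.
using Params = std::map<std::string, int>;

// Copies body keys into out, skipping keys that are already present
// (the behaviour the description attributes to the generic copy loop).
void copy_missing_keys(const Params & body, Params & out) {
    for (const auto & kv : body) {
        out.insert(kv); // std::map::insert does not overwrite an existing key
    }
}

// Before the fix: the server default is written first, so the copy loop
// can no longer apply the caller-supplied value.
Params build_params_buggy(const Params & body, int server_default) {
    Params out;
    out["reasoning_budget_tokens"] = server_default;
    copy_missing_keys(body, out);
    return out;
}

// After the fix: read the request body first, fall back to the server default.
Params build_params_fixed(const Params & body, int server_default) {
    Params out;
    auto it = body.find("reasoning_budget_tokens");
    out["reasoning_budget_tokens"] = (it != body.end()) ? it->second : server_default;
    copy_missing_keys(body, out);
    return out;
}
```

With a body that sets `reasoning_budget_tokens` to 0 and a server default of -1, the first variant keeps -1 while the second keeps the caller's 0.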

Add a unit test in test-chat.cpp that exercises this path via
oaicompat_chat_params_parse with a Qwen3 template (which the autoparser
detects as a thinking-capable model) and asserts the returned
llama_params carries reasoning_budget_tokens == 0.

bernardladenthin requested review from a team and pwilkin as code owners on May 15, 2026, 18:41

@ggml-gh-bot

ggml-gh-bot Bot commented May 15, 2026

Hi @bernardladenthin, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

github-actions Bot added the testing, examples, and server labels on May 15, 2026

@pwilkin pwilkin left a comment

Member

Yeah, this seems good to me and it's a minimal change. Could you also add support for runtime reasoning_budget_message while we're at it?

The reasoning-budget block in oaicompat_chat_params_parse wrote
reasoning_budget_message into llama_params straight from the server-level
default (opt.reasoning_budget_message) and never read the canonical
reasoning_budget_message field from the request body. Because the key
was written before the generic body-copy loop ran, that loop found the
key already present and silently skipped the caller-supplied value. Any
per-request override of the message injected before the end tag when the
budget is exhausted was therefore discarded, even though server-task.cpp
already reads reasoning_budget_message from that data.

This mirrors the reasoning_budget_tokens bug fixed in the previous commit.

Fix: read reasoning_budget_message from the request body first, falling
back to the server default, so the value that reaches the sampling layer
is the one the caller intended.

While here, collapse the adjacent reasoning_budget_tokens override to a
single json_value() call; json_value already falls back to the default on
a missing/null/wrong-type key, so the explicit body.contains() guard was
redundant. No behavioral change.
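The fallback semantics this relies on can be sketched as follows. This is a simplified model under stated assumptions, not the server's implementation: `Value`, `Body`, `int_or_default`, `str_or_default` and `resolve_budget` are hypothetical names, and the helpers only imitate what the commit message says about `json_value` (the default is returned for a missing, null, or wrong-type key).

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified JSON value: null, integer, or string.
struct Value {
    enum Kind { Null, Int, Str } kind;
    int         i;
    std::string s;
};
using Body = std::map<std::string, Value>;

inline Value make_null()                     { return Value{Value::Null, 0, ""}; }
inline Value make_int(int v)                 { return Value{Value::Int, v, ""}; }
inline Value make_str(const std::string & v) { return Value{Value::Str, 0, v}; }

// Returns the integer stored under key, or def if the key is missing,
// null, or not an integer. No separate "contains" check is needed.
int int_or_default(const Body & body, const std::string & key, int def) {
    auto it = body.find(key);
    if (it == body.end() || it->second.kind != Value::Int) {
        return def;
    }
    return it->second.i;
}

// Same fallback rule for string values.
std::string str_or_default(const Body & body, const std::string & key, const std::string & def) {
    auto it = body.find(key);
    if (it == body.end() || it->second.kind != Value::Str) {
        return def;
    }
    return it->second.s;
}

struct Budget {
    int         tokens;
    std::string message;
};

// Per-request values take precedence; server defaults are the fallback.
Budget resolve_budget(const Body & body, int default_tokens, const std::string & default_message) {
    return Budget{
        int_or_default(body, "reasoning_budget_tokens", default_tokens),
        str_or_default(body, "reasoning_budget_message", default_message)
    };
}
```

Because the helper already covers the missing-key case, wrapping it in an explicit presence check would not change the result, which is why the guard could be dropped without a behavioral change.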

Add a unit test in test-chat.cpp that exercises this path via
oaicompat_chat_params_parse with a Qwen3 template (which the autoparser
detects as a thinking-capable model) and asserts the returned
llama_params carries the per-request reasoning_budget_message rather than
the server default.
@bernardladenthin

Author

> Yeah, this seems good to me and it's a minimal change. Could you also add support for runtime reasoning_budget_message while we're at it?

Done. Please check and wait for CI. Thanks.

@bernardladenthin

Author

I'd appreciate your guidance here also: #22393
