server: honour per-request reasoning_budget_tokens in chat completions by bernardladenthin · Pull Request #23116 · ggml-org/llama.cpp

bernardladenthin · 2026-05-15T18:41:44Z

Overview

I received a user request to investigate and fix an issue. The fix is done in java-llama.cpp, but it depends on a fix inside llama.cpp.

Add support for per-request reasoning_budget_tokens parameter in chat completions API calls, allowing clients to override the server's default reasoning budget on a per-request basis.

The implementation reads the reasoning_budget_tokens value from the request body before the parameter copy loop, ensuring the per-request value takes precedence over the server configuration. This follows the existing pattern used for thinking_budget_tokens.

Changes

server-common.cpp: Modified oaicompat_chat_params_parse() to check for and apply per-request reasoning_budget_tokens override before the parameter copy loop
test-chat.cpp: Added test_reasoning_budget_tokens_per_request() test case to verify that per-request values override server defaults

Test Plan

Added unit test test_reasoning_budget_tokens_per_request() that:

Uses the Qwen3 template with reasoning markers
Verifies that a per-request reasoning_budget_tokens=0 overrides the server default of -1
Confirms the parameter is correctly passed through to the sampling layer

Additional information

This change enables fine-grained control over reasoning budget allocation per API request, which is useful for applications that need to dynamically adjust thinking/reasoning behavior based on specific use cases or user preferences.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO (Claude Code was used)

The reasoning-budget block in oaicompat_chat_params_parse read only the server-level default (opt.reasoning_budget, typically -1) and the Anthropic-style alias thinking_budget_tokens, but never the canonical reasoning_budget_tokens field from the request body. Because the key was then written into llama_params before the generic body-copy loop ran, the copy loop found the key already present and silently skipped the caller-supplied value. Any per-request override (e.g. 0 to suppress thinking entirely) was therefore discarded.

Fix: read reasoning_budget_tokens from the request body first, so the value that reaches the sampling layer is the one the caller intended.

Add a unit test in test-chat.cpp that exercises this path via oaicompat_chat_params_parse with a Qwen3 template (which the autoparser detects as a thinking-capable model) and asserts the returned llama_params carries reasoning_budget_tokens == 0.
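To make the ordering problem concrete, here is a minimal sketch of why the caller's value was dropped. It is not the llama.cpp code: a std::map of integers stands in for the JSON request body, and the names params, copy_missing_keys, parse_buggy and parse_fixed are hypothetical, chosen only for illustration. The one behaviour it relies on is that the copy step does not overwrite keys that are already present, as the commit message describes.

```cpp
#include <map>
#include <string>

// Simplified stand-in for the request body and for llama_params.
// (Assumption: the real server uses JSON objects, not std::map.)
using params = std::map<std::string, int>;

// Models the generic body-copy loop: a field is copied only if the
// destination does not already contain that key.
void copy_missing_keys(const params & body, params & out) {
    for (const auto & kv : body) {
        out.emplace(kv.first, kv.second); // no-op when the key already exists
    }
}

// Old ordering: the server default is written first, so the copy loop
// finds the key present and skips the caller-supplied value.
params parse_buggy(const params & body, int server_default) {
    params out;
    out["reasoning_budget_tokens"] = server_default;
    copy_missing_keys(body, out);
    return out;
}

// Fixed ordering: consult the request body first and fall back to the
// server default only when the caller did not supply the field.
params parse_fixed(const params & body, int server_default) {
    params out;
    auto it = body.find("reasoning_budget_tokens");
    out["reasoning_budget_tokens"] = (it != body.end()) ? it->second : server_default;
    copy_missing_keys(body, out);
    return out;
}
```

With a body containing reasoning_budget_tokens = 0 and a server default of -1, parse_buggy in this sketch yields -1 while parse_fixed yields 0, which matches the behaviour the unit test asserts.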

ggml-gh-bot · 2026-05-15T18:46:17Z

Hi @bernardladenthin, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

pwilkin

Yeah, this seems good to me and it's a minimal change. Could you also add support for runtime reasoning_budget_message while we're at it?

The reasoning-budget block in oaicompat_chat_params_parse wrote reasoning_budget_message into llama_params straight from the server-level default (opt.reasoning_budget_message) and never read the canonical reasoning_budget_message field from the request body. Because the key was written before the generic body-copy loop ran, that loop found the key already present and silently skipped the caller-supplied value. Any per-request override of the message injected before the end tag when the budget is exhausted was therefore discarded, even though server-task.cpp already reads reasoning_budget_message from that data. This mirrors the reasoning_budget_tokens bug fixed in the previous commit.

Fix: read reasoning_budget_message from the request body first, falling back to the server default, so the value that reaches the sampling layer is the one the caller intended.

While here, collapse the adjacent reasoning_budget_tokens override to a single json_value() call; json_value already falls back to the default on a missing/null/wrong-type key, so the explicit body.contains() guard was redundant. No behavioral change.

Add a unit test in test-chat.cpp that exercises this path via oaicompat_chat_params_parse with a Qwen3 template (which the autoparser detects as a thinking-capable model) and asserts the returned llama_params carries the per-request reasoning_budget_message rather than the server default.
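The fallback behaviour attributed to json_value() above (use the default when the key is missing, null, or of the wrong type) can be sketched as follows. This is an illustrative analogue under stated assumptions, not the server's helper: json_like, body_like and value_or_default are hypothetical names, and a std::variant stands in for a real JSON value. It requires C++17.

```cpp
#include <map>
#include <string>
#include <variant>

// Simplified stand-in for a JSON value: null, integer, or string.
using json_like = std::variant<std::monostate, int, std::string>;
using body_like = std::map<std::string, json_like>;

// Hypothetical analogue of the described fallback: return the stored
// value only when the key exists and holds exactly type T; otherwise
// return the supplied default.
template <typename T>
T value_or_default(const body_like & body, const std::string & key, const T & fallback) {
    auto it = body.find(key);
    if (it == body.end()) {
        return fallback; // key missing
    }
    if (const T * v = std::get_if<T>(&it->second)) {
        return *v;       // key present with the expected type
    }
    return fallback;     // null or wrong type
}
```

Because the helper already covers the missing-key case, a separate contains-style guard before the call adds nothing, which is the reason the commit gives for collapsing the override into a single call. When calling it with a string default, pass the template argument explicitly (value_or_default<std::string>(...)) so a string literal is not deduced as a character array.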

bernardladenthin · 2026-06-13T15:21:12Z

Yeah, this seems good to me and it's a minimal change. Could you also add support for runtime reasoning_budget_message while we're at it?

Done, please check and wait for CI. Thanks.

bernardladenthin · 2026-06-13T15:28:16Z

I'd appreciate your guidance here also: #22393

bernardladenthin requested review from a team and pwilkin as code owners May 15, 2026 18:41

github-actions Bot added the testing, examples, and server labels May 15, 2026

Merge branch 'master' into fix-reasoning-budget-tokens

dd63363

pwilkin approved these changes Jun 10, 2026


Merge branch 'master' into fix-reasoning-budget-tokens

968eb53

pwilkin approved these changes Jun 13, 2026


bernardladenthin wants to merge 4 commits into ggml-org:master from bernardladenthin:fix-reasoning-budget-tokens
