Issue from orion/o6/app-development/artificial-intelligence/llama-cpp #1701

Description

@JackAAnders

URL: https://docs.radxa.com/orion/o6/app-development/artificial-intelligence/llama-cpp

Time: 4/17/2026, 1:04:50 PM

Bug report for cix-llama-cpp (affects 1.2.4 and 1.2.6):

Title: llama-server-vulkan produces empty content in chat completions after the first request (KV cache corruption on Gemma 3 / Vulkan backend)

Reproduction:

/usr/share/cix/bin/llama-server-vulkan \
  --model \
  --device Vulkan0 --n-gpu-layers 99

Send three sequential POST /v1/chat/completions requests. The first returns only '\n'; the second and third return "". The reported token count is non-zero, but the generated text is always empty.
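
A minimal repro sketch, assuming the server is up on llama-server's default port 8080 (host, port, and prompt are illustrative, not from the report):

for i in 1 2 3; do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"Say hello."}],"max_tokens":64}'
  echo
done
# Observed: the first response's message content is "\n", the second and
# third are "", while usage.completion_tokens stays non-zero each time.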

Root cause observed: the binary forces n_parallel=4 and kv_unified=true even when --parallel 1 is passed. After the first inference, the KV cache state corrupts the decode output of every subsequent request on the Vulkan backend. On the CPU backend (--device none) the problem does not occur.
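
One way to confirm the forced parallelism from outside, assuming this build keeps upstream llama-server's /props endpoint (whose total_slots field reflects n_parallel):

curl -s http://127.0.0.1:8080/props | grep -o '"total_slots":[0-9]*'
# Prints "total_slots":4 even when the server was started with --parallel 1.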

Secondary bug: the --jinja flag makes the server hang indefinitely on the first request; the connection is accepted but never answered.
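
A hedged way to observe the hang without blocking a terminal forever (same illustrative host/port as above): give curl a read timeout so it exits with code 28 instead of waiting indefinitely.

curl -s --max-time 15 http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hi"}]}' \
  || echo "no response before timeout (curl exit $?)"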

Workaround: None currently available that preserves Vulkan acceleration.
