Issue from orion/o6/app-development/artificial-intelligence/llama-cpp #1701

Description

@JackAAnders

URL: https://docs.radxa.com/orion/o6/app-development/artificial-intelligence/llama-cpp

Time: 4/17/2026, 1:04:50 PM

Bug report for cix-llama-cpp (affects 1.2.4 and 1.2.6):

Title: llama-server-vulkan produces empty content in chat completions after the first request (KV cache corruption on Gemma 3 / Vulkan backend)

Reproduction:

/usr/share/cix/bin/llama-server-vulkan \
  --model \
  --device Vulkan0 --n-gpu-layers 99

Send three sequential POST /v1/chat/completions requests. The first returns only '\n'; the second and third return "". The reported token count is non-zero, but the generated text is always empty.
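
A minimal repro sketch, assuming the server is up on llama-server's default port 8080 (host, port, and prompt are illustrative, not from the report):

for i in 1 2 3; do
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"messages":[{"role":"user","content":"Say hello."}],"max_tokens":64}'
  echo
done
# Observed: the first response's message content is "\n", the second and
# third are "", while usage.completion_tokens stays non-zero each time.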

Root cause observed: the binary forces n_parallel=4 and kv_unified=true even when --parallel 1 is passed. After the first inference, the KV cache state corrupts the decode output of every subsequent request on the Vulkan backend. On the CPU backend (--device none) the problem does not occur.
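
One way to confirm the forced parallelism from outside, assuming this build keeps upstream llama-server's /props endpoint (whose total_slots field reflects n_parallel):

curl -s http://127.0.0.1:8080/props | grep -o '"total_slots":[0-9]*'
# Prints "total_slots":4 even when the server was started with --parallel 1.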

Secondary bug: the --jinja flag makes the server hang indefinitely on the first request; the connection is accepted but never answered.
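
A hedged way to observe the hang without blocking a terminal forever (same illustrative host/port as above): give curl a read timeout so it exits with code 28 instead of waiting indefinitely.

curl -s --max-time 15 http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"hi"}]}' \
  || echo "no response before timeout (curl exit $?)"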

Workaround: None currently available that preserves Vulkan acceleration.
