KV Cache Memory Estimation Error for GLM-4.7-Flash-AWQ on V100 #4366

@windreamer

Description

When running GLM-4.7-Flash-AWQ on a single V100-32G-SXM2, the KV Cache memory estimation appears to be incorrect, causing premature context length truncation.

Environment

Observed Behavior

  • Context length initialized: Only 4928 tokens (severely truncated)
  • Expected behavior: Should support much longer context with 32GB VRAM

Error Logs

[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6

Root Cause Analysis

The issue appears to be in the block manager's maximum-block-count calculation. The session length is truncated as follows:

const auto max_cached_tokens = seq_mgr_->max_block_count() * (size_t)cache_block_seq_len * param_.attn_cp_size;
session_len_trunc_ = std::min(max_cached_tokens, (size_t)param_.session_len);

Located at: lmdeploy/src/turbomind/engine/engine.cc (lines 248-253)

Questions

  1. Is the 2.6 MB/token KV cache usage expected for this model configuration?
  2. Why does the block manager underestimate available memory on V100?

@windreamer I tested this PR on a single V100-32G-SXM2 and it runs, but the KV cache is far too large: the GLM-4.7-Flash-AWQ weights are 18.4 GB and total usage after startup is 31 GB, yet only a 4928-token context was initialized. Isn't 2.6 MB/token of cache usage alarmingly high?

Active code page: 65001
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin, please note cuda version should >= 11.3 when compiled with cuda 11
The following generation flags are not valid and may be ignored: ['top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
2026-02-24 15:35:59,808 - lmdeploy - WARNING - converter.py:67 - data type fallback to float16 since torch.cuda.is_bf16_supported is False
[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
2026-02-24 15:36:01,275 - lmdeploy - WARNING - turbomind.py:246 - get 27431 model params
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6
HINT:    Please open http://127.0.0.1:10002 in a browser for detailed api usage!!!
INFO:     Started server process [7100]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:10002 (Press CTRL+C to quit)

Originally posted by @lingyezhixing in #4283
