Description
When running GLM-4.7-Flash-AWQ on a single V100-32G-SXM2, the KV Cache memory estimation appears to be incorrect, causing premature context length truncation.
Environment
- GPU: NVIDIA V100-32G-SXM2
- CUDA: 12.9
- Model: GLM-4.7-Flash-AWQ (18.4GB weights)
- Framework: LMDeploy (TurboMind backend) with PR GLM-4.7-Flash Turbomind support #4362
Observed Behavior
- Context length initialized: Only 4928 tokens (severely truncated)
- Expected behavior: Should support much longer context with 32GB VRAM
Error Logs
[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6
Root Cause Analysis
The issue appears to be in the block manager's maximum block calculation logic. The session length is truncated based on:
```cpp
const auto max_cached_tokens = seq_mgr_->max_block_count() * (size_t)cache_block_seq_len * param_.attn_cp_size;
session_len_trunc_ = std::min(max_cached_tokens, (size_t)param_.session_len);
```
Located at: `lmdeploy/src/turbomind/engine/engine.cc` (lines 248-253)
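To illustrate how this logic can produce the observed 4928-token limit, here is a minimal Python mirror of the C++ truncation above. The block count of 77 and context-parallel size of 1 are assumptions chosen purely to reproduce the reported number; the block size of 64 tokens is a common TurboMind default, not a value read from these logs.

```python
# Hypothetical reproduction of TurboMind's session_len truncation
# (engine.cc, lines 248-253). All concrete numbers are assumptions
# for illustration, not values read from the running engine.

def truncated_session_len(max_block_count: int,
                          cache_block_seq_len: int,
                          attn_cp_size: int,
                          session_len: int) -> int:
    """Mirror of: session_len_trunc_ = min(max_cached_tokens, session_len)."""
    max_cached_tokens = max_block_count * cache_block_seq_len * attn_cp_size
    return min(max_cached_tokens, session_len)

# Assuming a 64-token block size and no context parallelism, 77 free
# KV cache blocks would reproduce the observed truncation: 77 * 64 = 4928.
print(truncated_session_len(77, 64, 1, 202752))  # 4928
```

If this reading is right, the question reduces to why `max_block_count()` is so small on a 32GB card after the 18.4GB weights are loaded.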
Questions
- Is the 2.6MB/token KV Cache usage expected for this model configuration?
- Why does the block manager underestimate available memory on V100?
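As a sanity check on the first question, the 2.6MB/token figure can be cross-checked from the numbers in this report alone. The sketch below uses only the reported 31GB total usage, 18.4GB weights, and 4928-token context; note the remainder also covers activations and workspace, so this is an upper bound on the true per-token KV cache cost. The generic formula is included for comparison against the model's actual config (layers, KV heads, head dim), which is not given here.

```python
# Back-of-the-envelope check of the reported ~2.6 MB/token KV cache usage.

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Standard per-token KV cache size: K and V, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Cross-check from this report: ~31 GB total minus 18.4 GB of weights
# leaves ~12.6 GB for KV cache (plus activations/workspace), spread
# over the 4928 tokens the engine actually initialized.
mb_per_token = (31.0 - 18.4) * 1024 / 4928
print(round(mb_per_token, 2))  # 2.62 -- consistent with the reported 2.6 MB/token
```

So the 2.6MB/token number is internally consistent with the logs; whether it is expected depends on GLM-4.7-Flash's attention configuration, which `kv_bytes_per_token` can be evaluated against once the config values are known.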
@windreamer I tested this PR on a single V100-32G-SXM2 and it runs, but the KV cache is too large: the GLM-4.7-Flash-AWQ weights are 18.4GB and total usage after startup is 31GB, yet only a 4928-token context is initialized. Isn't 2.6MB/token of cache usage excessive?
Active code page: 65001
Add dll path C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin, please note cuda version should >= 11.3 when compiled with cuda 11
The following generation flags are not valid and may be ignored: ['top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
2026-02-24 15:35:59,808 - lmdeploy - WARNING - converter.py:67 - data type fallback to float16 since torch.cuda.is_bf16_supported is False
[TM][WARNING] [TM] `max_context_token_num` is not set, default to 202752.
2026-02-24 15:36:01,275 - lmdeploy - WARNING - turbomind.py:246 - get 27431 model params
[TM][WARNING] [SegMgr] prefix caching is enabled
[TM][WARNING] `session_len` truncated to 4928 due to limited KV cache memory
[TM][ERROR] [Engine] Warm-up for 6144 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8192 tokens failed with status 6
[TM][ERROR] [Engine] Warm-up for 8320 tokens failed with status 6
HINT: Please open http://127.0.0.1:10002 in a browser for detailed api usage!!!
INFO: Started server process [7100]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:10002 (Press CTRL+C to quit)
Originally posted by @lingyezhixing in #4283