Description
Hi Team,
I’m facing an issue while containerizing a DeepSpeed-MII deployment using the persistent server mode.
Steps Performed:
- I created a `mii_server.py` script following the [DeepSpeed-MII documentation], using persistent mode:

```python
import mii

MODEL_PATH = "./llama-3-70b-finetuned"  # LoRA + base model, merged
DEPLOYMENT_NAME = "test-deepspeed"

client = mii.serve(
    model_name_or_path=MODEL_PATH,
    deployment_name=DEPLOYMENT_NAME,
    tensor_parallel=2,
    enable_restful_api=True,
    restful_api_port=8084,
    max_length=2048,
)
```
- This script runs successfully on my local machine using 2 GPUs.
- I built a Docker image based on the same setup and ran it using:

```shell
docker run --gpus all --shm-size=10g -e CUDA_VISIBLE_DEVICES=0,1 -p 8084:8084 <image>
```
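As a sanity check (a hypothetical snippet I could add before calling `mii.serve`, not something from the MII docs), the number of devices listed in `CUDA_VISIBLE_DEVICES` can be compared against the tensor-parallel degree:

```python
import os

def visible_gpu_count(env=None):
    """Count GPUs exposed via CUDA_VISIBLE_DEVICES ('' or unset counts as 0)."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in visible.split(",") if d.strip()])

# The docker run above passes CUDA_VISIBLE_DEVICES=0,1, so inside the
# container this should report 2; if it reports fewer, tensor_parallel=2
# cannot shard the model across both GPUs.
print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"}))  # prints 2
```

If this reports 2 inside the container, the device mapping itself is probably fine and the problem lies elsewhere in the MII launch.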
However, inside the container the model consistently hits CUDA OOM errors. Each GPU reports roughly a 10 GiB memory shortfall, which matches what I'd expect from running the model on a single GPU without tensor parallelism. This leads me to believe that tensor parallelism isn't being applied correctly in the containerized environment, even though it works as expected locally.
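For reference, a back-of-the-envelope calculation (my own rough numbers, assuming fp16 weights and ignoring activations and KV cache) shows why the observed footprint looks like the un-sharded case:

```python
# Rough per-GPU weight memory for a 70B-parameter model in fp16.
# Illustrative only: activations and KV cache are not counted.
PARAMS = 70e9
BYTES_PER_PARAM = 2  # fp16

def weight_mem_gib(tensor_parallel):
    """Approximate per-GPU weight memory in GiB when sharded TP ways."""
    return PARAMS * BYTES_PER_PARAM / tensor_parallel / 2**30

tp2 = weight_mem_gib(2)  # ~65 GiB per GPU when TP=2 is actually applied
tp1 = weight_mem_gib(1)  # ~130 GiB per GPU if each rank loads the full model
```

If each rank were loading the full ~130 GiB of weights instead of its ~65 GiB shard, an OOM on the order of the one I'm seeing is exactly what would result.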
Environment Details
- Model: LLaMA-3 70B (merged LoRA)
- CUDA Version: 12.1.1
- Python Version: 3.10
- Requirements: `deepspeed-mii`, `numpy==2.1.3`, `triton==3.3.1`
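For context, this is a simplified sketch of the kind of Dockerfile I'm building from (the base image matches the CUDA 12.1.1 environment above; paths are illustrative, not my exact build):

```dockerfile
# Simplified sketch of my build, for illustration only.
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3.10 python3-pip \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
# deepspeed-mii, numpy==2.1.3, triton==3.3.1
RUN python3 -m pip install -r requirements.txt

COPY mii_server.py .
EXPOSE 8084
CMD ["python3", "mii_server.py"]
```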
Is there any additional configuration or consideration required when deploying DeepSpeed-MII in Docker to ensure tensor parallelism is honored?
Any guidance or recommended best practices for dockerizing the MII persistent server with large models would be highly appreciated.
Thanks,
Inderjeet Vishnoi