
0.20.10


Released by @r4victor on 19 Feb 12:42 · 008efc8

Services

Prefill-Decode disaggregation

dstack now supports disaggregated Prefill-Decode inference, allowing both Prefill and Decode worker types to run within a single service.

To define and run such a service, set pd_disaggregation to true under the router property (this requires the gateway to use the sglang router), and define separate replica groups for the Prefill and Decode worker types:

type: service
name: prefill-decode

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

image: lmsysorg/sglang:latest

replicas:
  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
          python -m sglang.launch_server \
            --model-path $MODEL_ID \
            --disaggregation-mode prefill \
            --disaggregation-transfer-backend mooncake \
            --host 0.0.0.0 \
            --port 8000 \
            --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
          python -m sglang.launch_server \
            --model-path $MODEL_ID \
            --disaggregation-mode decode \
            --disaggregation-transfer-backend mooncake \
            --host 0.0.0.0 \
            --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
  - type: http
    url: /health_generate
    interval: 15s

router:
  type: sglang
  pd_disaggregation: true

Note

pd_disaggregation requires the gateway and the replicas to run in the same cluster. Currently, this works with the aws, gcp, and kubernetes backends, as they support creating both clusters and gateways. Support for more backends (and eventually SSH fleets) is coming soon. See the fleet sketch below.

Currently, pd_disaggregation works only with SGLang. Support for vLLM is coming soon.

Support for additional scaling metrics, such as TTFT (time to first token) and ITL (inter-token latency), is also coming soon to enable autoscaling of Prefill and Decode workers.
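
Running the gateway and replicas in the same cluster typically means provisioning a cluster fleet first. Below is a minimal sketch of such a fleet; the name, node count, and backend are assumptions for illustration, not part of this release:

type: fleet
name: pd-cluster   # assumed name
nodes: 2           # assumed node count
placement: cluster
backends: [aws]

resources:
  gpu: H200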

Model endpoint

Previously, if you configured the model property, dstack provided a global model endpoint at gateway.<gateway domain> (or /proxy/models/<project name>) that allowed access to all models deployed in the project. This endpoint is now deprecated.

Now, any deployed model should be accessed via the service endpoint itself at <run name>.<gateway domain> (or /proxy/services/main/<service name>).
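
For example, the model from the service above could be queried through its OpenAI-compatible service endpoint roughly as follows; here, example.com stands in for your gateway domain and $DSTACK_TOKEN for your dstack user token:

curl https://prefill-decode.example.com/v1/chat/completions \
  -H "Authorization: Bearer $DSTACK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.5-Air-FP8",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'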

Note

If you configure the model property, dstack automatically enables CORS on the service endpoint. Future versions will allow you to disable or customize this behavior.

CLI

dstack apply

Previously, if you did not specify gpu, dstack treated it as 0..1 but did not display this in the run plan. Now, dstack shows the default explicitly, as in the plan below. Additionally, if you do not specify image, dstack now defaults the gpu vendor to nvidia.

dstack apply -f dev.dstack.yml
 Project              peterschmidt85
 User                 peterschmidt85
 Type                 dev-environment
 Resources            cpu=2.. mem=8GB.. disk=100GB.. gpu=0..
 Spot policy          on-demand
 Max price            off
 Retry policy         off
 Idle duration        5m
 Max duration         off
 Inactivity duration  off

 #  BACKEND         RESOURCES                  INSTANCE TYPE  PRICE
 1  verda (FIN-01)  cpu=4 mem=16GB disk=100GB  CPU.4V.16G     $0.0279
 2  verda (FIN-02)  cpu=4 mem=16GB disk=100GB  CPU.4V.16G     $0.0279
 3  verda (FIN-03)  cpu=4 mem=16GB disk=100GB  CPU.4V.16G     $0.0279
    ...

Submit the run dev? [y/n]: 

This makes the run plan more explicit and easier to read.
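
For reference, these defaults correspond roughly to spelling out the gpu specification in the configuration. A sketch, with the name and ide values assumed for illustration:

type: dev-environment
name: dev     # assumed
ide: vscode   # assumed

resources:
  gpu:
    count: 0..1    # previously applied silently; now shown in the run plan
    vendor: nvidia # the default vendor when image is not specified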

What's changed

Full changelog: 0.20.9...0.20.10