
0.20.10


Released by @r4victor on 19 Feb 12:42 · 008efc8

Services

Prefill-Decode disaggregation

dstack now supports disaggregated Prefill-Decode inference, allowing both Prefill and Decode worker types to run within a single service.

To define and run such a service, set pd_disaggregation to true under the router property (this requires the gateway to use the sglang router), and define separate replica groups for the Prefill and Decode worker types:

type: service
name: prefill-decode

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

image: lmsysorg/sglang:latest

replicas:
  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
          python -m sglang.launch_server \
            --model-path $MODEL_ID \
            --disaggregation-mode prefill \
            --disaggregation-transfer-backend mooncake \
            --host 0.0.0.0 \
            --port 8000 \
            --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
          python -m sglang.launch_server \
            --model-path $MODEL_ID \
            --disaggregation-mode decode \
            --disaggregation-transfer-backend mooncake \
            --host 0.0.0.0 \
            --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
  - type: http
    url: /health_generate
    interval: 15s

router:
  type: sglang
  pd_disaggregation: true

Note

pd_disaggregation requires the gateway and the replicas to run in the same cluster. Currently, this works with the aws, gcp, and kubernetes backends, as they support creating both clusters and gateways. Support for more backends (and eventually SSH fleets) is coming soon. See the fleet sketch below.

Currently, pd_disaggregation works only with SGLang. Support for vLLM is coming soon.

Support for additional scaling metrics, such as TTFT (time to first token) and ITL (inter-token latency), is also coming soon to enable autoscaling of Prefill and Decode workers.
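
Running the gateway and replicas in the same cluster typically means provisioning a cluster fleet first. Below is a minimal sketch of such a fleet; the name, node count, and backend are assumptions for illustration, not part of this release:

type: fleet
name: pd-cluster   # assumed name
nodes: 2           # assumed node count
placement: cluster
backends: [aws]

resources:
  gpu: H200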

Model endpoint

Previously, if you configured the model property, dstack provided a global model endpoint at gateway.<gateway domain> (or /proxy/models/<project name>) that allowed access to all models deployed in the project. This endpoint is now deprecated.

Now, any deployed model should be accessed via the service endpoint itself at <run name>.<gateway domain> (or /proxy/services/main/<service name>).
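
For example, the model from the service above could be queried through its OpenAI-compatible service endpoint roughly as follows; here, example.com stands in for your gateway domain and $DSTACK_TOKEN for your dstack user token:

curl https://prefill-decode.example.com/v1/chat/completions \
  -H "Authorization: Bearer $DSTACK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.5-Air-FP8",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'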

Note

If you configure the model property, dstack automatically enables CORS on the service endpoint. Future versions will allow you to disable or customize this behavior.

CLI

dstack apply

Previously, if you did not specify gpu, dstack treated it as 0..1 but did not display this in the run plan. Now, dstack shows the default explicitly, as in the plan below. Additionally, if you do not specify image, dstack now defaults the gpu vendor to nvidia.

dstack apply -f dev.dstack.yml
 Project              peterschmidt85
 User                 peterschmidt85
 Type                 dev-environment
 Resources            cpu=2.. mem=8GB.. disk=100GB.. gpu=0..
 Spot policy          on-demand
 Max price            off
 Retry policy         off
 Idle duration        5m
 Max duration         off
 Inactivity duration  off

 #  BACKEND         RESOURCES                  INSTANCE TYPE  PRICE
 1  verda (FIN-01)  cpu=4 mem=16GB disk=100GB  CPU.4V.16G     $0.0279
 2  verda (FIN-02)  cpu=4 mem=16GB disk=100GB  CPU.4V.16G     $0.0279
 3  verda (FIN-03)  cpu=4 mem=16GB disk=100GB  CPU.4V.16G     $0.0279
    ...

Submit the run dev? [y/n]: 

This makes the run plan more explicit and easier to read.
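
For reference, these defaults correspond roughly to spelling out the gpu specification in the configuration. A sketch, with the name and ide values assumed for illustration:

type: dev-environment
name: dev     # assumed
ide: vscode   # assumed

resources:
  gpu:
    count: 0..1    # previously applied silently; now shown in the run plan
    vendor: nvidia # the default vendor when image is not specified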

What's changed

Full changelog: 0.20.9...0.20.10