Services
Prefill-Decode disaggregation
dstack now supports disaggregated Prefill–Decode inference, allowing both Prefill and Decode worker types to run within a single service.
To define and run such a service, set pd_disaggregation to true under the router property (this requires the gateway to use the sglang router, and define separate replica groups for Prefill and Decode worker types:
type: service
name: prefill-decode
env:
- HF_TOKEN
- MODEL_ID=zai-org/GLM-4.5-Air-FP8
image: lmsysorg/sglang:latest
replicas:
- count: 1..4
scaling:
metric: rps
target: 3
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000 \
--disaggregation-bootstrap-port 8998
resources:
gpu: H200
- count: 1..8
scaling:
metric: rps
target: 2
commands:
- |
python -m sglang.launch_server \
--model-path $MODEL_ID \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--host 0.0.0.0 \
--port 8000
resources:
gpu: H200
port: 8000
model: zai-org/GLM-4.5-Air-FP8
probes:
- type: http
url: /health_generate
interval: 15s
router:
type: sglang
pd_disaggregation: trueNote
Note, pd_disaggregation requires both the gateway and replicas to use the same cluster. With dstack, this can now be used with the aws, gcp, kubernetes backends (as they support creating both clusters and gateways). Support for more backends (and eventually SSH fleets) is coming soon.
Currently, pd_disaggregation works only with SGLang. Support for vLLM is coming soon.
Support for additional scaling metrics, such as TTFT and ITL, is also coming soon to enable autoscaling of Prefill and Decode workers.
Model endpoint
If you configure the model property, dstack previously provided a global model endpoint at gateway.<gateway domain> (or /proxy/models/<project name>), allowing access to all models deployed in the project. This endpoint has been deprecated.
Now, any deployed model should be accessed via the service endpoint itself at <run name>.<gateway domain> (or /proxy/services/main/<service name>).
Note
If you configure the model property, dstack automatically enables CORS on the service endpoint. Future versions will allow you to disable or customize this behavior.
CLI
dstack apply
Previously, if you did not specify gpu, dstack treated it as 0..1 but did not display it in the run plan. Now, dstack properly displays this default. Additionally, if you do not specify image, dstack automatically defaults the vendor to nvidia.
dstack apply -f dev.dstack.yml
Project peterschmidt85
User peterschmidt85
Type dev-environment
Resources cpu=2.. mem=8GB.. disk=100GB.. gpu=0..
Spot policy on-demand
Max price off
Retry policy off
Idle duration 5m
Max duration off
Inactivity duration off
# BACKEND RESOURCES INSTANCE TYPE PRICE
1 verda (FIN-01) cpu=4 mem=16GB disk=100GB CPU.4V.16G $0.0279
2 verda (FIN-02) cpu=4 mem=16GB disk=100GB CPU.4V.16G $0.0279
3 verda (FIN-03) cpu=4 mem=16GB disk=100GB CPU.4V.16G $0.0279
...
Submit the run dev? [y/n]: This makes the run plan much more explicit and clear.
What's changed
- [Docs] Nebius example under
Clustersby @peterschmidt85 in #3567 - [Docs] Add get nodes rule to K8s ClusterRole by @un-def in #3571
- [Docs] Clarified the behavior of idle duration: how run's
idle_durationand fleet'sidle_durationare applied by @peterschmidt85 in #3574 - [runner] Don't bind to public addresses by @un-def in #3575
- Migrate service model base url by @peterschmidt85 in #3560
- Set explicit GPU defaults in ResourcesSpec and improve default GPU vendor selection by @peterschmidt85 in #3573
- Add
--verbosetodstack applyand enhance run plan output by @peterschmidt85 in #3572 - Cosmetical changes to the home page (font; headline; etc) by @peterschmidt85 in #3582
- Implement pipeline tasks by @r4victor in #3581
- Add pd disaggregated inference by @Bihan in #3558
- Group db migrations by @r4victor in #3583
- Clarify GPU vendor inference comments (follow-up to #3573) by @peterschmidt85 in #3588
- Kubernetes: gateway: start services via docker-systemctl-replacement by @un-def in #3584
- Remove dangling services from gateway by @jvstme in #3586
- [runner] Check capabilities(7) by @un-def in #3587
- [runner] Check if repo dir exists before chown by @un-def in #3589
Full changelog: 0.20.9...0.20.10