Add gfx950 (MI355X) preload tuning table for preshuffle GEMM by andyluo7 · Pull Request #411 · ROCm/FlyDSL

andyluo7 · 2026-04-16T21:14:16Z

Summary

gfx950 (MI355X) has different LDS and VMEM latencies from gfx942 (MI300X). Without tuned preload values, preshuffle GEMM regresses by 14.6% end to end (E2E) on DeepSeek-R1 MoE workloads. With a gfx950-specific preload table, the regression is eliminated (-0.7% vs. baseline, within noise).

E2E Results (DeepSeek-R1, TP=8)

Config	MI355X tok/s	Δ vs baseline
Baseline (no FlyDSL)	109.05	—
FlyDSL main (gfx942 configs)	93.11	-14.6% ❌
FlyDSL + gfx950 preload table	108.31	-0.7% ✅
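
For reference, the Δ column can be reproduced from the tok/s figures with a few lines of Python. This is only a sketch of the arithmetic; the helper name `delta_pct` and the dictionary labels are illustrative and are not part of the PR.

```python
# Recompute the "Δ vs baseline" column from the throughput numbers in the table.
# delta_pct is a hypothetical helper used only for this illustration.

def delta_pct(value: float, baseline: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

baseline_tok_s = 109.05  # Baseline (no FlyDSL)
results = {
    "FlyDSL main (gfx942 configs)": 93.11,
    "FlyDSL + gfx950 preload table": 108.31,
}

for name, tok_s in results.items():
    print(f"{name}: {delta_pct(tok_s, baseline_tok_s):+.1f}%")
# FlyDSL main (gfx942 configs): -14.6%
# FlyDSL + gfx950 preload table: -0.7%
```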

Changes

Add _TILE_PRELOAD_TABLE_GFX950 with tuned (dsrd_preload, dvmem_preload) values for common tile sizes on MI355X
Update _get_preload() to accept gpu_arch parameter and select the appropriate table
Use gfx950 table for both FP8/INT8 and BF16 paths on MI355X

Testing

Tuned on Tensorwave MI355X cluster (mia1-p02-g32) with DeepSeek-R1 MoE expert shapes (N=7168, K=18432).

andyluo7 wants to merge 1 commit into ROCm:main from andyluo7:gfx950-preload-tuning
