Releases: ROCm/madengine
v2.1.0
What's Changed
- feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124) by @coketaste in #130
- added build-context to docker cmd for ./tools location so various doc… by @ggankhuy in #131
- docs: add Docker build-context tools entry to v2.1.0 CHANGELOG by @coketaste in #133
Full Changelog: v2.0.3...v2.1.0
v2.0.3
What's Changed
- refactor(k8s): decompose kubernetes.py into focused mixin modules by @coketaste in #120
- feat: make rocenv tool lite/full mode configurable via additional_context by @coketaste in #125
- refactor(v2-review): shell injection hardening, bug fixes, and test cleanup by @coketaste in #122
- fix(build): sanitize slashes in multi-arch image names by @coketaste in #127
- docs: add codebase wiki by @coketaste in #128
- (fix) Update the path of wiki by @coketaste in #129
- fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment by @coketaste in #126
Full Changelog: v2.0.2...v2.0.3
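The notes for #127 do not say how slashes in multi-arch image names are sanitized. As a rough illustration only, the sketch below replaces each slash with an underscore; the function name `sanitize_image_name`, the `_` replacement character, and the sample name are assumptions and not madengine's actual code.

```python
def sanitize_image_name(name: str, replacement: str = "_") -> str:
    """Return `name` with every '/' replaced.

    Hypothetical helper: madengine's real implementation may use a
    different replacement character or handle more characters.
    """
    return name.replace("/", replacement)


# Example with a made-up model name containing a directory prefix.
image_name = sanitize_image_name("org/model_gfx942")
```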
v2.0.2
What's Changed
- refactor(v2-review): fix timeout, extract auth module, fix security issues, update test suite by @coketaste in #108
Full Changelog: v2.0.1...v2.0.2
v2.0.1
What's Changed
- Align CLI utils, dataprovider, container_runner with v2 by @coketaste in #97
- feat(profiling): add rocm_trace_lite (RTL) and multi-node tool filtering by @coketaste in #98
- Profiling: RTL_MODE for rocm_trace_lite and rtl trace --mode in wrapper by @coketaste in #100
- feat(deployment): Primus on local/K8s/SLURM by @coketaste in #99
- Follow-up: align Primus with launchers by @coketaste in #103
- Add codeowners by @gargrahul in #105
- feat(discover): add short-name backward-compat matching for dir-prefixed models by @amathews-amd in #104
- refactor(auth): centralize credential loading and Docker registry login by @coketaste in #106
- Revert "refactor(auth): centralize credential loading and Docker registry login" by @gargrahul in #107
- Update performance regex pattern for log matching by @Saiamd999 in #101
- refactor(discover): scope-based model tag selection (unscoped vs scoped) by @coketaste in #109
- refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode by @coketaste in #113
- fix(csv): strip whitespace from CSV fieldnames parsing multiple_results by @coketaste in #115
- feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers) by @coketaste in #110
- Pass MAD_OUTPUT_CSV by @amathews-amd in #117
New Contributors
- @amathews-amd made their first contribution in #104
Full Changelog: v2.0.0...v2.0.1
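The CSV fix in #115 concerns whitespace around fieldnames in a `multiple_results` file. A minimal sketch of that kind of cleanup with Python's standard `csv` module follows; the helper name `read_rows_with_clean_headers` and the column names are assumptions for illustration and may not match madengine's code.

```python
import csv
import io


def read_rows_with_clean_headers(text: str) -> list[dict[str, str]]:
    """Parse CSV text, stripping whitespace from the header names only."""
    reader = csv.DictReader(io.StringIO(text))
    # Accessing `fieldnames` reads the header row; reassigning it makes
    # DictReader use the cleaned names as keys for every data row.
    if reader.fieldnames is not None:
        reader.fieldnames = [name.strip() for name in reader.fieldnames]
    return list(reader)


# Hypothetical results file whose header has spaces after the commas.
rows = read_rows_with_clean_headers(
    "model, performance, metric\nresnet50,123.4,samples/s\n"
)
```

Without the strip step, the second key would be `" performance"` (with a leading space), so a lookup by the plain column name would fail.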
v2.0.0
🎉 What's New
madengine v2.0 is a complete rewrite of the MAD orchestration engine with a modern, production-ready architecture. This major release replaces the legacy v1.x codebase with a unified CLI, comprehensive error handling, and support for distributed AI workloads across Kubernetes and SLURM.
🚀 Key Highlights
Unified CLI Experience
One command to rule them all: madengine now provides a consistent interface for all operations.
Multi-Target Deployment
Run AI workloads wherever you need them:
- Local: Direct Docker execution for development and single-GPU jobs
- Kubernetes: Production-ready K8s Jobs with full launcher support
- SLURM: HPC cluster integration with intelligent job scheduling
Distributed Framework Support
Native support for 6 distributed training and inference frameworks:
Training:
- torchrun (PyTorch DDP/FSDP)
- DeepSpeed (ZeRO optimization)
- Megatron-LM (large-scale transformers)
- TorchTitan (LLM pre-training with FSDP2+TP+PP+CP)
Inference:
- vLLM (high-throughput LLM inference)
- SGLang (structured generation)
All launchers work seamlessly with both Kubernetes and SLURM deployments.
Advanced Profiling
Comprehensive ROCm profiling suite for AMD GPUs:
- 8 pre-configured profiles: compute, memory, communication, full analysis, and more
- ROCprofv3 support: Latest ROCm 7.0+ profiling capabilities
- Perfetto integration: Generate traces for Perfetto UI visualization
- Ready-to-use configs: 6 example configurations in examples/profiling-configs/
Production-Grade Quality
- 4.5/5 code quality rating (detailed metrics in CODE_QUALITY_REPORT_v2.md)
- 71% type hint coverage with mypy validation
- Zero technical debt: No TODO/FIXME/HACK markers
- Pre-commit hooks: Automated quality checks (black, isort, flake8, mypy, bandit)
- Security fixes: SQL injection vulnerability patched, improved exception handling
What's Changed
- madengine v2 with unified framework for local and distribution by @coketaste in #57
Full Changelog: v1.0.0...v2.0.0
v1.0.0
What's Changed
- Update README.md by @gargrahul in #1
- Update the scripts and dockers in madengine package by @coketaste in #2
- Add support of deprecated models by @coketaste in #4
- Fix the failure of unit tests by @coketaste in #6
- Use normpath and improve override argument parsing in madengine discover by @Rohan138 in #7
- Fix small issues with madengine by @GeneDer in #5
- Fix docker sha inspect by @Rohan138 in #9
- Fix the location of error in perf csv update: by @coketaste in #13
- shared memory config in docker run by @coketaste in #10
- Revert "shared memory config in docker run" by @gargrahul in #21
- Share memory control, disable ipc option when shm-size is set by @coketaste in #22
- Add MAD_SYSTEM_GPU_PRODUCT_NAME to the madengine by @coketaste in #33
- fix GPU product name on MI250,MI355, and other platforms by @Rohan138 in #34
- Refactor rocm-smi to amd-smi by @coketaste in #19
- Update profiler and tracing with ROCm7 and amd-smi by @coketaste in #44
- Add self test for MAD_SYSTEM_GPU_PRODUCT_NAME by @ahmed-bsod in #39
- Update amd-smi and utils for ROCm7 by @coketaste in #48
- Fix DataFrame concatenation warning by @ahmed-bsod in #40
- Make the validation logic smarter by @coketaste in #49
- Fix profiling using amdsmi_cli python module by @coketaste in #50
- Add proper support for multiple_columns by @Rohan138 in #51
- Add TheRock model for validation by @coketaste in #53
- Fix the cleanup by @coketaste in #60
- Perf entry superset by @coketaste in #58
- Revert "Perf entry superset" by @gargrahul in #66
- Fail Check condition update for RPM distro by @shashank-parsi in #64
- update model discovery to handle tags in subdirectories for madenginev1 by @leconcio in #83
- rocm-smi back call if amd-smi missing by @coketaste in #54
- Enhanced Perf Metric Reporting System by @coketaste in #65
New Contributors
- @gargrahul made their first contribution in #1
- @Rohan138 made their first contribution in #7
- @GeneDer made their first contribution in #5
- @ahmed-bsod made their first contribution in #39
- @shashank-parsi made their first contribution in #64
Full Changelog: https://github.com/ROCm/madengine/commits/v1.0.0