Releases: ROCm/madengine

v2.1.0

29 May 13:42
3f617b9

What's Changed

  • feat(deployment): add slurm_multi self-managed multi-node SLURM launcher (Revised PR#124) by @coketaste in #130
  • added build-context to docker cmd for ./tools location so various doc… by @ggankhuy in #131
  • docs: add Docker build-context tools entry to v2.1.0 CHANGELOG by @coketaste in #133

Full Changelog: v2.0.3...v2.1.0

v2.0.3

27 May 01:24
8d86f45

What's Changed

  • refactor(k8s): decompose kubernetes.py into focused mixin modules by @coketaste in #120
  • feat: make rocenv tool lite/full mode configurable via additional_context by @coketaste in #125
  • refactor(v2-review): shell injection hardening, bug fixes, and test cleanup by @coketaste in #122
  • fix(build): sanitize slashes in multi-arch image names by @coketaste in #127
  • docs: add codebase wiki by @coketaste in #128
  • (fix) Update the path of wiki by @coketaste in #129
  • fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment by @coketaste in #126
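The slash fix listed above (#127) concerns image names that contain `/`. As a rough illustration only, the sketch below shows one way such a sanitizer could work. The function name `sanitize_image_name`, the underscore replacement, and the sample name are assumptions for illustration, not madengine's actual code.

```python
def sanitize_image_name(name: str, replacement: str = "_") -> str:
    """Replace every slash so the name can be used as a single path or tag component.

    Hypothetical helper for illustration; not taken from madengine.
    """
    return name.replace("/", replacement)


# Example input: a directory-prefixed model name with an architecture suffix
# (both made up for this sketch).
print(sanitize_image_name("vendor/model_gfx942"))  # prints vendor_model_gfx942
```

The replacement character is a design choice: it must be valid in the target context (image tag, file name) and should not collide with an existing name.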

Full Changelog: v2.0.2...v2.0.3

v2.0.2

06 May 18:27
75870d2

What's Changed

  • refactor(v2-review): fix timeout, extract auth module, fix security issues, update test suite by @coketaste in #108

Full Changelog: v2.0.1...v2.0.2

v2.0.1

28 Apr 23:48
4173f49

What's Changed

  • Align CLI utils, dataprovider, container_runner with v2 by @coketaste in #97
  • feat(profiling): add rocm_trace_lite (RTL) and multi-node tool filtering by @coketaste in #98
  • Profiling: RTL_MODE for rocm_trace_lite and rtl trace --mode in wrapper by @coketaste in #100
  • feat(deployment): Primus on local/K8s/SLURM by @coketaste in #99
  • Follow-up: align Primus with launchers by @coketaste in #103
  • Add codeowners by @gargrahul in #105
  • feat(discover): add short-name backward-compat matching for dir-prefixed models by @amathews-amd in #104
  • refactor(auth): centralize credential loading and Docker registry login by @coketaste in #106
  • Revert "refactor(auth): centralize credential loading and Docker registry login" by @gargrahul in #107
  • Update performance regex pattern for log matching by @Saiamd999 in #101
  • refactor(discover): scope-based model tag selection (unscoped vs scoped) by @coketaste in #109
  • refactor(gpu-arch): auto-detect MAD_SYSTEM_GPU_ARCHITECTURE for local full-run mode by @coketaste in #113
  • fix(csv): strip whitespace from CSV fieldnames parsing multiple_results by @coketaste in #115
  • feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers) by @coketaste in #110
  • Pass MAD_OUTPUT_CSV by @amathews-amd in #117
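The CSV fix listed above (#115) strips whitespace from field names. A minimal sketch of that technique using Python's standard `csv` module follows. The function name `read_rows_with_clean_headers` and the sample columns are assumptions for illustration, not madengine's implementation.

```python
import csv
import io


def read_rows_with_clean_headers(text: str) -> list[dict[str, str]]:
    """Parse CSV text, stripping whitespace around header names.

    Hypothetical helper for illustration; not taken from madengine.
    """
    reader = csv.DictReader(io.StringIO(text))
    # Reading `fieldnames` consumes the header row; assign back a cleaned copy.
    if reader.fieldnames is not None:
        reader.fieldnames = [name.strip() for name in reader.fieldnames]
    return list(reader)


# Header written with spaces after the commas (made-up column names).
sample = "model, performance, metric\nresnet50,123.4,samples_per_sec\n"
rows = read_rows_with_clean_headers(sample)
print(rows[0]["performance"])  # prints 123.4
```

Without the strip, the second column would be keyed as `" performance"` (leading space), and a lookup by `"performance"` would raise `KeyError`.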

Full Changelog: v2.0.0...v2.0.1

v2.0.0

09 Apr 13:59
04aac39

🎉 What's New

madengine v2.0 is a complete rewrite of the MAD orchestration engine with a modern, production-ready architecture. This major release replaces the legacy v1.x codebase with a unified CLI, comprehensive error handling, and support for distributed AI workloads across Kubernetes and SLURM.

🚀 Key Highlights

Unified CLI Experience

One command to rule them all: madengine now provides a consistent interface for all operations.

Multi-Target Deployment

Run AI workloads wherever you need them:

  • Local: Direct Docker execution for development and single-GPU jobs
  • Kubernetes: Production-ready K8s Jobs with full launcher support
  • SLURM: HPC cluster integration with intelligent job scheduling

Distributed Framework Support

Native support for 6 distributed training and inference frameworks:

Training:

  • torchrun (PyTorch DDP/FSDP)
  • DeepSpeed (ZeRO optimization)
  • Megatron-LM (large-scale transformers)
  • TorchTitan (LLM pre-training with FSDP2+TP+PP+CP)

Inference:

  • vLLM (high-throughput LLM inference)
  • SGLang (structured generation)

All launchers work seamlessly with both Kubernetes and SLURM deployments.

Advanced Profiling

Comprehensive ROCm profiling suite for AMD GPUs:

  • 8 pre-configured profiles: compute, memory, communication, full analysis, and more
  • ROCprofv3 support: Latest ROCm 7.0+ profiling capabilities
  • Perfetto integration: Generate traces for Perfetto UI visualization
  • Ready-to-use configs: 6 example configurations in examples/profiling-configs/

Production-Grade Quality

  • 4.5/5 code quality rating (detailed metrics in CODE_QUALITY_REPORT_v2.md)
  • 71% type hint coverage with mypy validation
  • Zero technical debt: No TODO/FIXME/HACK markers
  • Pre-commit hooks: Automated quality checks (black, isort, flake8, mypy, bandit)
  • Security fixes: SQL injection vulnerability patched, improved exception handling

What's Changed

  • madengine v2 with unified framework for local and distribution by @coketaste in #57

Full Changelog: v1.0.0...v2.0.0

v1.0.0

08 Apr 21:07
4438d32

Full Changelog: https://github.com/ROCm/madengine/commits/v1.0.0