
Implement multi-dimensional DistConv sharding #27

Open

PatrickRMiles wants to merge 21 commits into LBANN:main from PatrickRMiles:miles30/multidim_distconv

Conversation

@PatrickRMiles (Collaborator)

No description provided.

Comment on lines +7 to +13
problem_scale: 8 # Determines dataset resolution and number of unet layers. Default is 6.
unet_bottleneck_dim: 3 # Power of 2 of the unet bottleneck layer dimension. Default of 3 -> bottleneck layer of size 8.
seed: 42 # Random seed.
batch_size: 1 # Batch sizes for each vol size.
optimizer: "ADAM" # "ADAM" is the preferred option, otherwise training defaults to RMSProp.
num_shards: 2 # DistConv param: number of shards to divide the tensor into. It's best to choose the fewest ranks needed to fit one sample in GPU memory, since that keeps communication at a minimum
shard_dim: 2 # DistConv param: dimension on which to shard
dc_num_shards: [1, 1, 2] # DistConv param: number of shards to divide the tensor into. It's best to choose the fewest ranks needed to fit one sample in GPU memory, since that keeps communication at a minimum
dc_shard_dims: [2, 3, 4] # DistConv param: dimension on which to shard
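
For context, a minimal sketch of what the two lists describe, using plain PyTorch chunking rather than the actual DistConv API, and assuming a 5-D NCDHW activation:

import math
import torch

dc_num_shards = [1, 1, 2]  # pieces per listed dimension
dc_shard_dims = [2, 3, 4]  # D, H, W of an (N, C, D, H, W) tensor

x = torch.randn(1, 4, 8, 8, 8)           # one sample
ranks_needed = math.prod(dc_num_shards)  # 2 ranks hold this sample
pieces = [x]
for n, d in zip(dc_num_shards, dc_shard_dims):
    pieces = [c for p in pieces for c in p.chunk(n, dim=d)]
assert len(pieces) == ranks_needed  # each rank owns one (1, 4, 8, 8, 4) piece

The product of dc_num_shards is the number of ranks a single sample is spread across, which is why the comments advise keeping it as small as memory allows.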
Collaborator

What should the default scale be? Maybe we need scale-specific configs, for example:

problem_scale: 7
dc_num_shards: [1,1,1]

problem_scale: 8
dc_num_shards: [1,1,2]

problem_scale: 9
dc_num_shards: [2,2,4]
unet_bottleneck_dim: 4
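
One hypothetical way to encode that would be a per-scale defaults table built from the values above; SCALE_DEFAULTS and defaults_for are illustrative names, not existing code:

# Hypothetical per-scale DistConv defaults, sketched from the values above.
SCALE_DEFAULTS = {
    7: {"dc_num_shards": [1, 1, 1]},
    8: {"dc_num_shards": [1, 1, 2]},
    9: {"dc_num_shards": [2, 2, 4], "unet_bottleneck_dim": 4},
}

def defaults_for(problem_scale: int) -> dict:
    # Illustrative fallback: reuse the scale-8 settings for unlisted scales.
    return SCALE_DEFAULTS.get(problem_scale, SCALE_DEFAULTS[8])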

Comment on lines +118 to +136
def _ensure_tuple(val):
"""
Ensures the input value is converted to a tuple of integers.
Handles: int, list, tuple, and string representations like "[2,2]" or "2,2".
"""
if val is None:
return (1,) # Default safety
if isinstance(val, (list, tuple)):
return tuple(int(i) for i in val)
if isinstance(val, str):
# Handle cases where user might type literal "(2, 2, 2)" in YAML or "2,2" in CLI
val = val.strip("()[]").split(",")
return tuple(int(i.strip()) for i in val if i.strip())
# Fallback for single integer
return (
1,
1,
int(val),
)
Collaborator

I get that this is here because of how the current CLI arg --num-shards is set up, but I think we should just change that and remove this helper, since it is not obvious that a single value will become the third dimension of the tuple.
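
To make the surprise concrete, this is what the fallback paths of the function above produce:

_ensure_tuple(2)        # -> (1, 1, 2): a lone int silently lands in the last dim
_ensure_tuple([2, 2])   # -> (2, 2)
_ensure_tuple("2,2")    # -> (2, 2)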

Collaborator

Here's a patch with the changes:

diff --git a/ScaFFold/cli.py b/ScaFFold/cli.py
index 840a7e6..3c73f40 100644
--- a/ScaFFold/cli.py
+++ b/ScaFFold/cli.py
@@ -157,8 +157,9 @@ def main():
         help="Resume execution in this specific directory. Overrides --base-run-dir.",
     )
     benchmark_parser.add_argument(
-        "--num-shards",
+        "--dc-num-shards",
         type=int,
+        nargs=3,
         help="DistConv param: number of shards to divide the tensor into. It's best to choose the fewest ranks needed to fit one sample in GPU memory, since that keeps communication at a minimum",
     )
     benchmark_parser.add_argument(
diff --git a/ScaFFold/utils/config_utils.py b/ScaFFold/utils/config_utils.py
index b6d77d2..378dc51 100644
--- a/ScaFFold/utils/config_utils.py
+++ b/ScaFFold/utils/config_utils.py
@@ -73,8 +73,8 @@ class Config:
         self.target_dice = config_dict["target_dice"]
         self.checkpoint_interval = config_dict["checkpoint_interval"]
 
-        self.dc_num_shards = _ensure_tuple(config_dict.get("dc_num_shards", (1, 1, 1)))
-        self.dc_shard_dims = _ensure_tuple(config_dict.get("dc_shard_dims", (2, 3, 4)))
+        self.dc_num_shards = config_dict["dc_num_shards"]
+        self.dc_shard_dims = config_dict["dc_shard_dims"]
         self.dc_total_shards = math.prod(self.dc_num_shards)
         # Safety Check: Length mismatch
         if len(self.dc_num_shards) != len(self.dc_shard_dims):
@@ -113,24 +113,3 @@ def load_config(file_path: str, config_type: str):
         raise ValueError(
             f"Invalid config type specified: {type}. Must be either 'sweep' or 'run'"
         )
-
-
-def _ensure_tuple(val):
-    """
-    Ensures the input value is converted to a tuple of integers.
-    Handles: int, list, tuple, and string representations like "[2,2]" or "2,2".
-    """
-    if val is None:
-        return (1,)  # Default safety
-    if isinstance(val, (list, tuple)):
-        return tuple(int(i) for i in val)
-    if isinstance(val, str):
-        # Handle cases where user might type literal "(2, 2, 2)" in YAML or "2,2" in CLI
-        val = val.strip("()[]").split(",")
-        return tuple(int(i.strip()) for i in val if i.strip())
-    # Fallback for single integer
-    return (
-        1,
-        1,
-        int(val),
-    )
diff --git a/ScaFFold/utils/evaluate.py b/ScaFFold/utils/evaluate.py
index 6f8da8a..5c907c4 100644
--- a/ScaFFold/utils/evaluate.py
+++ b/ScaFFold/utils/evaluate.py
@@ -17,6 +17,7 @@ import torch.nn.functional as F
 from distconv import DCTensor
 from torch.distributed.tensor import DTensor, Replicate, Shard, distribute_tensor
 from tqdm import tqdm
+import numpy as np
 
 from ScaFFold.utils.dice_score import (
     SpatialAllReduce,
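
With nargs=3 the flag now takes three space-separated integers rather than one. A quick check of how argparse handles it, using a standalone parser rather than the one in cli.py:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dc-num-shards", type=int, nargs=3)
args = parser.parse_args(["--dc-num-shards", "1", "1", "2"])
assert args.dc_num_shards == [1, 1, 2]  # a plain list of ints

Since YAML sequences also load as Python lists, config_dict["dc_num_shards"] can be used directly, and math.prod and len work unchanged, which is what lets the patch delete _ensure_tuple.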

@michaelmckinsey1 (Collaborator) commented Mar 13, 2026

Could you also make unet_bottleneck_dim a CLI argument for running scale 9+? This would be convenient. nvm, it already is a CLI arg

@PatrickRMiles changed the title from "Draft: Implement multi-dimensional DistConv sharding" to "Implement multi-dimensional DistConv sharding" on Mar 19, 2026