plan_vit: add the muP / scaling-study ViT as a torchtitan experiment #10

Open
utkarshgill wants to merge 10 commits into commaai:main from utkarshgill:plan-vit-experiment

@utkarshgill

Self-contained plan ViT for the prune-10m muP and scaling study, mirroring path's structure: model + config_registry (standard and muP flavors, n_embd 128..2048 at head_dim 64) + a thin trainer. Two cameras are channel-stacked into in_channels=24, matching the production worldmodel I/O.

Registered as the "plan_vit" experiment so it launches like path:
run.sh torchtitan/run_train.sh -e MODULE=plan_vit -e CONFIG=plan_vit_mup_w512
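
As a rough illustration of how such a registry could be laid out, the sketch below builds one entry per width and derives the head count from the fixed head_dim of 64. The `ViTConfig` dataclass, the power-of-two width grid, and the `plan_vit_w{N}` name for the standard flavor are assumptions made for this example; only `plan_vit_mup_w512`, `n_embd`, `head_dim`, and `in_channels=24` come from the description above.

```python
from dataclasses import dataclass

HEAD_DIM = 64                          # fixed head size from the description
IN_CHANNELS = 24                       # two cameras, channel-stacked
WIDTHS = (128, 256, 512, 1024, 2048)   # assumed grid between 128 and 2048


@dataclass(frozen=True)
class ViTConfig:
    """Hypothetical config record; the real fields are not shown in the PR."""
    n_embd: int
    n_head: int
    in_channels: int
    mup: bool


def build_registry():
    """Return a name -> config mapping with a standard and a muP entry per width."""
    registry = {}
    for width in WIDTHS:
        if width % HEAD_DIM != 0:
            raise ValueError(f"n_embd={width} is not a multiple of head_dim={HEAD_DIM}")
        n_head = width // HEAD_DIM  # heads scale with width at constant head_dim
        # "plan_vit_w{N}" is an assumed name for the standard flavor.
        registry[f"plan_vit_w{width}"] = ViTConfig(width, n_head, IN_CHANNELS, mup=False)
        registry[f"plan_vit_mup_w{width}"] = ViTConfig(width, n_head, IN_CHANNELS, mup=True)
    return registry


CONFIG_REGISTRY = build_registry()
```

Keeping head_dim constant means only the number of heads grows with width, which is the usual setup for a width-scaling study.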

The config hardcoded dp_shard=8, which is only valid at world_size=8 (N=1). Launching
N=2 (world_size=16) tripped the parallel-dims assertion at startup. Derive
replicate=num_nodes and shard=local_world_size from the environment, as path does.
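
A minimal sketch of that derivation, assuming the job is launched with torchrun, which exports `WORLD_SIZE` and `LOCAL_WORLD_SIZE`. The function name is hypothetical and the way path actually reads these values is not shown in this PR.

```python
import os


def derive_data_parallel_dims(env=None):
    """Return (dp_replicate, dp_shard) from launcher environment variables.

    Assumes torchrun-style variables: WORLD_SIZE is the total number of
    ranks, LOCAL_WORLD_SIZE is the number of ranks on each node.
    """
    env = os.environ if env is None else env
    world_size = int(env["WORLD_SIZE"])
    local_world_size = int(env["LOCAL_WORLD_SIZE"])
    if world_size % local_world_size != 0:
        raise ValueError(
            f"world_size={world_size} is not a multiple of "
            f"local_world_size={local_world_size}"
        )
    num_nodes = world_size // local_world_size
    # Shard within a node, replicate across nodes, so that
    # dp_replicate * dp_shard == world_size for any node count.
    return num_nodes, local_world_size
```

With this shape, N=1 gives (1, 8) and N=2 gives (2, 8) on 8-GPU nodes, so the product always matches the world size.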
A pinned total_steps wrapped the cosine schedule, making the LR oscillate
when training.steps exceeded it. Setting total_steps to None falls back to the real training_steps.
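
The effect can be reproduced with a small sketch. The actual scheduler code is not shown in this PR, so the unclamped cosine below is an assumption that illustrates one way a too-short horizon makes the LR climb back up; `cosine_lr` and its parameters are hypothetical names.

```python
import math


def cosine_lr(step, base_lr, training_steps, total_steps=None, min_lr=0.0):
    """Cosine decay over a horizon of total_steps, or training_steps if None."""
    horizon = training_steps if total_steps is None else total_steps
    # Progress is deliberately left unclamped: once step > horizon the
    # cosine keeps turning and the LR rises again, i.e. it oscillates.
    progress = step / horizon
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Pinned horizon shorter than the run: the LR is back at base_lr by step 200.
lr_pinned = cosine_lr(step=200, base_lr=1.0, training_steps=200, total_steps=100)
# Horizon taken from the real run length: the LR reaches min_lr at the last step.
lr_fallback = cosine_lr(step=200, base_lr=1.0, training_steps=200)
```

Defaulting the horizon to the run length keeps the decay monotonic without having to keep two step counts in sync by hand.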
…ase lr)

output_mult=1 made the coord check flat for the wrong reason (compensating
errors that cancel only at low step count). Canonical muP readout: forward
multiplier 1/m, base-width init, base lr (vector-like under Adam, ninf==1).
Verified by coord check: init output slopes ~1/sqrt(m), trained output flat.
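
The init-time half of that coord check can be sketched in plain Python. This is only an illustration under stated assumptions: "base-width init" is taken to mean a readout weight std of 1/sqrt(base_width) regardless of the actual width, the hidden activations are fixed at 1.0 to stand in for order-one coordinates, and `readout_output_std` is a hypothetical helper. The learning-rate side (base lr under Adam) is not modelled here.

```python
import math
import random


def readout_output_std(width, base_width, n_out=256, seed=0):
    """Empirical std of readout outputs at init for width multiplier m."""
    rng = random.Random(seed)
    m = width / base_width                  # width multiplier
    init_std = 1.0 / math.sqrt(base_width)  # assumed base-width init
    outputs = []
    for _ in range(n_out):
        # Pre-activation: sum of `width` weights times activations of 1.0.
        pre = sum(rng.gauss(0.0, init_std) for _ in range(width))
        outputs.append(pre / m)             # forward multiplier 1/m
    mean = sum(outputs) / n_out
    var = sum((o - mean) ** 2 for o in outputs) / n_out
    return math.sqrt(var)


# The pre-activation std grows like sqrt(m); dividing by m leaves ~1/sqrt(m).
stds = {w: readout_output_std(w, base_width=128) for w in (128, 512)}
```

On a log-log plot of output std against m this gives a slope of about -1/2 at init, consistent with the coord-check result quoted above.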
Adds a default-off plan_target_last_frame flag so the single-frame ViT
supervises the last plan frame; convnext is unchanged when the flag is off.
# Conflicts:
#	torchtitan/experiments/__init__.py
Drop the Meta copyright headers, the inherited-Meta formatter churn, and the dangling
plan_vit registry entry; the ViT resolves via --module path --config vit_*.