mup: model-agnostic muP sweep/scale routine #12

Draft: utkarshgill wants to merge 1 commit into commaai:main from utkarshgill:mup-routine
@utkarshgill

A torchtitan experiment that drives a muP learning-rate sweep for any model and reports the
transferred lr plus a width-scaling loss predictor, so the routine lives in the torchtitan code
path instead of a per-user project script.

- `spec.py`: `MuPSweepSpec` (config-name and training-id schemes, per-user `report_dir`) and `SPECS`
  for `plan_vit` (ready), `convnext` and `fastvit` (`ready=False` until their muP configs land).
- `routine.py`: collect final losses from reporterv2, `hp_table` (the transferred lr),
  `fit_predictor` (`loss(w) = L_inf + A*w^-alpha`), `build_report` (plain plotly html, no
  project-specific infra).
- `__main__.py`: `python -m torchtitan.experiments.mup grid <model>` prints the launch grid for any
  launcher to submit; `report <model>` collects the losses, writes the report, and prints the
  transferred lr plus the predicted loss.

`report_dir` defaults to a per-user location (via getpass), so the routine is not bound to one
report mount; override it with `MUP_REPORT_DIR`. Submission stays the caller's job, so there is no
cluster coupling.
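
For readers unfamiliar with the predictor form `loss(w) = L_inf + A*w^-alpha`, here is a minimal stdlib-only sketch of one way to fit it. It is not the `fit_predictor` implementation from this PR: the function names, the alpha grid, and the scan-then-least-squares approach are all assumptions made for illustration. The idea is that for a fixed `alpha` the model is linear in `L_inf` and `A`, so each candidate `alpha` can be solved in closed form and the one with the smallest squared error kept.

```python
def fit_power_law(widths, losses, alphas=None):
    """Fit loss(w) = L_inf + A * w**(-alpha) by scanning candidate alphas.

    Hypothetical helper, not the PR's fit_predictor. For each alpha the
    model is linear in (L_inf, A), so ordinary least squares gives the
    best pair in closed form; the alpha with the lowest error wins.
    """
    if alphas is None:
        # Assumed search grid: 0.01, 0.02, ..., 3.00
        alphas = [i / 100 for i in range(1, 301)]
    n = len(widths)
    mean_y = sum(losses) / n
    best = None
    for alpha in alphas:
        x = [w ** -alpha for w in widths]
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            continue  # all widths identical, slope undefined
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, losses))
        a = sxy / sxx                # slope, i.e. A
        l_inf = mean_y - a * mean_x  # intercept, i.e. L_inf
        sse = sum((l_inf + a * xi - yi) ** 2 for xi, yi in zip(x, losses))
        if best is None or sse < best[0]:
            best = (sse, l_inf, a, alpha)
    if best is None:
        raise ValueError("need at least two distinct widths")
    _, l_inf, a, alpha = best
    return l_inf, a, alpha


def predict_loss(width, l_inf, a, alpha):
    """Evaluate the fitted predictor at a new width."""
    return l_inf + a * width ** -alpha
```

With three free parameters, at least four distinct widths are needed before the fit says anything about how well the power law describes the data; with three widths it can match the points exactly.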
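
The per-user default with an override could be resolved along these lines. This is a sketch under assumptions: `MUP_REPORT_DIR` is treated as an environment variable, and the function name `default_report_dir` and the base path `/tmp/mup_reports` are hypothetical placeholders, not values taken from the PR.

```python
import getpass
import os
from pathlib import Path


def default_report_dir(base="/tmp/mup_reports", env=None):
    """Return the report directory for the current user.

    Hypothetical helper: `base` is a placeholder path, and MUP_REPORT_DIR
    is assumed to be an environment variable that takes precedence.
    """
    env = os.environ if env is None else env
    override = env.get("MUP_REPORT_DIR")
    if override:
        return Path(override)
    # getpass.getuser() reads LOGNAME, USER, LNAME or USERNAME before
    # falling back to the password database.
    return Path(base) / getpass.getuser()
```

Passing `env` explicitly keeps the function easy to test without mutating the process environment.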