@simondong1 simondong1 commented Jan 27, 2026

Issue

When using recent Megatron-LM versions (since Dec 10, 2025, NVIDIA/Megatron-LM#2608), model initialization fails with `TypeError: ... got an unexpected keyword argument 'pg_collection'` (or `config`), because the upstream Megatron `get_model` interface now passes these arguments to the model provider explicitly. This causes `convert_hf_to_torch_dist.py` to fail.

From Megatron-LM:

```python
def get_model(
    model_provider_func,
    model_type=ModelType.encoder_or_decoder,
    wrap_with_ddp=True,
    config=None,        # <--- NEW PARAMETER
    pg_collection=None, # <--- NEW PARAMETER
):
```

Fix

  • Custom providers: Updated `wrapped_model_provider` to dynamically inspect signatures and only pass `pg_collection`, `vp_stage`, or `config` if the custom function accepts them.
  • Standard provider: Updated `model_provider` to accept `pg_collection` and `config`, and to forward `pg_collection` to the `GPTModel` kwargs for correct distributed initialization.
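The signature-inspection approach for custom providers can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: `make_wrapped_model_provider` is a hypothetical helper name, and the provider signatures are assumptions; only the kwarg names (`pg_collection`, `vp_stage`, `config`) come from the description above.

```python
import inspect

def make_wrapped_model_provider(custom_provider):
    """Hypothetical sketch: forward the new Megatron kwargs
    (pg_collection / vp_stage / config) only if the wrapped
    custom provider declares them in its signature."""
    accepted = set(inspect.signature(custom_provider).parameters)

    def wrapped_model_provider(pre_process=True, post_process=True, **extra):
        # Keep only the new kwargs that the custom function accepts;
        # silently drop the rest so legacy providers keep working.
        kwargs = {
            k: v for k, v in extra.items()
            if k in ("pg_collection", "vp_stage", "config") and k in accepted
        }
        return custom_provider(pre_process, post_process, **kwargs)

    return wrapped_model_provider
```

A legacy provider that knows nothing about `pg_collection` is then callable through the new `get_model` interface without raising a `TypeError`, while an updated provider still receives the argument.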


Development

Successfully merging this pull request may close these issues.

An error occurred while converting the Huggingface to a form that can be loaded by Megatron.