
Add parallel eval runner for understanding benchmarks (if you have multiple GPUs, the CLI can run parallel inference on the understanding benchmarks) #5

Open
MqLeet wants to merge 1 commit into AIFrontierLab:main from MqLeet:feat/distributed-eval-runner

Conversation


MqLeet commented Apr 27, 2026

Hi @ApiaoSamaa, thanks for open-sourcing this excellent work! I'm a fan of Jindong's work. A few days ago I was using torchumm to benchmark understanding tasks and adapted it for parallel inference in a multi-GPU environment, so I'm opening this PR to contribute it back~

Summary

Introduces a small, focused runner for distributed sharded inference and refactors the 5 understanding-class eval CLIs (mmbench, mme, mmmu, mathvista, mmvet) to use it. Lifts the distributed-init / round-robin sharding / per-rank JSONL checkpoint / rank-0 merge boilerplate into one place.

After this PR, all 5 CLIs can run under torchrun --nproc_per_node=N for data-parallel evaluation, and single-card behavior is preserved (final user-facing output files are byte-equivalent for benchmarks that already had a defined output format).

PYTHONPATH=src torchrun --nproc_per_node="${GPUS}" --master_port="${MASTER_PORT}" -m umm.cli.main eval --config configs/eval/mmbench/mmbench_show_o.yaml

What's added?

distributed runner changes

  • src/umm/eval/distributed.py — DistInfo dataclass, dist init/barrier/ all-reduce, rank-shard path, glob-based shard merge/cleanup. Lazy torch import so single-card callers pay no import cost.
  • src/umm/eval/runner.py — run_sharded_inference(): round-robin sample assignment by sample_idx, per-rank JSONL shard append (flush+fsync), resume via caller-supplied done_ids, optional global max_samples cap. Accepts an infer_fn callable so the runner is unit-testable without a real model. A minimal sketch of the core loop follows.
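
For orientation, here is a hedged sketch of that core loop. The names mirror the bullet above, but the exact signature in src/umm/eval/runner.py may differ:

  # Sketch only: assumes `samples` is an enumerable dataset of dicts with an
  # "id" key, and that infer_fn returns a JSON-serializable dict per sample.
  import json
  import os

  def run_sharded_inference(samples, infer_fn, shard_path, rank, world_size,
                            done_ids=frozenset(), max_samples=None):
      written = 0
      with open(shard_path, "a", encoding="utf-8") as f:
          for sample_idx, sample in enumerate(samples):
              if max_samples is not None and sample_idx >= max_samples:
                  break                  # global cap, counted across all ranks
              if sample_idx % world_size != rank:
                  continue               # round-robin: another rank owns this index
              if sample["id"] in done_ids:
                  continue               # resume: result already on disk
              result = infer_fn(sample)  # injectable, so tests can stub the model
              f.write(json.dumps({"sample_idx": sample_idx, **result}) + "\n")
              f.flush()
              os.fsync(f.fileno())       # make the shard crash-safe
              written += 1
      return written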

per-cli changes

All 5 CLIs (mmbench, mme, mmmu, mathvista, mmvet) gain parallel support and share the same shape:

  dist_info = maybe_init_distributed()
  shard = rank_shard_path(checkpoint, dist_info.rank, dist_info.world_size)
  done_ids = {... from load_shard_items(shard)} if resume else set()
  n = run_sharded_inference(infer_fn=pipeline.run, ...)
  barrier(dist_info)
  if dist_info.rank == 0:
      merged = merge_shards(checkpoint)
      # benchmark-specific output formatting
      cleanup_shards(checkpoint)
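
The helpers this snippet calls are deliberately small. Below is a hedged sketch of plausible implementations; the real bodies live in src/umm/eval/distributed.py, and the shard-path naming scheme here is an assumption:

  # Sketch only: mirrors the DistInfo / barrier / merge behavior described above.
  import glob
  import json
  import os
  from dataclasses import dataclass

  @dataclass
  class DistInfo:
      rank: int = 0
      world_size: int = 1

  def maybe_init_distributed():
      if "RANK" not in os.environ:       # plain `python` launch: stay single-card
          return DistInfo()
      import torch.distributed as dist   # lazy import, per the module's design
      dist.init_process_group(backend="nccl")
      return DistInfo(rank=dist.get_rank(), world_size=dist.get_world_size())

  def barrier(info):
      if info.world_size > 1:
          import torch.distributed as dist
          dist.barrier()

  def rank_shard_path(checkpoint, rank, world_size):
      return f"{checkpoint}.rank{rank}-of-{world_size}.jsonl"  # hypothetical scheme

  def merge_shards(checkpoint):
      items = []
      for shard in sorted(glob.glob(f"{checkpoint}.rank*-of-*.jsonl")):
          with open(shard, encoding="utf-8") as f:
              items.extend(json.loads(line) for line in f if line.strip())
      items.sort(key=lambda x: x["sample_idx"])  # restore dataset order
      return items

  def cleanup_shards(checkpoint):
      for shard in glob.glob(f"{checkpoint}.rank*-of-*.jsonl"):
          os.remove(shard)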

Usage

  • Single-GPU (unchanged)
    PYTHONPATH=src python -m umm.cli.main eval --config <cfg>.yaml

  • Multi-GPU
    PYTHONPATH=src torchrun --nproc_per_node=8 -m umm.cli.main eval --config <cfg>.yaml

TIPS

I've adapted mme, mmbench, mmmu, mathvista, and mmvet, but haven't adapted the generation tasks. Beyond that, there may be some follow-up PRs:

Backbone LOCAL_RANK adaptation. Every backbones/models/adapter.py file needs GPU device assignment along these lines:

    def _get_runtime_device(self):
        import os
        import torch

        if not torch.cuda.is_available():
            return torch.device("cpu")
        # torchrun sets LOCAL_RANK per process; plain launches fall back to 0
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        if 0 <= local_rank < torch.cuda.device_count():
            torch.cuda.set_device(local_rank)
            return torch.device(f"cuda:{local_rank}")
        return torch.device("cuda")

and replace

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with

device = self._get_runtime_device()
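
Under torchrun, every process receives its own LOCAL_RANK environment variable, so after this replacement each rank pins itself to a distinct GPU instead of all ranks contending for cuda:0.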


Refactors mmbench/mme/mmmu/mathvista/mmvet eval CLIs to use the runner.
mathvista and mmvet gain parallel support; the others have their
duplicated dist plumbing replaced.

The runner does only "shard inference + shard merge". Each CLI keeps
its post-processing (Excel/JSON output, calculation.py invocation,
mathvista's LLM extraction) behind `if rank == 0:`.

Behavior in single-card mode: final user-facing outputs are identical. The mid-run checkpoint format changes (mme: TSV→JSONL during the run; mathvista and mmvet: dict-JSON→JSONL), so a partial run started with the prior code cannot be resumed by this code; fresh runs behave identically.

Out of scope (follow-up PRs): each backbone adapter's LOCAL_RANK handling — only show_o currently honors LOCAL_RANK; the others default to cuda:0 or device_map="auto" and need adaptation before they work correctly under torchrun multi-rank.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ApiaoSamaa self-assigned this Apr 27, 2026