
Add parallel eval runner for understanding benchmarks (if you have multiple GPUs, the CLI can run parallel inference on the understanding benchmarks) #5

Open
MqLeet wants to merge 1 commit into AIFrontierLab:main from MqLeet:feat/distributed-eval-runner

Conversation


MqLeet commented Apr 27, 2026

Hi @ApiaoSamaa, thanks for open-sourcing this excellent work! I'm a fan of Jindong's work. A few days ago I was using torchumm to benchmark understanding tasks and adapted it for parallel inference in a multi-GPU environment, so I'm opening this PR to contribute it back~

Summary

Introduces a small, focused runner for distributed sharded inference and refactors the 5 understanding-class eval CLIs (mmbench, mme, mmmu, mathvista, mmvet) to use it. Lifts the distributed-init / round-robin sharding / per-rank JSONL checkpoint / rank-0 merge boilerplate into one place.

After this PR, all 5 CLIs can run under torchrun --nproc_per_node=N for data-parallel evaluation, and single-card behavior is preserved (final user-facing output files are byte-equivalent for benchmarks that already had a defined output format).

PYTHONPATH=src torchrun --nproc_per_node="${GPUS}" --master_port="${MASTER_PORT}" -m umm.cli.main eval --config configs/eval/mmbench/mmbench_show_o.yaml

What's added?

distributed runner changes

  • src/umm/eval/distributed.py — DistInfo dataclass, dist init/barrier/ all-reduce, rank-shard path, glob-based shard merge/cleanup. Lazy torch import so single-card callers pay no import cost.
  • src/umm/eval/runner.py — run_sharded_inference(): round-robin sample assignment by sample_idx, per-rank JSONL shard append (flush+fsync), resume via caller-supplied done_ids, optional global max_samples cap. Accepts an infer_fn callable so the runner is unit-testable without a real model. A minimal sketch of the core loop follows.
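
For orientation, here is a hedged sketch of that core loop. The names mirror the bullet above, but the exact signature in src/umm/eval/runner.py may differ:

  # Sketch only: assumes `samples` is an enumerable dataset of dicts with an
  # "id" key, and that infer_fn returns a JSON-serializable dict per sample.
  import json
  import os

  def run_sharded_inference(samples, infer_fn, shard_path, rank, world_size,
                            done_ids=frozenset(), max_samples=None):
      written = 0
      with open(shard_path, "a", encoding="utf-8") as f:
          for sample_idx, sample in enumerate(samples):
              if max_samples is not None and sample_idx >= max_samples:
                  break                  # global cap, counted across all ranks
              if sample_idx % world_size != rank:
                  continue               # round-robin: another rank owns this index
              if sample["id"] in done_ids:
                  continue               # resume: result already on disk
              result = infer_fn(sample)  # injectable, so tests can stub the model
              f.write(json.dumps({"sample_idx": sample_idx, **result}) + "\n")
              f.flush()
              os.fsync(f.fileno())       # make the shard crash-safe
              written += 1
      return written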

per-cli changes

All 5 CLIs (mmbench, mme, mmmu, mathvista, mmvet) gain parallel support and share the same shape:

  dist_info = maybe_init_distributed()
  shard = rank_shard_path(checkpoint, dist_info.rank, dist_info.world_size)
  done_ids = {... from load_shard_items(shard)} if resume else set()
  n = run_sharded_inference(infer_fn=pipeline.run, ...)
  barrier(dist_info)
  if dist_info.rank == 0:
      merged = merge_shards(checkpoint)
      # benchmark-specific output formatting
      cleanup_shards(checkpoint)
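
The helpers this snippet calls are deliberately small. Below is a hedged sketch of plausible implementations; the real bodies live in src/umm/eval/distributed.py, and the shard-path naming scheme here is an assumption:

  # Sketch only: mirrors the DistInfo / barrier / merge behavior described above.
  import glob
  import json
  import os
  from dataclasses import dataclass

  @dataclass
  class DistInfo:
      rank: int = 0
      world_size: int = 1

  def maybe_init_distributed():
      if "RANK" not in os.environ:       # plain `python` launch: stay single-card
          return DistInfo()
      import torch.distributed as dist   # lazy import, per the module's design
      dist.init_process_group(backend="nccl")
      return DistInfo(rank=dist.get_rank(), world_size=dist.get_world_size())

  def barrier(info):
      if info.world_size > 1:
          import torch.distributed as dist
          dist.barrier()

  def rank_shard_path(checkpoint, rank, world_size):
      return f"{checkpoint}.rank{rank}-of-{world_size}.jsonl"  # hypothetical scheme

  def merge_shards(checkpoint):
      items = []
      for shard in sorted(glob.glob(f"{checkpoint}.rank*-of-*.jsonl")):
          with open(shard, encoding="utf-8") as f:
              items.extend(json.loads(line) for line in f if line.strip())
      items.sort(key=lambda x: x["sample_idx"])  # restore dataset order
      return items

  def cleanup_shards(checkpoint):
      for shard in glob.glob(f"{checkpoint}.rank*-of-*.jsonl"):
          os.remove(shard)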

Usage

  • Single-GPU (unchanged)
    PYTHONPATH=src python -m umm.cli.main eval --config <cfg>.yaml

  • Multi-GPU
    PYTHONPATH=src torchrun --nproc_per_node=8 -m umm.cli.main eval --config <cfg>.yaml

TIPS

I've adapted mme, mmbench, mmmu, mathvista, and mmvet, but haven't adapted the generation tasks. Beyond that, there may be some follow-up PRs:

Backbone LOCAL_RANK adaptation. Every backbones/models/adapter.py file needs GPU device assignment along these lines:

    def _get_runtime_device(self):
        import os
        import torch

        if not torch.cuda.is_available():
            return torch.device("cpu")
        # torchrun sets LOCAL_RANK per process; plain launches fall back to 0
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        if 0 <= local_rank < torch.cuda.device_count():
            torch.cuda.set_device(local_rank)
            return torch.device(f"cuda:{local_rank}")
        return torch.device("cuda")

and replace

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with

device = self._get_runtime_device()
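
Under torchrun, every process receives its own LOCAL_RANK environment variable, so after this replacement each rank pins itself to a distinct GPU instead of all ranks contending for cuda:0.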


Refactors mmbench/mme/mmmu/mathvista/mmvet eval CLIs to use the runner.
mathvista and mmvet gain parallel support; the others have their
duplicated dist plumbing replaced.

The runner does only "shard inference + shard merge". Each CLI keeps
its post-processing (Excel/JSON output, calculation.py invocation,
mathvista's LLM extraction) behind `if rank == 0:`.

Behavior in single-card mode: final user-facing outputs are identical. The mid-run checkpoint format changes (mme: TSV→JSONL during the run; mathvista and mmvet: dict-JSON→JSONL), so a partial run started with the prior code cannot be resumed by this code; fresh runs behave identically.

Out of scope (follow-up PRs): each backbone adapter's LOCAL_RANK handling — only show_o currently honors LOCAL_RANK; the others default to cuda:0 or device_map="auto" and need adaptation before they work correctly under torchrun multi-rank.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ApiaoSamaa self-assigned this Apr 27, 2026