Jinyu Liu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang
Fudan University
TL;DR: Unison evaluates Unified Multimodal Models (UMMs) by leveraging the synergy between understanding and generation capabilities across four dimensions. The automatic evaluation model, Unison-Judge, achieves 88.7% alignment with human judgments.
- [2026/06/26] Annotations about human consistency are released.
- [2026/06/25] Unison-Bench and Unison-Judge are released.
- Inference and evaluation scripts
- Unison Benchmark data and Unison-Judge model weights
- The UMM toolkit TorchUMM will support Unison in the next few days
- Evaluation results for more recent open-source models (Emu3.5, Ovis-U1, Ming series etc.) and the latest closed-source models (GPT-5.5 and Gemini 3.1 series)
Contact: If you have any questions, feel free to contact us at liujy24@m.fudan.edu.cn.
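| Model | Params | Internal Consistency: Und. | Internal Consistency: Gen. | Internal Consistency: Uni. | Und.-Guided Gen.: Und. | Und.-Guided Gen.: Gen. | Und.-Guided Gen.: Uni. | Gen-Guided Und.: Und. | Gen-Guided Und.: Gen. | Gen-Guided Und.: Uni. | Mutual Enhancement: Und. | Mutual Enhancement: Gen. | Mutual Enhancement: Uni. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|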
| Show-o | 1.3B | 88.3 | 64.7 | 58.5 | 8.90 | - | - | 12.0 | - | - | - | - | - | - |
| Janus-Pro | 1.5B | 94.4 | 47.1 | 45.0 | 0.3 | - | - | 19.2 | - | - | - | - | - | - |
| Show-o2 | 1.5B | 96.0 | 67.9 | 65.8 | 26.7 | - | - | 9.4 | - | - | - | - | - | - |
| D-DiT | 2B | 86.5 | 65.0 | 58.1 | 0.2 | - | - | 6.8 | - | - | - | - | - | - |
| ILLUME+ | 3B | 43.4 | 19.9 | 10.5 | 10.3 | 7.7 | 9.0 | 11.3 | 30.1 | 15.1 | 1.0 | 5.5 | 3.2 | 9.4 |
| Janus-Pro | 7B | 95.7 | 71.7 | 69.8 | 3.2 | - | - | 15.1 | - | - | - | - | - | - |
| Show-o2 | 7B | 97.2 | 73.8 | 72.5 | 9.9 | - | - | 9.2 | - | - | - | - | - | - |
| ILLUME+ | 7B | 80.2 | 20.4 | 16.7 | 12.4 | 10.4 | 11.4 | 11.3 | 27.7 | 13.9 | 2.7 | 6.8 | 4.8 | 11.7 |
| OmniGen2 🥈 | 7B | 92.3 | 79.0 | 74.5 | 61.3 | 42.6 | 52.0 | 19.7 | 41.9 | 30.9 | 45.0 | 50.3 | 47.7 | 51.3 |
| TokenFlow | 14B | 93.0 | 47.1 | 44.5 | 20.1 | - | - | 17.0 | - | - | - | - | - | - |
| BAGEL 🥇 | 14B | 96.0 | 82.5 | 80.3 | 57.6 | 78.1 | 67.9 | 28.2 | 41.6 | 32.0 | 7.2 | 57.7 | 32.5 | 53.2 |
| SEED-X | 17B | 82.8 | 38.9 | 34.2 | 18.6 | 13.7 | 16.1 | 13.5 | 27.4 | 20.8 | 0.2 | 16.8 | 8.5 | 19.9 |
| UniWorld-V1 🥉 | 19B | 92.6 | 68.5 | 65.1 | 63.4 | 26.4 | 44.9 | 22.8 | 32.0 | 26.9 | 46.4 | 16.2 | 31.3 | 42.1 |
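| Model | Params | Internal Consistency: Und. | Internal Consistency: Gen. | Internal Consistency: Uni. | Und.-Guided Gen.: Und. | Und.-Guided Gen.: Gen. | Und.-Guided Gen.: Uni. | Gen-Guided Und.: Und. | Gen-Guided Und.: Gen. | Gen-Guided Und.: Uni. | Mutual Enhancement: Und. | Mutual Enhancement: Gen. | Mutual Enhancement: Uni. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|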
| Gemini 3 Pro | - | 98.3 | 88.1 | 86.9 | 71.0 | 82.8 | 76.9 | 42.2 | 46.5 | 43.9 | 65.3 | 77.4 | 71.4 | 69.8 |
| GPT-5.2 | - | 98.6 | 86.3 | 84.7 | 69.7 | 85.7 | 77.7 | 44.8 | 58.2 | 52.7 | 69.1 | 71.2 | 70.2 | 71.3 |
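
The Overall column in both tables appears to match the unweighted mean of the four Uni. scores to within rounding. This is an observation from the reported numbers, not a definition stated here, so the sketch below only checks a few rows against that assumed formula. The function name `overall_from_uni` is hypothetical.

```python
from statistics import mean

# Uni. scores copied from the tables above, in the order:
# Internal Consistency, Und.-Guided Gen., Gen-Guided Und., Mutual Enhancement.
UNI_SCORES = {
    "OmniGen2": [74.5, 52.0, 30.9, 47.7],
    "BAGEL": [80.3, 67.9, 32.0, 32.5],
    "GPT-5.2": [84.7, 77.7, 52.7, 70.2],
}

# Overall values as reported in the tables.
REPORTED_OVERALL = {"OmniGen2": 51.3, "BAGEL": 53.2, "GPT-5.2": 71.3}


def overall_from_uni(uni_scores):
    """Unweighted mean of the per-dimension Uni. scores (assumed formula)."""
    return mean(uni_scores)


for model, scores in UNI_SCORES.items():
    computed = overall_from_uni(scores)
    print(f"{model}: mean of Uni. = {computed:.3f}, reported = {REPORTED_OVERALL[model]}")
```

Small differences between the computed mean and the reported value are expected, since the table entries are already rounded to one decimal place.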
Download Unison-Bench from HuggingFace into data/ at the repo root:
huggingface-cli download FudanCVL/Unison \
  --repo-type dataset --local-dir data/

The expected layout:
Unison/
└── data/
    ├── Internal_Consistency/
    ├── Und_Guided_Gen/
    ├── Gen_Guided_Und/
    └── Mutual_Enhancement/
Both launch scripts default to DATA_DIR=../data, so no extra flags are needed. To use a different path, pass --data-dir /path/to/data or set DATA_DIR.
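
Before launching a run, it can help to confirm that the download produced the layout above. The following is a minimal sketch, not part of the repository: the helper name `missing_task_dirs` is hypothetical, and it assumes the four task folders sit directly under the data directory as shown.

```python
import os
from pathlib import Path

# Task folders taken from the expected layout above.
TASK_DIRS = (
    "Internal_Consistency",
    "Und_Guided_Gen",
    "Gen_Guided_Und",
    "Mutual_Enhancement",
)


def missing_task_dirs(data_dir):
    """Return the names of expected task folders that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in TASK_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Mirrors the documented default of DATA_DIR=../data.
    data_dir = os.environ.get("DATA_DIR", "../data")
    missing = missing_task_dirs(data_dir)
    if missing:
        print(f"Missing under {data_dir}: {', '.join(missing)}")
    else:
        print(f"All task folders found under {data_dir}")
```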
cd Inference_Pipeline
UMM=/data/Unified_Models ./setup_envs.sh base

Creates the unison conda env from the root requirements.txt. This env covers both the inference and the evaluation pipeline.
# All models at once
UMM=/data/Unified_Models ./setup_envs.sh
# Or selected models
UMM=/data/Unified_Models ./setup_envs.sh bagel omnigen2

| Group | conda env | Upstream repo |
|---|---|---|
| bagel | bagel | ByteDance-Seed/Bagel |
| janus | janus | deepseek-ai/Janus |
| omnigen2 | omnigen2 | VectorSpaceLab/OmniGen2 |
| seedx | seedx | AILab-CVC/SEED-X |
| showo | showo2 | showlab/Show-o |
| tokenflow | tokenflow | ByteVisionLab/TokenFlow |
| uniworld | univa | PKU-YuanGroup/UniWorld |
| illume | illume | illume-unified-mllm/ILLUME_plus |
| ddit | d-dit | zijieli-Jlee/Dual-Diffusion |
Each group clones its upstream repo into $UMM/<Repo> and installs it into the corresponding conda env. The script is idempotent; logs go to setup_logs/.
Model configs in Inference_Pipeline/config/*.json reference local weight paths using the placeholder root /path/to/Unified_Models/.... Edit each config to point at your local checkout, e.g.:
{
  "model_name": "UniWorld-V1",
  "model_path": "/path/to/Unified_Models/UniWorld/UniWorld-V1/model_weights/UniWorld-V1",
  "api_type": "uniworld",
  "conda_env": "univa",
  "capabilities": ["understanding", "generation", "editing"]
}

download_weights.sh fetches weights for all model backends. Set the local weight root and pick models:
UMM=/data/Unified_Models ./download_weights.sh # everything
UMM=/data/Unified_Models ./download_weights.sh bagel showo1   # selected groups

Gated repos (FLUX.1-dev, SD3) need huggingface-cli login + license acceptance. Run setup_envs.sh and download_weights.sh with the same UMM so code and weights share one root.
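
Rather than editing each model config by hand, the placeholder root /path/to/Unified_Models could be swapped for the local weight root in one pass. The sketch below is illustrative only: the helper names are hypothetical, and it assumes each config is a single JSON object with a top-level model_path, and that weights under the local root follow the same relative layout as the placeholder paths. Any other path fields a config might contain are left untouched.

```python
import json
from pathlib import Path

# Placeholder root used by the shipped configs, per the text above.
PLACEHOLDER_ROOT = "/path/to/Unified_Models"


def retarget_model_path(config, local_root):
    """Return a copy of config with the placeholder root in model_path replaced."""
    updated = dict(config)
    model_path = updated.get("model_path", "")
    if model_path.startswith(PLACEHOLDER_ROOT):
        suffix = model_path[len(PLACEHOLDER_ROOT):]
        updated["model_path"] = local_root.rstrip("/") + suffix
    return updated


def retarget_config_dir(config_dir, local_root):
    """Rewrite every *.json config in config_dir in place; return changed file names."""
    changed = []
    for path in sorted(Path(config_dir).glob("*.json")):
        config = json.loads(path.read_text(encoding="utf-8"))
        updated = retarget_model_path(config, local_root)
        if updated != config:
            path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")
            changed.append(path.name)
    return changed
```

For example, `retarget_config_dir("Inference_Pipeline/config", "/data/Unified_Models")` would point the configs at the same root passed as UMM. It is worth diffing the result before running inference, since the actual weight locations depend on what download_weights.sh produced.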
The default evaluation backend runs Unison-Judge.
Where to put it: download the checkpoint into Evaluation_Pipeline/unison-judge/. That is the default path used by evaluate_unison.py and run_evaluate_unison.sh, so no flags are needed:
Unison/
└── Evaluation_Pipeline/
    └── unison-judge/            # <- put Unison-Judge weights here
        ├── config.json
        ├── model-*.safetensors
        └── ...
To keep it elsewhere, set LOCAL_JUDGE_MODEL=/path/to/judge or pass --local-model-path /path/to/judge. No local judge weights are needed when using the api backend.
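
The three ways of locating the judge weights can be pictured as a simple lookup. The following sketch is not the logic of evaluate_unison.py; it assumes the command-line flag takes precedence over LOCAL_JUDGE_MODEL, which in turn overrides the default directory, and the function names are hypothetical. It also checks for the files shown in the layout above.

```python
import os
from pathlib import Path

# Default location described above, relative to the repo root.
DEFAULT_JUDGE_DIR = Path("Evaluation_Pipeline") / "unison-judge"


def resolve_judge_dir(cli_path=None, env=None):
    """Pick the judge directory: flag first, then env var, then default (assumed order)."""
    env = os.environ if env is None else env
    if cli_path:
        return Path(cli_path)
    if env.get("LOCAL_JUDGE_MODEL"):
        return Path(env["LOCAL_JUDGE_MODEL"])
    return DEFAULT_JUDGE_DIR


def judge_weights_present(judge_dir):
    """True if config.json and at least one model-*.safetensors shard exist."""
    judge_dir = Path(judge_dir)
    has_config = (judge_dir / "config.json").is_file()
    has_shards = any(judge_dir.glob("model-*.safetensors"))
    return has_config and has_shards
```

A check like `judge_weights_present(resolve_judge_dir())` before a long evaluation run can surface a misplaced checkpoint early.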
cd Inference_Pipeline
# Run all tasks on one model
GPUS=0,1,2,3,4,5,6,7 MODELS=BAGEL-7B-MoT TASKS=IC,UGG,GGU,ME ./run.sh
# Select tasks or test with 2 items
GPUS=0,1,2,3 MODELS=UniWorld-V1 TASKS=IC,GGU ./run.sh
GPUS=0 MODELS=Janus-Pro-7B TEST_MODE=true ./run.sh

Results are written to result/<ModelName>/<TaskID>/<TaskID>_<ModelName>_results.csv.
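
The result path pattern can be expanded programmatically, for instance to see which tasks still lack output before starting evaluation. This is a small sketch with hypothetical helper names; it assumes that `<TaskID>` is the same abbreviation passed in TASKS (IC, UGG, GGU, ME) and that `result/` is resolved relative to the current directory.

```python
from pathlib import Path


def result_csv_path(model_name, task_id, result_root="result"):
    """Build result/<ModelName>/<TaskID>/<TaskID>_<ModelName>_results.csv."""
    filename = f"{task_id}_{model_name}_results.csv"
    return Path(result_root) / model_name / task_id / filename


def missing_results(model_name, task_ids, result_root="result"):
    """Return the task IDs whose result CSV does not exist yet."""
    return [
        task_id
        for task_id in task_ids
        if not result_csv_path(model_name, task_id, result_root).is_file()
    ]


if __name__ == "__main__":
    print(result_csv_path("BAGEL-7B-MoT", "IC").as_posix())
```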
cd Evaluation_Pipeline
# Local judge (default): uses Unison-Judge weights
GPU_IDS=0,1,2,3 MODELS=BAGEL-7B-MoT ./run_evaluate_unison.sh
# Select tasks or evaluate several models at once
MODELS=BAGEL-7B-MoT TASKS=IC,GGU ./run_evaluate_unison.sh
MODELS="BAGEL-7B-MoT,UniWorld-V1" GPU_IDS=0,1,2,3,4,5,6,7 ./run_evaluate_unison.sh
# Closed-source model API judge
JUDGE_BACKEND=api OPENAI_API_KEY=sk-... MODELS=UniWorld-V1 ./run_evaluate_unison.sh
# Aggregate results across models
python aggregate_results.py   # -> evaluation_summary.json

Output per model: eval_<ModelName>.json.
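
To inspect the per-model outputs directly, the eval_<ModelName>.json files can be collected by file name. The sketch below is not a reimplementation of aggregate_results.py: the function name is hypothetical, the search directory is an assumption, and because the JSON schema is not described here, each file's content is returned as loaded without interpretation.

```python
import json
from pathlib import Path


def load_eval_files(eval_dir="."):
    """Map each model name to the parsed content of its eval_<ModelName>.json."""
    results = {}
    for path in sorted(Path(eval_dir).glob("eval_*.json")):
        # Strip the "eval_" prefix; Path.stem has already dropped ".json".
        model_name = path.stem[len("eval_"):]
        with path.open(encoding="utf-8") as handle:
            results[model_name] = json.load(handle)
    return results


if __name__ == "__main__":
    for model_name in load_eval_files():
        print(model_name)
```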
We sincerely thank the open-source community for their outstanding contributions. Unison-Judge is built upon Qwen3-VL. The evaluated models, including BAGEL, UniWorld, OmniGen2, Show-o, Janus-Pro, SEED-X, TokenFlow, ILLUME+, and D-DiT, among others, form the foundation of this benchmark. We are grateful to all the authors for making their work publicly available.
If you find this work useful, please cite:
@inproceedings{liu2026unison,
title = {{Unison}: Benchmarking Unified Multimodal Models via Synergistic Understanding and Generation},
author = {Liu, Jinyu and Shuai, Xincheng and Ding, Henghui and Jiang, Yu-Gang},
booktitle = {International Conference on Machine Learning},
year = {2026}
}