ReactSim-Bench is the first benchmark for systematically evaluating the reactive capability of behavior world models in autonomous driving. It contains:
- Reactive closed-loop protocol with decoupled control. In ReactSim-Bench, the behavior world model controls the surrounding agents, while the autonomous vehicle (AV) is controlled by its own policy instead of the world model.
- Customized AV behaviors beyond the log. ReactSim-Bench contains 2,636 scenarios with AV behaviors that differ from the log and create reactive pressure on surrounding agents. They are grouped into three categories: longitudinal, directional, and lateral deviations.
- Safety and feasibility metrics. ReactSim-Bench evaluates agent-AV safety, agent-agent safety, map compliance, driving-direction compliance, and kinematic feasibility.
- Multiple baselines. We implement Transformer-based (MTR), diffusion-based (CTG, VBD), and next-token-prediction-based (SMART, CATK, TrajTok) behavior world models on ReactSim-Bench as baselines.
- Install the repository, download the data, and preprocess it: document.
- Set up the environment for each baseline and train or evaluate:
- Follow the instructions to train and evaluate your own model on ReactSim-Bench.
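
As a rough illustration of the decoupled control used by the closed-loop protocol, the toy sketch below steps a one-dimensional scene in which the AV follows its own policy and a separate world model drives the surrounding agents, observing the AV's current state at every step. Everything here is hypothetical and is not the ReactSim-Bench API: `State`, `rollout`, `braking_av_policy`, and `gap_keeping_world_model` are illustrative names, and the gap-keeping rule merely stands in for a learned behavior world model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    av_x: float            # AV position along the lane (m)
    agent_xs: List[float]  # positions of the surrounding agents (m)


def rollout(
    state: State,
    av_policy: Callable[[State], float],
    world_model: Callable[[State], List[float]],
    steps: int,
    dt: float = 0.1,
) -> List[State]:
    """Closed-loop rollout with decoupled control (toy, 1D)."""
    history = [state]
    for _ in range(steps):
        av_speed = av_policy(state)         # AV: controlled by its own policy
        agent_speeds = world_model(state)   # agents: controlled by the world model
        state = State(
            av_x=state.av_x + av_speed * dt,
            agent_xs=[x + v * dt for x, v in zip(state.agent_xs, agent_speeds)],
        )
        history.append(state)
    return history


def braking_av_policy(state: State) -> float:
    # Hypothetical customized AV behavior: the AV stays stopped in its lane.
    return 0.0


def gap_keeping_world_model(state: State) -> List[float]:
    # Hypothetical stand-in for a learned model: each agent slows down as its
    # gap to the AV approaches 5 m, and never drives faster than 10 m/s.
    return [max(0.0, min(10.0, state.av_x - x - 5.0)) for x in state.agent_xs]


history = rollout(
    State(av_x=20.0, agent_xs=[0.0]),
    braking_av_policy,
    gap_keeping_world_model,
    steps=100,
)
```

In this toy setting the agent reacts to the stopped AV and keeps its gap, which is the kind of reactive behavior the benchmark is designed to probe; a model that ignored the AV state would drive through it.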
ReactSim-Bench is built on nuPlan and contains 2,636 test scenarios:
| Category | Number of scenarios |
|---|---|
| Longitudinal deviation | 937 |
| Directional deviation | 799 |
| Lateral deviation | 900 |
| Total | 2,636 |
The data is available at Hugging Face.
| Method | A-AV Coll. Count | A-AV risky Count | A-A Coll. (%) | Offroad (%) | Direction violation (%) | Acceleration infeasibility (%) | Steering infeasibility (%) |
|---|---|---|---|---|---|---|---|
| Log Replay | 0.9829 | 1.5380 | 2.25 | 0.18 | 0.80 | 0.16 | 2.51 |
| MTR | 0.1457 | 0.5819 | 3.29 | 2.67 | 2.83 | 0.64 | 14.29 |
| CTG | 0.6195 | 0.9476 | 4.88 | 2.95 | 2.10 | 10.87 | 7.08 |
| VBD | 0.2276 | 0.4711 | 3.19 | 1.03 | 2.35 | 0.01 | 0.18 |
| SMART | 0.1419 | 0.3976 | 2.23 | 0.68 | 1.09 | 9.74 | 4.83 |
| CATK | 0.1426 | 0.4029 | 2.22 | 0.69 | 1.13 | 10.25 | 5.02 |
| TrajTok | 0.1407 | 0.4173 | 2.23 | 0.61 | 1.03 | 3.23 | 3.93 |
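
For readers who want to compare the baselines programmatically, the short sketch below copies the baseline rows from the table above and reports the lowest value per column. It assumes that lower is better for every column, which seems natural for collision, violation, and infeasibility measures but is an assumption here; `best_per_metric` is a hypothetical helper, not part of the repository.

```python
from typing import Dict, List

COLUMNS = [
    "A-AV Coll. Count",
    "A-AV risky Count",
    "A-A Coll. (%)",
    "Offroad (%)",
    "Direction violation (%)",
    "Acceleration infeasibility (%)",
    "Steering infeasibility (%)",
]

# Values copied from the results table (Log Replay is omitted on purpose,
# since it is not a learned baseline).
RESULTS: Dict[str, List[float]] = {
    "MTR":     [0.1457, 0.5819, 3.29, 2.67, 2.83, 0.64, 14.29],
    "CTG":     [0.6195, 0.9476, 4.88, 2.95, 2.10, 10.87, 7.08],
    "VBD":     [0.2276, 0.4711, 3.19, 1.03, 2.35, 0.01, 0.18],
    "SMART":   [0.1419, 0.3976, 2.23, 0.68, 1.09, 9.74, 4.83],
    "CATK":    [0.1426, 0.4029, 2.22, 0.69, 1.13, 10.25, 5.02],
    "TrajTok": [0.1407, 0.4173, 2.23, 0.61, 1.03, 3.23, 3.93],
}


def best_per_metric(results: Dict[str, List[float]], columns: List[str]) -> Dict[str, str]:
    """Return the method with the lowest value in each column (assumes lower is better)."""
    return {
        column: min(results, key=lambda method: results[method][index])
        for index, column in enumerate(columns)
    }


best = best_per_metric(RESULTS, COLUMNS)
for column, method in best.items():
    print(f"{column}: {method}")
```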
The baseline checkpoints are available at Hugging Face.
If you find ReactSim-Bench useful, please cite:
@article{reactsimbench,
title={ReactSim-Bench: Benchmarking Reactive Behavior World Model Simulation in Autonomous Driving},
author={Zhiyuan Zhang and Yanlun Peng and Jianing Zhang and Xianda Guo and Zehan Huang and Haoran Liu and Qifeng Li and Shaofeng Zhang and Xiaosong Jia and Junchi Yan},
year={2026},
eprint={2606.14058},
archivePrefix={arXiv},
primaryClass={cs.RO},
}