Kehan Lan Kaining Ying Henghui Ding ✉️
Fudan University, China
ECCV 2026
Existing Referring Video Object Segmentation tasks focus on referring expressions describing events, actions or appearances of relevant objects within the observed frames, lacking evaluation in scenarios that require pre-decisive spatio-temporal reasoning, thereby limiting their applicability. To address this, we propose Foresight Expression Video Object Segmentation, a task that queries future events in upcoming video segments and requires masks of the objects in the observed frames as visual answers. For example, in ego-centric scenes, the question "What tool will be used?" demands reasoning over spatio-temporal cues to predict the masks of the next tool to be used, which helps with the understanding of future actions and decisions. To support this task, we introduce FeVOS, a dataset with 968 video clips, 14,525 foresight expressions, and 2,904 chain-of-thought annotations to provide explicit and interpretable reasoning steps. We further develop FeVOS-R1, an MLLM-based model trained on our dataset via a two-stage pipeline of supervised fine-tuning and reinforcement learning. FeVOS-R1 not only achieves state-of-the-art performance on FeVOS, but also demonstrates strong generalization to existing RVOS benchmarks. We hope this work can inspire more research on predictive reasoning in video perception.
- Release the dataset and model on Hugging Face.
- Release the complete training configuration and code.
conda env create -f environment.yml
conda activate vlm
mkdir -p data
mkdir -p models
Please download the complete benchmark from Hugging Face 🤗 and put it in data/.
To evaluate our baseline model FeVOS-R1, first download the model from Hugging Face 🤗 and put it in models/.
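Before launching the evaluation, it can help to confirm that both downloads actually landed where the steps above expect them. The snippet below is a minimal sketch of such a check and is not part of the FeVOS repository: the function names `dir_has_files` and `preflight` are hypothetical, and the only assumption taken from this README is that the benchmark lives in data/ and the model in models/. It only verifies that each directory exists and is non-empty; it does not validate file contents.

```shell
# Hypothetical pre-flight helpers (not shipped with FeVOS).

# Return 0 if "$1" is a directory containing at least one entry.
dir_has_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Check every directory passed as an argument.
# Prints one status line per directory and returns 1 if any check fails.
preflight() {
    preflight_status=0
    for preflight_dir in "$@"; do
        if dir_has_files "$preflight_dir"; then
            echo "ok: $preflight_dir"
        else
            echo "missing or empty: $preflight_dir" >&2
            preflight_status=1
        fi
    done
    return "$preflight_status"
}
```

After sourcing these functions, `preflight data models` reports which of the two directories still needs to be populated, and its exit status can gate the evaluation step.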
Then, run the following command to evaluate the model on the FeVOS benchmark:
bash scripts/reproduce_eval.sh
Preparing.
We would like to express our gratitude to some other projects that have contributed to our work:
If you find our paper and dataset useful for your research, please consider citing our paper.
@inproceedings{lan2026fevos,
title={FeVOS: Foresight Expression Video Object Segmentation},
author={Lan, Kehan and Ying, Kaining and Ding, Henghui},
booktitle={ECCV},
year={2026}
}
