MTGS: A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction
Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, Jean-Marc Odobez
NeurIPS 2024
[Paper] [Video] [Dataset]
This repository provides the official code and checkpoints for the MTGS model, as introduced in our NeurIPS paper, MTGS: A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction.
MTGS jointly models gaze following and social gaze prediction in multi-person scenarios. It leverages temporal information and interactions among individuals to enhance gaze prediction accuracy.
We release an improved version of MTGS that uses a frozen DINOv2 encoder, providing significant gains over the original model and surpassing recent state-of-the-art methods.
Results on the GazeFollow test set:
| Method | Resolution | AUC | Avg. Dist. | Min. Dist. |
|---|---|---|---|---|
| Gaze-LLE [CVPR'25] | 448x448 | 0.956 | 0.104 | 0.045 |
| MTGS-static [NeurIPS'24] | 224x224 | 0.929 | 0.116 | 0.059 |
| MTGS-DINO-static | 224x224 | 0.944 | 0.101 | 0.045 |
| MTGS-DINO-static | 448x448 | 0.949 | 0.098 | 0.043 |
Results on the VideoAttentionTarget test set:
| Method | Resolution | Dist. | AP_IO |
|---|---|---|---|
| Gaze-LLE [CVPR'25] | 448x448 | 0.107 | 0.897 |
| MTGS-static [NeurIPS'24] | 224x224 | 0.114 | 0.843 |
| MTGS (temporal) [NeurIPS'24] | 224x224 | 0.112 | 0.845 |
| MTGS-DINO-static | 224x224 | 0.102 | 0.852 |
| MTGS-DINO-static | 448x448 | 0.096 | 0.878 |
| MTGS-DINO (temporal) | 448x448 | 0.096 | 0.888 |
Results on the VSGaze test set:
| Method | Resolution | Dist. | AP_IO | F1_LAH (PP) | F1_LAEO (PP) | AP_SA |
|---|---|---|---|---|---|---|
| MTGS-static [NeurIPS'24] | 224x224 | 0.108 | 0.946 | 0.806 | 0.599 | 0.555 |
| MTGS (temporal) [NeurIPS'24] | 224x224 | 0.107 | 0.940 | 0.812 | 0.603 | 0.576 |
| MTGS-DINO-static | 224x224 | 0.096 | 0.951 | 0.833 | 0.624 | 0.551 |
| MTGS-DINO-static | 448x448 | 0.089 | 0.960 | 0.832 | 0.621 | 0.595 |
| MTGS-DINO (temporal) | 448x448 | 0.087 | 0.961 | 0.837 | 0.626 | 0.596 |
If you use this updated version, please reference these improved results in your work.
Follow the instructions on the VSGaze dataset page to download and prepare the data.
After downloading, update the paths in `mtgs/config/config.yaml`:

- Set `ann_root` to the directory containing the annotation H5 files.
- Set `root` to the root path of the images for each dataset.
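If you prefer to script this step, the snippet below is a minimal sketch of updating these entries programmatically. It assumes PyYAML is available and uses placeholder paths with a flat key layout; check the released `config.yaml` for the actual structure.

```python
# Minimal sketch: update dataset paths in mtgs/config/config.yaml.
# The paths are placeholders and the flat key layout is an assumption;
# adapt to the actual structure of the released config.
import yaml

cfg_path = "mtgs/config/config.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["ann_root"] = "/data/vsgaze/annotations"  # directory containing the annotation H5 files
cfg["root"] = "/data/vsgaze/images"           # root path of the images for each dataset

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```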
We use PyTorch for our experiments.
Create a conda environment:
```bash
conda create -n mtgs-env python=3.8
conda activate mtgs-env
pip install -r requirements.txt
```

Then install the MTGS package. Go to the root directory and run:

```bash
pip install -e .
```

NOTE: All scripts are inside the `scripts` folder.
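To quickly verify the installation, here is a minimal check (it assumes the package is importable as `mtgs`; adjust the import name if it differs):

```python
# Quick sanity check of the environment; the `mtgs` import name is an assumption.
import torch
import mtgs  # noqa: F401  (fails here if the package was not installed correctly)

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```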
To run the demo:

- Update `YOLO_CHECKPOINT_FILE` to point to the head detection model checkpoint (we use a pre-trained YOLOv5 model trained on the CrowdHuman dataset).
- Also update `MTGS_CHECKPOINT_FILE` with the path to the `mtgs-static-vsgaze` model, and set the paths to `VIDEO_FILE` and `OUTPUT_FOLDER`.

```bash
bash demo.sh
```
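If you want to sanity-check the head-detection checkpoint independently of `demo.sh` (purely illustrative; the demo script handles model loading itself, and the paths below are placeholders), the standard YOLOv5 hub API can be used:

```python
# Illustrative only: load the CrowdHuman-trained YOLOv5 head detector via the
# standard YOLOv5 hub API and run it on one frame. demo.sh does not rely on this snippet.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="/path/to/yolov5_crowdhuman.pt")
results = model("/path/to/frame.jpg")  # head detections for a single image
results.print()                        # summary of detected boxes
```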
Training is done in two stages. First, train on GazeFollow: update `GAZE_WEIGHTS` to point to the pre-trained gaze backbone, then run:

```bash
bash train_gazefollow.sh
```

Next, update `WEIGHTS` to point to the GazeFollow pre-trained checkpoint and run:

```bash
bash train_vsgaze.sh
```

This trains the temporal MTGS model.
NOTE:

- You can also train on individual datasets in VSGaze by setting the `experiment.dataset` variable to `childplay`, `videocoatt`, `uco_laeo`, or `vat`.
- Depending on batch size and dataset, you may need to update the `data.num_steps` variable to ensure a smooth learning rate schedule.
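For example, if `data.num_steps` corresponds to the total number of optimizer steps covered by the schedule (an assumption; check the config), a rough value can be derived from the dataset size, batch size, and number of epochs:

```python
# Back-of-the-envelope estimate of the total number of training steps, so that the
# learning-rate schedule spans the whole run. All values below are placeholders.
import math

num_samples = 70_000   # training samples in the chosen dataset
batch_size = 48
epochs = 30

steps_per_epoch = math.ceil(num_samples / batch_size)
num_steps = steps_per_epoch * epochs
print(f"steps/epoch: {steps_per_epoch}, total steps: {num_steps}")
```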
To evaluate, update the `TEST_CHECKPOINT` path and run:

```bash
bash test_vat.sh
```

This performs evaluation on VideoAttentionTarget. To evaluate on a different dataset, set `DATASET` to one of `gazefollow`, `vsgaze`, `childplay`, `videocoatt`, `uco_laeo`, or `vat`.
We provide additional scripts for computing in-out and post-processing metrics under `mtgs/performance/`.
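For reference, the distance metrics in the tables above are Euclidean (L2) distances between the predicted gaze point and the annotated gaze targets in normalized image coordinates. The sketch below illustrates how the GazeFollow Avg./Min. Dist. values are computed with illustrative numbers; the scripts in `mtgs/performance/` are the reference implementation.

```python
# Illustrative computation of the GazeFollow distance metrics: L2 distance between the
# predicted gaze point and the human annotations, in normalized [0, 1] image coordinates.
# Avg. Dist. uses the mean annotation, Min. Dist. the closest single annotation.
import numpy as np

pred = np.array([0.62, 0.40])                                   # predicted gaze target (x, y)
annots = np.array([[0.60, 0.38], [0.66, 0.45], [0.59, 0.41]])   # annotations from several raters

avg_dist = np.linalg.norm(pred - annots.mean(axis=0))
min_dist = np.linalg.norm(pred - annots, axis=1).min()
print(f"Avg. Dist.: {avg_dist:.3f}, Min. Dist.: {min_dist:.3f}")
```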
Pre-trained models are available on Huggingface.
| Model Name | Download Link | Description |
|---|---|---|
| `mtgs-static-gazefollow` | Link | MTGS model trained on GazeFollow without temporal modeling. |
| `mtgs-static-vsgaze` | Link | MTGS model trained on VSGaze without temporal modeling. |
| `mtgs-vsgaze` | Link | MTGS model trained on VSGaze with temporal modeling. |
| `resnet18-gaze360` | Link | ResNet-18 trained on Gaze360, used for initializing the gaze backbone. |
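The checkpoints can also be fetched programmatically with `huggingface_hub`; the repository id and filename below are placeholders, so substitute the actual values from the Huggingface page:

```python
# Download a checkpoint from the Huggingface Hub; repo_id and filename are placeholders.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<org>/<mtgs-model-repo>",   # replace with the actual repository id
    filename="mtgs-static-vsgaze.ckpt",  # replace with the actual checkpoint filename
)
print(ckpt_path)
```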
If you use our code, please cite:
```bibtex
@article{gupta2024mtgs,
  title={MTGS: A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction},
  author={Gupta, Anshul and Tafasca, Samy and Farkhondeh, Arya and Vuillecard, Pierre and Odobez, Jean-Marc},
  journal={Advances in Neural Information Processing Systems},
  volume={37},
  pages={15646--15673},
  year={2024}
}
```

Parts of the code are adapted from the ViT-Adapter codebase. The demo code uses a pre-trained YOLOv5-based head detection model.
The code template was based on Sharingan.
We thank the authors for their contributions.
This code is released under the GPL-3.0-only license. See the LICENSE file for more details. Please also see the DEPENDENCIES file for license information regarding third-party software used in this project.

