Understanding human-object interactions is fundamental in First Person Vision (FPV). Visual tracking algorithms that follow the objects manipulated by the camera wearer can provide useful information to effectively model such interactions. In recent years, the computer vision community has significantly improved the performance of tracking algorithms for a large variety of target objects and scenarios. Despite a few previous attempts to exploit trackers in the FPV domain, a methodical analysis of the performance of state-of-the-art trackers is still missing. This research gap raises the question of whether current solutions can be used "off-the-shelf" or whether more domain-specific investigations should be carried out. This paper aims to answer such questions. We present the first systematic investigation of single object tracking in FPV. Our study extensively analyses the performance of 42 algorithms, including generic object trackers and baseline FPV-specific trackers. The analysis is carried out by focusing on different aspects of the FPV setting, introducing new performance measures, and evaluating trackers in relation to FPV-specific tasks. The study is made possible through the introduction of TREK-150, a novel benchmark dataset composed of 150 densely annotated video sequences. Our results show that object tracking in FPV poses new challenges to current visual trackers. We highlight the factors causing such behavior and point out possible research directions. Despite their difficulties, we prove that trackers bring benefits to FPV downstream tasks requiring short-term object tracking. We expect that generic object tracking will gain popularity in FPV as new and FPV-specific methodologies are investigated.

IJCV Paper (2022)

ICCVW Paper (2021)

Dataset Stats

150 Videos
97296 Frames
20 Action Verbs
34 Target Objects
958 Object Interactions
(a) Distribution of the sequences within TREK-150 with respect to the attributes describing visual changes of the target or of the scene. (b) Comparison of the distributions of common attributes in different benchmarks. Distributions of (c) action verb labels, and (d) target object categories (nouns).


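Performance of 20 of the 42 selected trackers on the proposed TREK-150 benchmark under the one-pass evaluation (OPE) protocol. In brackets, next to the trackers’ names, we report the success score (SS), normalized precision score (NPS), and generalized success robustness (GSR) values.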
Qualitative examples of TREK-150's tracking sequences along with the results of the top-performing trackers.
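The success score reported for trackers under one-pass evaluation is commonly computed as the area under a success plot, i.e. the fraction of frames whose predicted box overlaps the ground truth by more than a threshold, averaged over thresholds. The following is a minimal sketch of that common definition, not the TREK-150 toolkit's code. The function names, the (x, y, width, height) box format, and the 21 evenly spaced thresholds are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box format for this sketch: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_score(
    preds: Sequence[Box], gts: Sequence[Box], num_thresholds: int = 21
) -> float:
    """Area under the success plot for one sequence.

    For each overlap threshold in [0, 1], compute the fraction of frames
    whose IoU is strictly greater than the threshold, then average those
    fractions over all thresholds.
    """
    if len(preds) != len(gts) or len(gts) == 0:
        raise ValueError("preds and gts must be non-empty and equally long")
    if num_thresholds < 2:
        raise ValueError("num_thresholds must be at least 2")
    overlaps: List[float] = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical two-frame sequence: one perfect and one partial prediction.
    ground_truth = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 10.0, 10.0)]
    predictions = [(0.0, 0.0, 10.0, 10.0), (7.0, 7.0, 10.0, 10.0)]
    print(success_score(predictions, ground_truth))
```

In one-pass evaluation the tracker is initialized once on the first frame and run to the end of the sequence without resets, so `preds` would hold the tracker's output for every annotated frame of a sequence.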



@article{dunnhofer2022visual,
  author = {Dunnhofer, Matteo and Furnari, Antonino and Farinella, Giovanni Maria and Micheloni, Christian},
  title = {Visual Object Tracking in First Person Vision},
  journal = {International Journal of Computer Vision (IJCV)},
  year = {2022}
}

@inproceedings{dunnhofer2021first,
  author = {Dunnhofer, Matteo and Furnari, Antonino and Farinella, Giovanni Maria and Micheloni, Christian},
  title = {Is First Person Vision Challenging for Object Tracking?},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month = {Oct},
  year = {2021}
}


Research at the University of Udine has been supported by the ACHIEVE-ITN H2020 project. Research at the University of Catania has been supported by MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP: E64118002540007.

Copyright © Machine Learning and Perception Lab - University of Udine - 2021 - 2022
