OSMO: Open-vocabulary Self-eMOtion Tracking

1Meta Reality Labs 2École Polytechnique Fédérale de Lausanne (EPFL)
Work done during an internship at Meta.
Teaser

OSIRIS tracks personalized emotions over time using an egocentric feed from smart glasses.

Abstract

We introduce the novel task of egocentric self-emotion tracking, which aims to infer an individual's evolving emotions from egocentric multimodal streams such as voice, visual surroundings, semantic subtext, and eye-tracking signals.

To establish this research direction, we present: (1) the OSMO dataset, a large-scale annotation effort over 110 hours of existing bilingual smart-glasses recordings, yielding the largest egocentric emotion dataset and the first with subject-wise emotion timelines; (2) the OSMO benchmark, a suite of five tasks (emotion recognition, sentiment, intensity, localization, and reasoning) that redefines emotion understanding as a continuous, context-aware process rather than discrete classification of trimmed videos; (3) OSIRIS, a large multimodal model that tracks emotions over time by reasoning over the user's personal emotion history, current expressions, and egocentric observations.
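To make the timeline framing concrete, here is a minimal Python sketch of what a subject-wise emotion timeline and a localization-style point-in-time query could look like. All names, fields, and scales below (EmotionSegment, emotion_at, the 0-1 intensity range) are hypothetical illustrations, not the released OSMO annotation schema.

from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class EmotionSegment:
    start_s: float      # segment start within the recording, in seconds
    end_s: float        # segment end, in seconds
    emotion: str        # open-vocabulary label, e.g. "frustrated"
    sentiment: str      # coarse polarity: "positive" / "neutral" / "negative"
    intensity: float    # annotated strength, illustrated here on a 0-1 scale
    rationale: str = "" # free-text reasoning behind the annotation

@dataclass
class EmotionTimeline:
    subject_id: str
    segments: list[EmotionSegment] = field(default_factory=list)

    def emotion_at(self, t_s: float) -> str | None:
        """Localization-style query: which emotion is active at time t_s?"""
        for seg in self.segments:
            if seg.start_s <= t_s < seg.end_s:
                return seg.emotion
        return None

# Example: a two-segment timeline and a point query.
timeline = EmotionTimeline(
    subject_id="subject_01",
    segments=[
        EmotionSegment(0.0, 42.5, "calm", "neutral", 0.3),
        EmotionSegment(42.5, 95.0, "excited", "positive", 0.8),
    ],
)
print(timeline.emotion_at(60.0))  # -> "excited"

Under this framing, recognition, sentiment, and intensity read attributes of a segment, localization recovers segment boundaries, and reasoning targets the free-text rationale.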

Extensive evaluations show that OSIRIS achieves state-of-the-art performance, delivering, for the first time, coherent emotion timelines from egocentric data. The dataset, model, and code will be fully open-sourced upon publication.

Citation

If you find this work useful, please cite:

@inproceedings{abdelfattah2026osmo,
  title={OSMO: Open-vocabulary Self-eMOtion Tracking},
  author={Abdelfattah, Mohamed and Tekin, Bugra and Sener, Fadime and Camgoz, Necati Cihan and Sauser, Eric and Ma, Shugao and Alahi, Alexandre and Remelli, Edoardo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}