Everything at Once – Multi-modal Fusion Transformer for Video Retrieval

CVPR

Authors

Rogerio Feris
James Glass
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Nina Shvetsova
Brian Chen
David Harwath
Hildegard Kuehne

Published on

12/08/2021

Multi-modal learning from video data has seen increased attention recently because it allows semantically meaningful embeddings to be trained without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation, yielding an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, covering single modalities as well as pairs of modalities, while explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow it to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.
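
To make the idea of a combinatorial loss over single modalities and pairs of modalities more concrete, the following is a minimal sketch and not the paper's implementation. It assumes three modalities, a symmetric InfoNCE-style contrastive loss, and a particular set of pairings (each single modality against each other single modality, and each single modality against the fusion of the other two). The names `contrastive_loss`, `toy_fuse`, and `combinatorial_loss` are hypothetical, and `toy_fuse` is a simple element-wise mean standing in for the fusion transformer.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    norm = math.sqrt(dot(u, u)) or 1.0  # avoid division by zero
    return [a / norm for a in u]


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def contrastive_loss(xs, ys, temperature=0.05):
    """Symmetric InfoNCE-style loss (an assumed choice of contrastive loss).

    Row i of xs is the positive match for row i of ys; all other rows in
    the batch act as negatives.
    """
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    n = len(xs)
    sim = [[dot(x, y) / temperature for y in ys] for x in xs]
    loss = 0.0
    for i in range(n):
        row = sim[i]                         # x_i against every y
        col = [sim[j][i] for j in range(n)]  # y_i against every x
        loss += logsumexp(row) - sim[i][i]
        loss += logsumexp(col) - sim[i][i]
    return loss / (2 * n)


def toy_fuse(xs, ys):
    """Hypothetical stand-in for the fusion transformer: element-wise mean."""
    return [[(a + b) / 2 for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def combinatorial_loss(embeddings, fuse=toy_fuse):
    """Sum contrastive terms over single and pairwise-fused modalities.

    embeddings maps a modality name to a batch (list) of vectors.
    Returns the total loss and the list of (left, right) terms used.
    """
    names = sorted(embeddings)
    total, terms = 0.0, []
    for a, b in combinations(names, 2):
        # Single modality against single modality.
        total += contrastive_loss(embeddings[a], embeddings[b])
        terms.append((a, b))
        # Each remaining single modality against the fused pair.
        fused = fuse(embeddings[a], embeddings[b])
        for c in names:
            if c not in (a, b):
                total += contrastive_loss(embeddings[c], fused)
                terms.append((c, a + "+" + b))
    return total, terms


if __name__ == "__main__":
    batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    toy = {"text": batch, "video": batch, "audio": batch}
    total, terms = combinatorial_loss(toy)
    print(terms)
    print(total)
```

With three modalities this sketch produces six terms: three between single modalities and three between a single modality and a fused pair. Passing a different `fuse` callable is where a learned fusion model would be substituted for the toy mean.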

Please cite our work using the BibTeX below.

@misc{https://doi.org/10.48550/arxiv.2112.04446,
  doi = {10.48550/ARXIV.2112.04446},
  url = {https://arxiv.org/abs/2112.04446},
  author = {Shvetsova, Nina and Chen, Brian and Rouditchenko, Andrew and Thomas, Samuel and Kingsbury, Brian and Feris, Rogerio and Harwath, David and Glass, James and Kuehne, Hilde},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering},
  title = {Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}