Cross-Modal Language-Video Attention for Text-Video Retrieval

Satya Krishna Gorti1*
Noël Vouitsis1,2*
Junwei Ma1*
Keyvan Golestan1
Maksims Volkovs1
Animesh Garg2,3,4
Guangwei Yu1
1Layer 6 AI
2University of Toronto
3Vector Institute
*Authors contributed equally to this work

CVPR 2022



In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video that ranks relevant text-video pairs higher than irrelevant pairs. However, videos inherently express a much wider gamut of information than texts. Instead, texts often capture sub-regions of entire videos and are most semantically similar to certain frames within videos. Therefore, for a given text, a retrieval model should focus on the text’s most semantically similar video sub-regions to make a more relevant comparison. Yet, most existing works aggregate entire videos without directly considering text. Common text-agnostic aggregation schemes include mean-pooling or self-attention over the frames, but these are likely to encode misleading visual information not described in the given text. To address this, we propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video. Our core mechanism is a scaled dot-product attention for a text to attend to its most semantically similar frames. We then generate an aggregated video representation conditioned on the text’s attention weights over the frames. We evaluate our method on three benchmark datasets, MSR-VTT, MSVD and LSMDC, achieving new state-of-the-art results with up to 12% relative improvement in Recall@1. Our findings thereby highlight the importance of joint text-video reasoning to extract important visual cues according to text.
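The core mechanism described above, a text attending to frames through scaled dot-product attention followed by a weighted aggregation of those frames, can be sketched roughly as follows. This is a minimal NumPy illustration and not the released X-Pool implementation: the function name `text_conditioned_pool` is hypothetical, the text and frame embeddings are assumed to already live in a shared space of dimension `d`, and no learned projections or normalization layers are included because the text above does not describe them.

```python
import numpy as np


def text_conditioned_pool(text_emb, frame_embs):
    """Aggregate frame embeddings using the text as the attention query.

    text_emb:   array of shape (d,), one text embedding (assumed input)
    frame_embs: array of shape (n, d), one embedding per frame (assumed input)
    Returns (pooled, weights) with shapes (d,) and (n,).
    """
    d = text_emb.shape[-1]
    # Scaled dot-product score between the text and every frame.
    scores = frame_embs @ text_emb / np.sqrt(d)
    # Softmax over frames, shifted by the max score for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Video representation: frames averaged with the text's attention weights.
    pooled = weights @ frame_embs
    return pooled, weights


# Toy usage: four one-hot "frames" and a text aligned with frame 2.
frames = np.eye(4)
text = np.array([0.0, 0.0, 5.0, 0.0])
pooled, weights = text_conditioned_pool(text, frames)
```

When the text carries no preference (all scores equal), the weights become uniform and the result reduces to mean-pooling, which is the text-agnostic baseline mentioned above.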

In the following demo, we first show a text-video pair from the MSR-VTT dataset's 1K-A test split. We then show the attention weights that X-Pool assigns to the frames of the video given the text as input. Finally, we show the top-5 retrieved videos from the MSR-VTT test set when using the text as a query, and outline the ground-truth video in green if it appears in the top-5 candidates.

a person is connecting something to system


Attention Weights Between the Text and the Video's Frames

Top-5 Retrieved Videos Given the Text as a Query
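
As a rough illustration of how a top-5 list and the Recall@K metric could be computed, the sketch below ranks candidate videos for one text query and then scores a set of rankings. It is not the evaluation code behind the reported results: `top_k_videos` and `recall_at_k` are hypothetical helpers, cosine similarity is an assumed choice of similarity function, and each row of `video_embs` is assumed to be an already aggregated video embedding (for a text-conditioned model, aggregated for this particular query).

```python
import numpy as np


def top_k_videos(text_emb, video_embs, k=5):
    """Return indices and similarities of the k best-matching videos.

    text_emb:   array of shape (d,), the query text embedding
    video_embs: array of shape (m, d), one aggregated embedding per video
    """
    # Cosine similarity (an assumption): normalize both sides, then dot.
    text_unit = text_emb / np.linalg.norm(text_emb)
    video_unit = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = video_unit @ text_unit
    # Sort by descending similarity and keep the first k candidates.
    order = np.argsort(-sims, kind="stable")[:k]
    return order, sims[order]


def recall_at_k(rankings, ground_truth, k=1):
    """Fraction of queries whose ground-truth video is in the top k."""
    hits = [gt in ranked[:k] for ranked, gt in zip(rankings, ground_truth)]
    return float(np.mean(hits))


# Toy usage: four candidate videos in a 2-D embedding space.
videos = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
ranked, scores = top_k_videos(query, videos, k=2)
```

In the demo, the ground-truth video is outlined in green whenever its index appears in the returned top-5 list, which corresponds to a hit at K = 5 for that query.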