A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension

Summary

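Title: A large-scale cross-modal video retrieval dataset with reading comprehension
Summary:
– Text is widely used in human environments and is frequently needed to understand video.
– To study the retrieval of videos that carry both visual and textual semantic representations, the authors introduce TextVR, a large-scale cross-modal video retrieval dataset with text reading comprehension.
– TextVR contains 42.2k sentence queries for 10.5k videos across 8 scenario domains (Street View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV Show, and Cooking).
– TextVR requires one unified cross-modal model to recognize and comprehend text, relate it to the visual context, and decide which semantic information is vital for the video retrieval task.
– The authors also present a detailed analysis of TextVR compared with existing datasets and design a new multimodal video retrieval baseline for the text-based video retrieval task.
– Dataset analysis and extensive experiments show that the TextVR benchmark offers the video-and-language community many new technical challenges and insights beyond previous datasets.
– The project website and GitHub repository are https://sites.google.com/view/loveucvpr23/guest-track and https://github.com/callsys/TextVR, respectively.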

Summary (original)

Most existing cross-modal language-to-video retrieval (VR) research focuses on a single modality of input from video, i.e., the visual representation, even though text is omnipresent in human environments and frequently critical to understanding video. To study how to retrieve video with both modal inputs, i.e., visual and text semantic representations, we introduce a large-scale cross-modal Video Retrieval dataset with text reading comprehension, TextVR, which contains 42.2k sentence queries for 10.5k videos across 8 scenario domains, i.e., Street View (indoor), Street View (outdoor), Games, Sports, Driving, Activity, TV Show, and Cooking. TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task. In addition, we present a detailed analysis of TextVR compared with existing datasets and design a novel multimodal video retrieval baseline for the text-based video retrieval task. The dataset analysis and extensive experiments show that our TextVR benchmark provides the video-and-language community with many new technical challenges and insights compared with previous datasets. The project website and GitHub repo can be found at https://sites.google.com/view/loveucvpr23/guest-track and https://github.com/callsys/TextVR, respectively.
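
The retrieval setting described above can be illustrated with a small sketch. This is not the authors' baseline: the weighted-average fusion, the `text_weight` parameter, the function names, and the toy embeddings are all assumptions made for illustration. It only shows why a query may need the text visible in a video, and not just the visual content, to find the right video.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse(visual, scene_text, text_weight=0.5):
    """Hypothetical fusion: weighted average of visual and scene-text embeddings."""
    return [(1.0 - text_weight) * v + text_weight * t
            for v, t in zip(visual, scene_text)]


def rank_videos(query, videos, text_weight=0.5):
    """Return video ids ordered from most to least similar to the query."""
    scored = [
        (cosine(query, fuse(v["visual"], v["scene_text"], text_weight)), vid)
        for vid, v in videos.items()
    ]
    # Sort by descending score; break ties by id so the order is deterministic.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [vid for _, vid in scored]


def recall_at_k(queries, videos, k, text_weight=0.5):
    """Fraction of queries whose target video appears in the top-k results."""
    hits = sum(
        1 for q in queries
        if q["target"] in rank_videos(q["embedding"], videos, text_weight)[:k]
    )
    return hits / len(queries)


# Toy data (made up): the two videos look identical visually
# and differ only in the text that appears in them.
videos = {
    "street": {"visual": [1.0, 0.0, 0.0], "scene_text": [0.0, 1.0, 0.0]},
    "cooking": {"visual": [1.0, 0.0, 0.0], "scene_text": [0.0, 0.0, 1.0]},
}
queries = [
    {"embedding": [1.0, 1.0, 0.0], "target": "street"},
    {"embedding": [1.0, 0.0, 1.0], "target": "cooking"},
]

visual_only = recall_at_k(queries, videos, k=1, text_weight=0.0)
with_text = recall_at_k(queries, videos, k=1, text_weight=0.5)
print(visual_only, with_text)
```

In this toy setup, visual-only ranking cannot tell the two videos apart, so Recall@1 depends on the tie-break order, while including the scene-text embedding ranks the correct video first for both queries.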

arXiv information

Authors	Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, Xiang Bai
Published	2023-05-05 08:00:14+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
