TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection

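Summary

Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks that aim to obtain the relevant moments within a video and a highlight score for each video clip. Recently, several methods have built DETR-based networks to solve MR and HD jointly. These methods achieve good performance simply by adding two separate task heads after multi-modal feature extraction and feature interaction. However, these approaches do not fully exploit the reciprocal relationship between the two tasks. This paper proposes a task-reciprocal transformer based on DETR (TR-DETR). Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Next, a visual feature refinement module is designed to remove query-irrelevant information from the visual features for modal interaction. Finally, a task cooperation module is built that uses the reciprocity between MR and HD to refine the retrieval pipeline and the highlight score prediction process. Comprehensive experiments on the QVHighlights, Charades-STA, and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Code is available at https://github.com/mingyao1120/TR-DETR.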

Summary (original)

Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks, which aim to obtain relevant moments within videos and highlight scores for each video clip. Recently, several methods have been devoted to building DETR-based networks that solve MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, achieving good performance. Nevertheless, these approaches underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement module is designed to eliminate query-irrelevant information from visual features for modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on the QVHighlights, Charades-STA, and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Code is available at https://github.com/mingyao1120/TR-DETR.

arXiv information

Authors	Hao Sun, Mingyao Zhou, Wenjing Chen, Wei Xie
Published	2024-01-04 14:55:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
