Video Referring Expression Comprehension via Transformer with Content-aware Query

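Abstract

Video Referring Expression Comprehension (REC) aims to localize the target object in video frames that a natural language expression refers to. Recently, Transformer-based methods have greatly raised the performance limit. However, we argue that the current query design is suboptimal and suffers from two drawbacks: 1) slow training convergence; 2) a lack of fine-grained alignment. To alleviate this, we aim to couple the purely learnable queries with content information. Specifically, we set up a fixed number of learnable bounding boxes across the frame and use the aligned region features to provide useful clues. In addition, we explicitly link certain phrases in the sentence to the semantically relevant visual areas. To this end, we introduce two new datasets (i.e., VID-Entity and VidSTG-Entity) by augmenting the VIDSentence and VidSTG datasets, respectively, with the explicitly referred words in the sentence. This allows finer cross-modal alignment at the region-phrase level and yields more detailed feature representations. Incorporating these two designs, our proposed model (ContFormer) achieves state-of-the-art performance on widely benchmarked datasets. For example, on the VID-Entity dataset, ContFormer achieves an 8.75% absolute improvement on Accu.@0.6 over the previous SOTA.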
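The idea of coupling learnable queries with content information can be illustrated with a small sketch. This is not the ContFormer implementation: the function names, the mean pooling used in place of a proper region-alignment operator, and the choice to fuse by element-wise addition are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x1, y1, x2, y2) in feature-map cells, x2/y2 exclusive
FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]


def pool_region(feature_map: FeatureMap, box: Box) -> List[float]:
    """Average the feature vectors of every cell inside the box.

    Assumption: plain mean pooling stands in for an aligned-region operator.
    """
    x1, y1, x2, y2 = box
    cells = [feature_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    if not cells:
        raise ValueError("box covers no feature-map cells")
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]


def content_aware_queries(
    feature_map: FeatureMap,
    boxes: Sequence[Box],
    learnable_queries: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Fuse each learnable query with the features pooled from its box.

    Assumption: fusion is element-wise addition; the source only says the
    learnable queries are coupled with content information.
    """
    return [
        [q + r for q, r in zip(query, pool_region(feature_map, box))]
        for query, box in zip(learnable_queries, boxes)
    ]


# Toy 2x2 feature map with 2 channels and two hypothetical boxes.
toy_map = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
toy_boxes = [(0, 0, 2, 2), (0, 0, 1, 1)]
toy_queries = [[0.5, 0.5], [0.0, 0.0]]
fused = content_aware_queries(toy_map, toy_boxes, toy_queries)
```

In a trained model the box coordinates and query vectors would be parameters updated by gradient descent; here they are fixed values so the data flow is easy to follow.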

Abstract (original)

Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred by the natural language expression. Recently, the Transformer-based methods have greatly boosted the performance limit. However, we argue that the current query design is suboptimal and suffers from two drawbacks: 1) the slow training convergence process; 2) the lack of fine-grained alignment. To alleviate this, we aim to couple the pure learnable queries with the content information. Specifically, we set up a fixed number of learnable bounding boxes across the frame and the aligned region features are employed to provide fruitful clues. Besides, we explicitly link certain phrases in the sentence to the semantically relevant visual areas. To this end, we introduce two new datasets (i.e., VID-Entity and VidSTG-Entity) by augmenting the VIDSentence and VidSTG datasets with the explicitly referred words in the whole sentence, respectively. Benefiting from this, we conduct the fine-grained cross-modal alignment at the region-phrase level, which ensures more detailed feature representations. Incorporating these two designs, our proposed model (dubbed as ContFormer) achieves the state-of-the-art performance on widely benchmarked datasets. For example, on the VID-Entity dataset, compared to the previous SOTA, ContFormer achieves 8.75% absolute improvement on Accu.@0.6.
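The reported metric can be made concrete with a short sketch. It assumes that Accu.@0.6 is the fraction of predicted boxes whose intersection-over-union (IoU) with the ground-truth box reaches 0.6; the abstract does not define the metric, so the definition, the inclusive threshold, and the function names here are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.6) -> float:
    """Fraction of predictions whose IoU with the ground truth is at least `threshold`.

    Assumption: the comparison is inclusive (>=); a benchmark may use a strict one.
    """
    if not preds:
        raise ValueError("no predictions to score")
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) >= threshold)
    return hits / len(preds)


# Two toy predictions: one exact match, one with little overlap.
toy_preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
toy_gts = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 15.0)]
score = accuracy_at(toy_preds, toy_gts)
```

Under this reading, an 8.75% absolute improvement means the share of predictions clearing the 0.6 IoU threshold rises by 8.75 percentage points.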

arXiv information

Authors	Ji Jiang,Meng Cao,Tengtao Song,Yuexian Zou
Published	2022-10-06 14:45:41+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
