ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Abstract

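Text-based video segmentation is a challenging task: segmenting the objects in a video that a natural-language expression refers to.
It requires both semantic comprehension and fine-grained video understanding.
Existing methods introduce the language representation into the segmentation model in a bottom-up manner, so vision-language interaction takes place only within the local receptive fields of ConvNets.
We argue that such interaction is insufficient, because a model given only partial observations can barely build region-level relationships, which runs counter to the descriptive logic of natural-language referring expressions.
In fact, people usually describe a target object through its relations to other objects, and these relations can be hard to understand without seeing the whole video.
To address this issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance.
We first identify all candidate objects in the video, then select the referred object by parsing the relations among these high-level objects.
To understand the relations precisely, we investigate three kinds of object-level relations: positional relations, text-guided semantic relations, and temporal relations.
Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin.
Qualitative results also show that our results are more explainable.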
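
The top-down idea (find whole-object candidates first, then pick the one the text refers to) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the feature vectors, and the text embedding are all hypothetical, and mean pooling over frames with cosine similarity is only a stand-in for the positional, text-guided semantic, and temporal relation modeling named in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_referred_object(candidates, text_embedding):
    """Score every candidate object against the text and return the best match.

    candidates: dict mapping an object id to a list of per-frame feature vectors
                (hypothetical; assumed to come from some upstream object detector/tracker).
    text_embedding: feature vector for the referring expression (hypothetical).
    """
    scores = {}
    for obj_id, frame_features in candidates.items():
        # Average the per-frame features into one video-level descriptor.
        # This is a crude placeholder for temporal aggregation.
        n = len(frame_features)
        pooled = [sum(column) / n for column in zip(*frame_features)]
        scores[obj_id] = cosine(pooled, text_embedding)
    best = max(scores, key=scores.get)
    return best, scores


# Toy data: two candidate objects tracked over two frames each.
candidates = {
    "person": [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2]],
    "dog": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.3]],
}
text_embedding = [0.1, 1.0, 0.2]  # assumed embedding of a sentence describing the dog

best, scores = select_referred_object(candidates, text_embedding)
```

The point of the sketch is the ordering of steps: selection operates on complete object candidates seen across the whole video, rather than fusing language into local image features pixel by pixel.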

Abstract (original)

Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within local receptive fields of ConvNets. We argue that such interaction is not fulfilled since the model can barely construct region-level relationships given partial observations, which is contrary to the description logic of natural language/referring expressions. In fact, people usually describe a target object using relations with other objects, which may not be easily understood without seeing the whole video. To address the issue, we introduce a novel top-down approach by imitating how we human segment an object with the language guidance. We first figure out all candidate objects in videos and then choose the refereed one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relation, text-guided semantic relation, and temporal relation. Extensive experiments on A2D Sentences and J-HMDB Sentences show our method outperforms state-of-the-art methods by a large margin. Qualitative results also show our results are more explainable.

arXiv information

Authors	Chen Liang, Yu Wu, Yawei Luo, Yi Yang
Published	2024-01-19 14:43:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
