Unsupervised Open-Vocabulary Object Localization in Videos

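Summary

This paper shows that recent advances in video representation learning and pre-trained vision-language models substantially improve self-supervised video object localization.
We propose a method that first localizes objects in a video with an object-centric approach based on slot attention, and then assigns text to the resulting slots.
The text assignment is achieved with an unsupervised procedure that reads localized semantic information from a pre-trained CLIP model.
The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach to yield good results on regular video benchmarks.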

Summary (original)

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
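The second stage described above, assigning text to slots, can be illustrated with a toy nearest-text lookup. This is only a minimal sketch under stated assumptions, not the authors' procedure: it assumes each slot has already been mapped to a feature vector in the same embedding space as the text labels, and the names `cosine`, `assign_labels`, `slots`, and `texts` are hypothetical. The small hand-written vectors stand in for real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_labels(slot_features, text_embeddings):
    """Give each slot the label whose embedding is most similar to it.

    slot_features: list of vectors, one per slot (hypothetical input).
    text_embeddings: dict mapping label -> vector in the same space.
    """
    labels = []
    for slot in slot_features:
        best = max(
            text_embeddings,
            key=lambda name: cosine(slot, text_embeddings[name]),
        )
        labels.append(best)
    return labels


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
slots = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
texts = {
    "dog": [1.0, 0.0, 0.0],
    "car": [0.0, 0.0, 1.0],
    "tree": [0.0, 1.0, 0.0],
}

print(assign_labels(slots, texts))  # prints ['dog', 'car']
```

Cosine similarity is used here because it is the usual way to compare vectors in a joint image-text embedding space; how the paper actually extracts localized features from CLIP is not specified in the abstract.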

arXiv information

Authors	Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He
Published	2024-06-26 16:26:08+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
