Bridging Images and Videos: A Simple Learning Framework for Large Vocabulary Video Object Detection

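Abstract

Scaling object taxonomies is an important step toward robust real-world deployment of recognition systems.
Image recognition has made remarkable progress since the introduction of the LVIS benchmark.
To carry this success over to video, a new video benchmark, TAO, was recently introduced.
Given recent encouraging results from both the detection and tracking communities, we are interested in combining these two advances to build a strong large-vocabulary video tracker.
However, supervision in LVIS and TAO is inherently sparse or even missing, which poses two new challenges for training large-vocabulary trackers.
First, LVIS has no tracking supervision, so detection (learned from LVIS and TAO) and tracking (learned from TAO only) are trained inconsistently.
Second, detection supervision in TAO is partial, so LVIS categories that are absent from TAO are catastrophically forgotten during video fine-tuning.
To resolve these challenges, we present a simple but effective learning framework that takes full advantage of all available training data to learn detection and tracking without losing the ability to recognize any LVIS category.
We show that this learning scheme consistently improves various large-vocabulary trackers and sets strong baseline results on the challenging TAO benchmark.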

Abstract (original)

Scaling object taxonomies is one of the important steps toward a robust real-world deployment of recognition systems. We have faced remarkable progress in images since the introduction of the LVIS benchmark. To continue this success in videos, a new video benchmark, TAO, was recently presented. Given the recent encouraging results from both detection and tracking communities, we are interested in marrying those two advances and building a strong large vocabulary video tracker. However, supervisions in LVIS and TAO are inherently sparse or even missing, posing two new challenges for training the large vocabulary trackers. First, no tracking supervisions are in LVIS, which leads to inconsistent learning of detection (with LVIS and TAO) and tracking (only with TAO). Second, the detection supervisions in TAO are partial, which results in catastrophic forgetting of absent LVIS categories during video fine-tuning. To resolve these challenges, we present a simple but effective learning framework that takes full advantage of all available training data to learn detection and tracking while not losing any LVIS categories to recognize. With this new learning scheme, we show that consistent improvements of various large vocabulary trackers are capable, setting strong baseline results on the challenging TAO benchmarks.

arXiv information

Authors	Sanghyun Woo, Kwanyong Park, Seoung Wug Oh, In So Kweon, Joon-Young Lee
Published	2022-12-20 10:33:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
