Global Prototype Encoding for Incremental Video Highlights Detection

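Summary

Video highlights detection is a long-studied computer vision task: given unseen raw video input, it extracts the clips that appeal to users.
However, most mainstream methods in this line of research are built on a closed-world assumption, in which a fixed number of highlight categories is defined in advance and all training data must be available at the same time.
As a result, they scale poorly with respect to both the number of highlight categories and the size of the dataset.
To tackle this problem, we propose a video highlights detector that can learn incrementally, namely \textbf{G}lobal \textbf{P}rototype \textbf{E}ncoding (GPE), which captures newly defined video highlights in the extended dataset via their corresponding prototypes.
Alongside, we present a well-annotated and costly dataset termed \emph{ByteFood}, which includes more than 5.1k gourmet videos belonging to four domains: \emph{cooking}, \emph{eating}, \emph{food material}, and \emph{presentation}.
To the best of our knowledge, this is the first time an incremental learning setting has been introduced to video highlights detection, which in turn relieves the burden of training on video inputs and promotes the scalability of conventional neural networks in proportion to both the size of the dataset and the number of domains.
Moreover, the proposed GPE surpasses current incremental learning methods on \emph{ByteFood}, reporting an improvement of at least 1.57\% mAP.
The code and dataset will be made available soon.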

Summary (original)

Video highlights detection has been long researched as a topic in computer vision tasks, digging the user-appealing clips out given unexposed raw video inputs. However, in most case, the mainstream methods in this line of research are built on the closed world assumption, where a fixed number of highlight categories is defined properly in advance and need all training data to be available at the same time, and as a result, leads to poor scalability with respect to both the highlight categories and the size of the dataset. To tackle the problem mentioned above, we propose a video highlights detector that is able to learn incrementally, namely \textbf{G}lobal \textbf{P}rototype \textbf{E}ncoding (GPE), capturing newly defined video highlights in the extended dataset via their corresponding prototypes. Alongside, we present a well annotated and costly dataset termed \emph{ByteFood}, including more than 5.1k gourmet videos belongs to four different domains which are \emph{cooking}, \emph{eating}, \emph{food material}, and \emph{presentation} respectively. To the best of our knowledge, this is the first time the incremental learning settings are introduced to video highlights detection, which in turn relieves the burden of training video inputs and promotes the scalability of conventional neural networks in proportion to both the size of the dataset and the quantity of domains. Moreover, the proposed GPE surpasses current incremental learning methods on \emph{ByteFood}, reporting an improvement of 1.57\% mAP at least. The code and dataset will be made available sooner.
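
The abstract does not spell out how GPE builds or uses its prototypes, so the following is only a minimal sketch of the general idea of prototype-based incremental classification, not the authors' method. It assumes each class is summarized by the mean of its feature vectors and that a query is assigned to the nearest prototype; the class name `PrototypeClassifier`, its methods, and the two-dimensional toy features are all hypothetical and chosen for illustration.

```python
from math import sqrt


class PrototypeClassifier:
    """Generic nearest-prototype classifier (illustrative, not GPE itself)."""

    def __init__(self):
        # Maps each class label to its prototype (mean feature vector).
        self.prototypes = {}

    def add_class(self, label, features):
        """Register a class from a list of equal-length feature vectors."""
        n = len(features)
        dim = len(features[0])
        self.prototypes[label] = [
            sum(f[i] for f in features) / n for i in range(dim)
        ]

    def predict(self, feature):
        """Return the label whose prototype is closest in Euclidean distance."""
        def distance(prototype):
            return sqrt(sum((a - b) ** 2 for a, b in zip(feature, prototype)))

        return min(self.prototypes, key=lambda lbl: distance(self.prototypes[lbl]))


# Initial training stage: two domains are known (toy features, assumed).
clf = PrototypeClassifier()
clf.add_class("cooking", [[1.0, 0.0], [0.8, 0.2]])
clf.add_class("eating", [[0.0, 1.0], [0.2, 0.8]])
first_prediction = clf.predict([0.9, 0.1])

# Incremental stage: a new domain arrives; earlier prototypes are left untouched.
clf.add_class("presentation", [[1.0, 1.0], [0.9, 1.1]])
new_prediction = clf.predict([1.0, 0.9])
old_prediction = clf.predict([0.9, 0.1])
```

In this sketch, adding a class only inserts one new prototype, so the earlier training data does not need to be revisited. That is the scalability property the abstract attributes to the incremental setting, though the actual GPE encoding may differ substantially.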

arXiv information

Authors	Sen Pei, Shixiong Xu, Ye Yuan, Xiaojie Jin
Published	2022-09-14 10:33:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
