Video alignment using unsupervised learning of local and global features

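Summary

This paper addresses video alignment: matching the frames of a pair of videos that contain similar actions.
The main challenge is that accurate correspondences must be established even though the two videos differ in how the action is executed and in appearance.
We introduce an unsupervised alignment method that uses both global and local features of the frames.
Specifically, we derive effective features for each video frame using three machine vision tools: person detection, pose estimation, and a VGG network.
The features are then processed and combined to build a multidimensional time series that represents the video.
The resulting time series are used to align videos of the same action with a new variant of dynamic time warping called Diagonalized Dynamic Time Warping (DDTW).
The main advantage of our approach is that it requires no training, so it can be applied to any new type of action without collecting training samples.
In addition, our approach can be used for frame-wise labeling of action phases in a dataset that contains only a few labeled videos.
For evaluation, we considered video synchronization and phase classification tasks on the Penn Action dataset and a subset of the UCF101 dataset.
To evaluate the video synchronization task effectively, we also present a new metric called Enclosed Area Error (EAE).
The results show that our method outperforms previous state-of-the-art methods, such as TCC and other self-supervised and weakly supervised methods.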

Abstract (original)

In this paper, we tackle the problem of video alignment, the process of matching the frames of a pair of videos containing similar actions. The main challenge in video alignment is that accurate correspondence should be established despite the differences in the execution processes and appearances between the two videos. We introduce an unsupervised method for alignment that uses global and local features of the frames. In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and a VGG network. Then the features are processed and combined to construct a multidimensional time series that represents the video. The resulting time series are used to align videos of the same actions using a novel version of dynamic time warping named Diagonalized Dynamic Time Warping (DDTW). The main advantage of our approach is that no training is required, which makes it applicable to any new type of action without any need to collect training samples for it. Additionally, our approach can be used for framewise labeling of action phases in a dataset with only a few labeled videos. For evaluation, we considered video synchronization and phase classification tasks on the Penn Action dataset and a subset of the UCF101 dataset. Also, for an effective evaluation of the video synchronization task, we present a new metric called Enclosed Area Error (EAE). The results show that our method outperforms previous state-of-the-art methods, such as TCC, and other self-supervised and weakly supervised methods.
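
The abstract does not spell out how DDTW differs from ordinary dynamic time warping, so the sketch below shows only the classic DTW baseline that it is described as a variant of. It aligns two multidimensional time series (one feature vector per frame) and returns the frame-to-frame correspondence. The function name `dtw_align`, the Euclidean frame distance, and the toy inputs are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def dtw_align(a, b):
    """Classic DTW between two multidimensional time series.

    a: array of shape (n, d), b: array of shape (m, d).
    Returns (total_cost, path), where path is a list of (i, j) index
    pairs matching frame i of `a` to frame j of `b`.
    NOTE: this is plain DTW, not the paper's DDTW variant.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Pairwise Euclidean distance between every frame of a and b
    # (the choice of distance is an assumption).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # Accumulated cost with a padded border of +inf.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # match both frames
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
            )

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n, m], path


# Toy example: the second sequence repeats its first frame.
cost, path = dtw_align([[0], [1], [2]], [[0], [0], [1], [2]])
print(cost, path)
```

In a setting like the one described, each row of `a` and `b` would be the combined per-frame feature vector (for example, pose and VGG features), and the returned path would give the synchronized frame pairs.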

arXiv information

Authors	Niloufar Fakhfour, Mohammad ShahverdiKondori, Sajjad Hashembeiki, Mohammadjavad Norouzi, Hoda Mohammadzade
Published	2024-09-06 12:09:03+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
