Pre-training for Action Recognition with Automatically Generated Fractal Datasets

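Abstract

Interest in synthetic data has grown in recent years, particularly for pre-training on the image modality in support of computer vision tasks such as object classification and medical imaging.
Previous work has shown that synthetic samples, produced automatically by various generative processes, can replace their real counterparts and still yield strong visual representations.
This approach resolves problems associated with real data, such as collection and labeling costs, copyright, and privacy.
We extend this trend to the video domain and apply it to the task of action recognition.
Using fractal geometry, we present methods for automatically generating large-scale datasets of short synthetic video clips that can be used to pre-train neural models.
The generated clips show notable variety, which stems from the innate ability of fractals to produce complex multi-scale structures.
To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training.
Through thorough ablations, we determine which attributes strengthen downstream results and offer general guidelines for pre-training with synthetic videos.
We evaluate the approach by fine-tuning the pre-trained models on the established action recognition datasets HMDB51 and UCF101, as well as on four other video benchmarks covering group action recognition, fine-grained action recognition, and dynamic scenes.
Compared with standard Kinetics pre-training, our results come close and are even superior on a portion of the downstream datasets.
Code and samples of synthetic videos are available at https://github.com/davidsvy/fractal_video.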

Abstract (original)

In recent years, interest in synthetic data has grown, particularly in the context of pre-training the image modality to support a range of computer vision tasks, including object classification, medical imaging etc. Previous work has demonstrated that synthetic samples, automatically produced by various generative processes, can replace real counterparts and yield strong visual representations. This approach resolves issues associated with real data such as collection and labeling costs, copyright and privacy. We extend this trend to the video domain applying it to the task of action recognition. Employing fractal geometry, we present methods to automatically produce large-scale datasets of short synthetic video clips, which can be utilized for pre-training neural models. The generated video clips are characterized by notable variety, stemmed by the innate ability of fractals to generate complex multi-scale structures. To narrow the domain gap, we further identify key properties of real videos and carefully emulate them during pre-training. Through thorough ablations, we determine the attributes that strengthen downstream results and offer general guidelines for pre-training with synthetic videos. The proposed approach is evaluated by fine-tuning pre-trained models on established action recognition datasets HMDB51 and UCF101 as well as four other video benchmarks related to group action recognition, fine-grained action recognition and dynamic scenes. Compared to standard Kinetics pre-training, our reported results come close and are even superior on a portion of downstream datasets. Code and samples of synthetic videos are available at https://github.com/davidsvy/fractal_video .
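
The abstract does not spell out how the fractal clips are produced, so the following is only a minimal sketch of one plausible way to animate a fractal. It assumes an iterated function system (IFS) rendered with the chaos game, with the affine parameters linearly interpolated between two random systems over the length of the clip. All function names, parameter ranges, and sizes here are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
import random


def random_ifs(rng, n_maps=3, scale=0.45):
    """Sample affine maps (a, b, c, d, e, f) for x' = a*x + b*y + e, y' = c*x + d*y + f.

    Assumption: linear coefficients are drawn from [-scale, scale] with
    scale < 0.5, so |a| + |b| < 1 and |c| + |d| < 1 and every map is contractive.
    """
    return [
        tuple(rng.uniform(-scale, scale) for _ in range(4))
        + (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        for _ in range(n_maps)
    ]


def lerp_ifs(ifs_a, ifs_b, t):
    """Linearly interpolate two IFS parameter sets (t=0 gives ifs_a, t=1 gives ifs_b)."""
    return [
        tuple((1.0 - t) * p + t * q for p, q in zip(map_a, map_b))
        for map_a, map_b in zip(ifs_a, ifs_b)
    ]


def chaos_game(ifs, rng, n_points=2000, burn_in=20):
    """Approximate the IFS attractor by repeatedly applying a randomly chosen map."""
    x, y = 0.0, 0.0
    points = []
    for i in range(n_points + burn_in):
        a, b, c, d, e, f = rng.choice(ifs)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= burn_in:  # discard the first iterates, which may lie off the attractor
            points.append((x, y))
    return points


def render_clip(n_frames=8, size=32, seed=0):
    """Return a clip as a list of n_frames binary frames, each size x size."""
    rng = random.Random(seed)
    ifs_start, ifs_end = random_ifs(rng), random_ifs(rng)
    steps = max(n_frames - 1, 1)
    clouds = [
        chaos_game(lerp_ifs(ifs_start, ifs_end, k / steps), rng)
        for k in range(n_frames)
    ]
    # Use one bounding box for the whole clip so the motion is not normalized away.
    xs = [x for cloud in clouds for x, _ in cloud]
    ys = [y for cloud in clouds for _, y in cloud]
    x_min, y_min = min(xs), min(ys)
    span = max(max(xs) - x_min, max(ys) - y_min) or 1.0
    clip = []
    for cloud in clouds:
        frame = [[0] * size for _ in range(size)]
        for x, y in cloud:
            col = min(int((x - x_min) / span * size), size - 1)
            row = min(int((y - y_min) / span * size), size - 1)
            frame[row][col] = 1
        clip.append(frame)
    return clip


clip = render_clip()
```

Because a convex combination of maps that satisfy the row-sum bound also satisfies it, every interpolated frame is still drawn from a contractive system. In a pre-training setup of the kind the abstract describes, each randomly sampled system could plausibly serve as its own class label, though that labeling scheme is likewise an assumption here.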

arXiv information

Authors	Davyd Svyezhentsev, George Retsinas, Petros Maragos
Published	2024-11-26 16:51:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
