JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts

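Summary

Video action detection (VAD) requires localizing and classifying action instances in video, and video inherently contains diverse information sources such as audio, visual cues, and the surrounding scene context. Using this multimodal information effectively for VAD is a significant challenge, because the model must accurately identify the cues relevant to each action. In this work, we introduce JoVALE (Joint Actor-centric Visual, Audio, Language Encoder), a new multimodal VAD architecture. JoVALE is the first VAD method to integrate audio and visual features with scene-descriptive context obtained from large-capacity image captioning models. At its core is an actor-centric aggregation of audio, visual, and scene-descriptive information, which allows the features that matter for recognizing each actor's actions to be integrated adaptively. We develop a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, designed specifically to capture the dynamic interactions between actors and their multimodal contexts. Evaluation on three prominent VAD benchmarks, AVA, UCF101-24, and JHMDB51-21, shows that incorporating multimodal information substantially improves performance and sets a new state of the art in the field.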

Abstract (original)

Video Action Detection (VAD) entails localizing and categorizing action instances within videos, which inherently consist of diverse information sources such as audio, visual cues, and surrounding scene contexts. Leveraging this multi-modal information effectively for VAD poses a significant challenge, as the model must identify action-relevant cues with precision. In this study, we introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models. At the heart of JoVALE is the actor-centric aggregation of audio, visual, and scene descriptive information, enabling adaptive integration of crucial features for recognizing each actor’s actions. We have developed a Transformer-based architecture, the Actor-centric Multi-modal Fusion Network, specifically designed to capture the dynamic interactions among actors and their multi-modal contexts. Our evaluation on three prominent VAD benchmarks, including AVA, UCF101-24, and JHMDB51-21, demonstrates that incorporating multi-modal information significantly enhances performance, setting new state-of-the-art performances in the field.
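The actor-centric aggregation described above can be illustrated with a minimal sketch. This is not the Actor-centric Multi-modal Fusion Network itself, whose layers the abstract does not specify. It is a single-head scaled dot-product attention in plain Python in which one actor feature acts as the query over context tokens from three modalities. The function names, the toy vectors, and the use of the raw tokens as both keys and values (no learned projections) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def actor_centric_attention(actor_query, context_tokens):
    """One actor attends over context tokens from several modalities.

    actor_query:    list of floats (hypothetical actor feature).
    context_tokens: list of (modality_name, vector) pairs; each vector
                    has the same length as actor_query.
    Returns the fused feature and the attention mass per modality.
    Simplification: tokens serve as both keys and values.
    """
    dim = len(actor_query)
    scores = [
        sum(q * k for q, k in zip(actor_query, vec)) / math.sqrt(dim)
        for _, vec in context_tokens
    ]
    weights = softmax(scores)

    # Weighted sum of the context vectors gives the fused actor feature.
    fused = [
        sum(w * vec[i] for w, (_, vec) in zip(weights, context_tokens))
        for i in range(dim)
    ]

    # Sum attention weights per modality to see which one the actor relies on.
    mass = {}
    for w, (name, _) in zip(weights, context_tokens):
        mass[name] = mass.get(name, 0.0) + w
    return fused, mass


# Toy inputs (made up): one actor and one token per modality.
actor = [1.0, 0.0, 0.0, 0.0]
context = [
    ("visual", [2.0, 0.0, 0.0, 0.0]),
    ("audio", [0.0, 1.0, 0.0, 0.0]),
    ("text", [0.5, 0.0, 1.0, 0.0]),
]
fused, mass = actor_centric_attention(actor, context)
```

With these toy inputs the visual token is most aligned with the actor query, so it receives the largest share of attention, followed by the text token and then the audio token. A real model would add learned query, key, and value projections, multiple heads, and several actors processed in parallel.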

arXiv information

Authors	Taein Son, Soo Won Seo, Jisong Kim, Seok Hwan Lee, Jun Won Choi
Published	2025-02-03 12:27:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL

