LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Summary

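Current Large Language Vision Models (LLVMs) trained on web videos perform well at general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and the view-invariant representation learning that Activities of Daily Living (ADL) require.
This limitation stems from the lack of specialized ADL video instruction-tuning datasets and from insufficient integration of modalities to capture discriminative action representations.
To address this, we propose a semi-automated framework for curating ADL datasets and use it to create ADL-X, a multiview, multimodal RGBS instruction-tuning dataset.
We also introduce LLAVIDAL, an LLVM that integrates videos, 3D skeletons, and HOIs to model the complex spatiotemporal relationships in ADL.
When training LLAVIDAL, simply aligning all modalities jointly yields suboptimal results.
We therefore propose a Multimodal Progressive (MMPro) training strategy that incorporates the modalities in stages, following a curriculum.
We also establish ADL MCQ and video description benchmarks to assess LLVM performance on ADL tasks.
Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks.
Code and data will be made publicly available at https://adl-x.github.io/.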

Summary (original)

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL’s complex spatiotemporal relationships. For training LLAVIDAL a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at: https://adl-x.github.io/.

arXiv information

Authors	Dominick Reilly,Rajatsubhra Chakraborty,Arkaprava Sinha,Manish Kumar Govind,Pu Wang,Francois Bremond,Le Xue,Srijan Das
Published	2024-12-12 18:58:34+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
