Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models

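Abstract

Current vision-language multimodal models are well suited to general visual understanding tasks.
However, they perform poorly on complex visual tasks involving human poses and actions, because specialized vision-language instruction-following data is lacking.
We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling a more precise understanding of human-centric scenes.
Our approach constructs a dataset of 200,328 samples tailored to fine-tuning models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning.
We establish a benchmark, the Human Pose and Action Understanding Benchmark (HPAUB), to assess model performance on human pose and action understanding.
We fine-tune the LLaVA-1.5-7B model on this dataset and evaluate it on the benchmark, achieving significant improvements.
Experimental results show an overall improvement of 21.18% over the original LLaVA-1.5-7B model.
These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models.
Code is available at https://github.com/Ody-trek/Keypoint-Instruction-Tuning.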

Abstract (original)

Current vision-language multimodal models are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish a benchmark called Human Pose and Action Understanding Benchmark (HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate it on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 21.18% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models. Code is available at https://github.com/Ody-trek/Keypoint-Instruction-Tuning.
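
To make the idea of integrating keypoints with captions and bounding boxes more concrete, the following is a minimal sketch of how one person's annotations might be serialized into a text context for generating instruction-following data. It is an illustration only, not the authors' pipeline: the function name `format_person_context`, the COCO-style 17-keypoint layout with (x, y, visibility) triplets, the `[x, y, w, h]` box convention, and the output wording are all assumptions, since the abstract does not specify any of them.

```python
from typing import Sequence

# Assumed keypoint layout: the COCO 17-keypoint ordering.
# The abstract does not name a keypoint format; this is a stand-in.
COCO_KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


def format_person_context(
    caption: str,
    bbox: Sequence[float],
    keypoints: Sequence[float],
) -> str:
    """Serialize a caption, a bounding box, and keypoints into plain text.

    `keypoints` is a flat sequence of (x, y, visibility) triplets, one per
    name in COCO_KEYPOINT_NAMES. Keypoints with visibility 0 are treated as
    unlabeled and omitted (hypothetical convention for this sketch).
    """
    if len(keypoints) != 3 * len(COCO_KEYPOINT_NAMES):
        raise ValueError("expected 17 (x, y, visibility) triplets")

    box_text = ", ".join(f"{value:.0f}" for value in bbox)
    lines = [
        f"Caption: {caption}",
        f"Person bounding box [x, y, w, h]: [{box_text}]",
        "Keypoints:",
    ]
    for index, name in enumerate(COCO_KEYPOINT_NAMES):
        x, y, visibility = keypoints[3 * index: 3 * index + 3]
        if visibility > 0:  # skip unlabeled keypoints
            lines.append(f"{name}: ({x:.0f}, {y:.0f})")
    return "\n".join(lines)
```

A context string of this kind could then be placed in a prompt that asks a language model to write conversation, detailed description, or complex reasoning samples about the person; how the authors actually build their prompts is not stated in the abstract.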

arXiv information

Authors	Dewen Zhang, Wangpeng An, Hayaru Shouno
Published	2025-06-02 09:12:50+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
