HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

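Summary

Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to produce accurate yaw, pitch, and roll angles.
Previous HPE models are mainly CNN-based, take cropped close-up images of human heads as input, and often lack robustness in real-world scenarios.
Vision Language Models (VLMs) can analyze an entire image while focusing on specific objects through their attention mechanisms.
This paper proposes a new framework that improves HPE accuracy by leveraging the object detection grounding capability of a VLM called CogVLM.
The authors find empirically that fine-tuning this VLM directly with LoRA for the HPE task does not reach the desired HPE accuracy. Some model merging methods improve accuracy, but they frequently produce invalid responses that blend the output formats of the two tasks, and they struggle to handle object detection and HPE at the same time.
To integrate HPE capability into CogVLM effectively, the authors develop a new LoRA layer-based model merging method.
This merging approach applies a high cosine similarity threshold and a winner-takes-all layer selection strategy, aligning attention to the HPE task while preserving the original object detection knowledge.
It resolves the problem of blended invalid response formats and improves accuracy.
The results show that HPE-CogVLM reduces Mean Absolute Error by 31.5% in cross-dataset evaluation compared with 6DRepNet, the current state-of-the-art CNN model.
In addition, HPE-CogVLM outperforms both the directly LoRA fine-tuned VLM and the task arithmetic-based merged VLM across all HPE metrics.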
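
The summary names two ingredients of the merging method, a cosine similarity threshold and winner-takes-all layer selection, without giving the exact rule. The sketch below is only one possible reading, not the authors' algorithm: each layer is taken whole from one of two models instead of blending weights. The function names, the threshold value of 0.95, the toy weights, and the direction of the selection rule are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_layers(grounding, hpe, threshold=0.95):
    """Select every layer whole from one model (no weight interpolation).

    Assumed rule, not taken from the paper: when the two versions of a
    layer are nearly parallel (similarity >= threshold) the HPE version
    wins; otherwise the original grounding layer is kept.
    """
    merged, chosen = {}, {}
    for name in grounding:
        sim = cosine_similarity(grounding[name], hpe[name])
        winner = "hpe" if sim >= threshold else "grounding"
        source = hpe if winner == "hpe" else grounding
        merged[name] = list(source[name])
        chosen[name] = winner
    return merged, chosen


# Toy flattened weights; the values are made up for illustration.
grounding = {"layer0": [1.0, 0.0], "layer1": [1.0, 1.0]}
hpe = {"layer0": [0.0, 1.0], "layer1": [1.1, 0.9]}

merged, chosen = merge_layers(grounding, hpe)
print(chosen)
```

Because every layer comes from exactly one model, the merged weights never mix the two task-specific parameter sets within a layer, which is the property the term "winner-takes-all" suggests.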

Summary (original)

Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that directly fine-tuning this VLM with LoRA for the HPE task fails to achieve desirable HPE accuracy, while some model merging methods can improve accuracy but frequently produce blended invalid response formats, struggling to handle both object detection and HPE tasks simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This merging approach applies a high cosine similarity threshold and a winner-takes-all layer selection strategy, aligning attention to the HPE task while preserving original object detection knowledge. It successfully resolves issues with blended invalid response formats and improves accuracy. Results show that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error over the current state-of-the-art CNN model, 6DRepNet, in cross-dataset evaluation. Furthermore, HPE-CogVLM outperforms both directly LoRA fine-tuned and task arithmetic-based merged VLMs across all HPE metrics.
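
To make the headline metric concrete, here is a minimal sketch of how a Mean Absolute Error over yaw, pitch, and roll, and a relative reduction between two models, can be computed. The angle values are invented for illustration and are not results from the paper; the function names are hypothetical, and the sketch assumes angles in degrees with no wraparound handling.

```python
def mean_absolute_error(predicted, actual):
    """MAE in degrees, averaged over every yaw, pitch, and roll value."""
    total, count = 0.0, 0
    for pred, true in zip(predicted, actual):
        for p, t in zip(pred, true):
            total += abs(p - t)
            count += 1
    return total / count


def relative_reduction(baseline_mae, new_mae):
    """Percentage by which new_mae is lower than baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae * 100.0


# Made-up (yaw, pitch, roll) angles in degrees, for illustration only.
truth = [(10.0, -5.0, 0.0), (30.0, 12.0, -4.0)]
baseline_pred = [(14.0, -9.0, 4.0), (26.0, 16.0, -8.0)]
improved_pred = [(13.0, -8.0, 3.0), (27.0, 15.0, -7.0)]

baseline_mae = mean_absolute_error(baseline_pred, truth)
improved_mae = mean_absolute_error(improved_pred, truth)
reduction = relative_reduction(baseline_mae, improved_mae)
print(baseline_mae, improved_mae, reduction)
```

With these toy numbers the baseline error is 4 degrees on every angle and the improved error is 3 degrees, so the relative reduction is 25%. The reported 31.5% figure is the same kind of ratio, computed on the paper's own cross-dataset evaluation.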

arXiv information

Authors	Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu
Published	2024-11-08 17:33:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
