USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions

Abstract (original)

The integration of vision-language models into robotic systems constitutes a significant advancement in enabling machines to interact with their surroundings in a more intuitive manner. While VLMs offer rich multimodal reasoning, existing approaches lack user-specific adaptability, often relying on generic interaction paradigms that fail to account for individual behavioral, contextual, or socio-emotional nuances. When customization is attempted, ethical concerns arise from unmitigated biases in user data, risking exclusion or unfair treatment. To address these dual challenges, we propose User-VLM 360°, a holistic framework integrating multimodal user modeling with bias-aware optimization. Our approach features: (1) user-aware tuning that adapts interactions in real time using visual-linguistic signals; (2) bias mitigation via preference optimization; and (3) curated 360° socio-emotive interaction datasets annotated with demographic, emotion, and relational metadata. Evaluations across eight benchmarks demonstrate state-of-the-art results: +35.3% F1 in personalized VQA, +47.5% F1 in facial features understanding, 15% bias reduction, and 30X speedup over baselines. Ablation studies confirm component efficacy, and deployment on the Pepper robot validates real-time adaptability across diverse users. We open-source parameter-efficient 3B/10B models and an ethical verification framework for responsible adaptation.
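The abstract does not say which preference-optimization objective is used for bias mitigation. As an illustrative assumption only, the sketch below implements the widely used Direct Preference Optimization (DPO) loss for preference pairs. In a bias-mitigation setting, the "chosen" response would be the less biased answer and the "rejected" response the biased one. The function names (`softplus`, `dpo_loss`, `mean_dpo_loss`), the argument names, and the default `beta=0.1` are hypothetical and do not come from the authors' code.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair (illustrative sketch).

    Each argument is a sequence log-probability log p(y | x), taken either
    from the model being tuned (policy) or from a frozen reference copy.
    Assumption: "chosen" = less biased response, "rejected" = biased one.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) == softplus(-margin), computed without overflow
    return softplus(-margin)


def mean_dpo_loss(pairs, beta: float = 0.1) -> float:
    """Average DPO loss over (policy_chosen, policy_rejected,
    ref_chosen, ref_rejected) log-probability tuples."""
    losses = [dpo_loss(*p, beta=beta) for p in pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical log-probabilities for illustration, not data from the paper.
    batch = [(-12.0, -15.0, -13.0, -14.0), (-20.0, -18.0, -19.5, -18.5)]
    print(round(mean_dpo_loss(batch, beta=0.1), 4))
```

When policy and reference agree, the margin is zero and the loss equals log 2. The loss falls as the tuned model raises the relative likelihood of the preferred (less biased) answer. Writing `-log(sigmoid(m))` as `softplus(-m)` keeps the computation finite even for very large margins.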

arXiv information

Authors	Hamed Rahimi, Adil Bahaj, Mouad Abrini, Mahdi Khoramshahi, Mounir Ghogho, Mohamed Chetouani
Published	2025-02-28 09:38:19+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
