Personalized Visual Instruction Tuning

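Summary

Multimodal large language models (MLLMs) have made significant progress in recent years.
However, these models have a notable limitation, referred to as "face blindness".
Specifically, they can take part in general conversations but cannot conduct personalized dialogues about specific individuals.
This deficiency hinders the use of MLLMs in personalized settings, such as tailored visual assistants on mobile devices or domestic robots that need to recognize family members.
This paper introduces Personalized Visual Instruction Tuning (PVIT), a new data curation and training framework designed to enable MLLMs to identify target individuals in an image and engage in personalized, coherent dialogue.
Our approach involves developing a sophisticated pipeline that autonomously generates training data containing personalized conversations.
The pipeline draws on the capabilities of various visual experts, image generation models, and (multimodal) large language models.
To evaluate the personalization capability of MLLMs, we present a benchmark called P-Bench, which covers various question types at different levels of difficulty.
Experiments show that personalized performance improves substantially after fine-tuning on the curated dataset.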

Abstract (original)

Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as ‘face blindness’. Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeting at specific individuals. This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices, or domestic robots that need to recognize members of the family. In this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image and engage in personalized and coherent dialogues. Our approach involves the development of a sophisticated pipeline that autonomously generates training data containing personalized conversations. This pipeline leverages the capabilities of various visual experts, image generation models, and (multi-modal) large language models. To evaluate the personalized potential of MLLMs, we present a benchmark called P-Bench, which encompasses various question types with different levels of difficulty. The experiments demonstrate a substantial personalized performance enhancement after fine-tuning with our curated dataset.

arXiv information

Authors	Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, Tong Zhang
Published	2024-10-09 17:46:53+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
