Vision-Language Instruction Tuning: A Review and Analysis

Abstract

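Instruction tuning is an essential supervised training phase for large language models (LLMs); its goal is to strengthen an LLM's ability to generalize instruction execution and adapt to user preferences.
As multi-modal data is increasingly incorporated into LLMs, interest is growing in the performance of vision-language instruction tuning, which exhibits more complex features than pure text instructions.
In this paper, we systematically review the latest vision-language instruction tuning settings and datasets for multi-modal LLMs and summarize the characteristics that high-quality vision-language tuning data should have.
We treat these characteristics as the foundational principles for constructing vision-language instruction data and propose a complete construction pipeline consisting of data collection, instruction generation, and quality control modules that incorporate carefully designed indicators for evaluating instruction properties.
Based on the instruction data we constructed, we perform vision-language instruction tuning on three widely used multi-modal LLMs and conduct extensive experiments on the corresponding metrics to demonstrate the rationality of the proposed construction principles.
The code and dataset related to this paper are open-sourced at \url{https://github.com/palchenli/VL-Instruction-Tuning}.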

Abstract (original)

Instruction tuning is an essential supervised training phase for Large Language Models (LLMs), with the goal of enhancing LLMs’ capacity to generalize instruction execution and adapt to user preferences. With the growing incorporation of multi-modal data into LLMs, there is an increasing interest in the performance of vision-language instruction tuning which presents more complex features in comparison to pure text instructions. In this paper, we systematically review the latest vision-language instruction tuning settings and datasets in multi-modal LLMs and summarize the characteristics that high-quality vision-language tuning data should have. We consider these characteristics as the foundational principles for constructing vision-language instruction data and propose a complete construction pipeline consisting of data collection, instruction generation, and quality control modules that incorporate meticulously designed instruction property evaluation indicators. We perform vision-language instruction tuning on three widely used multi-modal LLMs based on the instruction data we constructed and conduct extensive experiments on the corresponding metrics to demonstrate the rationality of the construction principles proposed in this paper. The code and dataset related to this paper have been open-sourced at \url{https://github.com/palchenli/VL-Instruction-Tuning}.

arXiv information

Authors	Chen Li, Yixiao Ge, Dian Li, Ying Shan
Published	2023-11-14 14:02:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
