Less is More: Data Value Estimation for Visual Instruction Tuning

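Summary

Visual instruction tuning is key to building multimodal large language models (MLLMs) and greatly improves the reasoning capabilities of large language models (LLMs) in vision scenarios.
However, existing MLLMs mostly rely on a mixture of multiple highly diverse visual instruction datasets for training (sometimes more than a million instructions), which may introduce data redundancy.
To investigate this issue, we conducted a series of empirical studies. They reveal significant redundancy within visual instruction datasets and show that greatly reducing the amount of data from several instruction datasets does not affect performance.
Based on these findings, we propose TIVE, a new data selection approach for eliminating redundancy within visual instruction data.
TIVE first estimates the task-level and instance-level value of the visual instructions based on computed gradients.
Then, according to the estimated values, TIVE determines the proportion of each task within the visual instructions and selects representative instances to compose a smaller visual instruction subset for training.
Experiments on LLaVA-1.5 show that our approach, using only about 7.5% of the data, achieves performance comparable to the full-data fine-tuned model across seven benchmarks and surpasses it on four of them.
Our code and data will be publicly released.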
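
The two-stage selection described above can be illustrated with a small sketch. The summary does not give TIVE's formulas, so everything below is an assumption made for illustration and not the authors' implementation: the task-level value is taken to be the mean gradient norm of a task's instances, the instance-level value is taken to be the cosine similarity between an instance's gradient and its task's mean gradient, and the highest-scoring instances are chosen deterministically. The names `select_subset`, `grads_by_task`, and `budget` are hypothetical, and the gradient vectors are assumed to be precomputed.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom > 0 else 0.0


def mean_vector(vectors):
    # Element-wise mean of equally sized vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def select_subset(grads_by_task, budget):
    """Pick roughly `budget` instances across tasks.

    grads_by_task maps a task name to a non-empty list of per-instance
    gradient vectors. Returns a dict of task name -> sorted instance indices.
    """
    # Assumed task-level value: mean gradient norm of the task's instances.
    task_value = {
        task: sum(norm(g) for g in grads) / len(grads)
        for task, grads in grads_by_task.items()
    }
    total = sum(task_value.values())

    selected = {}
    for task, grads in grads_by_task.items():
        # Task proportion follows the task-level value (uniform if all are zero).
        share = task_value[task] / total if total > 0 else 1.0 / len(grads_by_task)
        # Rounding means the quotas may not sum exactly to `budget`.
        quota = min(len(grads), round(budget * share))

        # Assumed instance-level value: alignment with the task's mean gradient.
        center = mean_vector(grads)
        ranked = sorted(
            range(len(grads)),
            key=lambda i: cosine(grads[i], center),
            reverse=True,
        )
        selected[task] = sorted(ranked[:quota])
    return selected


# Toy gradients for two made-up tasks.
toy_grads = {
    "vqa": [[2.0, 0.0], [1.8, 0.2], [0.0, 2.0], [2.0, 0.2]],
    "caption": [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
}
print(select_subset(toy_grads, budget=3))
```

In this toy setup the task with larger gradients receives the larger share of the budget, and within each task the instances whose gradients point away from the task average are dropped first. A real pipeline would obtain the gradients from the model being tuned and would likely need to reduce their dimensionality before comparing them.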

Abstract (original)

Visual instruction tuning is the key to building multimodal large language models (MLLMs), which greatly improves the reasoning capabilities of large language models (LLMs) in vision scenario. However, existing MLLMs mostly rely on a mixture of multiple highly diverse visual instruction datasets for training (even more than a million instructions), which may introduce data redundancy. To investigate this issue, we conduct a series of empirical studies, which reveal a significant redundancy within the visual instruction datasets, and show that greatly reducing the amount of several instruction dataset even do not affect the performance. Based on the findings, we propose a new data selection approach TIVE, to eliminate redundancy within visual instruction data. TIVE first estimates the task-level and instance-level value of the visual instructions based on computed gradients. Then, according to the estimated values, TIVE determines the task proportion within the visual instructions, and selects representative instances to compose a smaller visual instruction subset for training. Experiments on LLaVA-1.5 show that our approach using only about 7.5% data can achieve comparable performance as the full-data fine-tuned model across seven benchmarks, even surpassing it on four of the benchmarks. Our code and data will be publicly released.

arXiv information

Authors	Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen
Published	2024-03-14 16:47:25+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
