Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning

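Summary

The hypothesis that pretrained large language models (LLMs) need only minimal supervision during the supervised fine-tuning (SFT) stage (Zhou et al., 2024) has been supported by recent advances in data curation and selection research.
However, the stability and generalizability of these selection methods are compromised by their sensitivity to experimental setups and validation protocols, and they fall short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b).
Multi-modal LLMs (MLLMs), which are built on LLMs, bring a sheer token volume and a heightened heterogeneity of data sources that amplify both the importance and the complexity of data selection.
To harvest multi-modal instruction data robustly and efficiently, we redefine the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate each data candidate on those capabilities.
To promote diversity in light of the inherent objective of the alignment stage, we adopt interaction style as the diversity indicator and use a multi-modal rich styler to identify data instruction patterns.
In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms.
Free from embedding-based clustering and greedy sampling, mmSSR scales efficiently to millions of samples under varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation.
Across more than 10 experimental settings, validated on 14 multi-modal benchmarks, we show consistent improvements over random sampling, baseline strategies, and state-of-the-art selection methods, reaching 99.1% of full-data performance with only 30% of the 2.6M data.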

Summary (original)

The hypothesis that pretrained large language models (LLMs) necessitate only minimal supervision during the fine-tuning (SFT) stage (Zhou et al., 2024) has been substantiated by recent advancements in data curation and selection research. However, their stability and generalizability are compromised due to the vulnerability to experimental setups and validation protocols, falling short of surpassing random sampling (Diddee & Ippolito, 2024; Xia et al., 2024b). Built upon LLMs, multi-modal LLMs (MLLMs), combined with the sheer token volume and heightened heterogeneity of data sources, amplify both the significance and complexity of data selection. To harvest multi-modal instructional data in a robust and efficient manner, we re-define the granularity of the quality metric by decomposing it into 14 vision-language-related capabilities, and introduce multi-modal rich scorers to evaluate the capabilities of each data candidate. To promote diversity, in light of the inherent objective of the alignment stage, we take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. In doing so, our multi-modal rich scorers and styler (mmSSR) guarantee that high-scoring information is conveyed to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of data with varying budget constraints, supports customization for general or specific capability acquisition, and facilitates training-free generalization to new domains for curation. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance with only 30% of the 2.6M data.
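
The abstract describes selecting data by per-capability scores while spreading the picks across interaction styles. The sketch below is one minimal way to illustrate that idea, not the paper's procedure: the abstract does not give mmSSR's actual sampling rule, so the function `select_subset`, the record fields (`id`, `style`, `scores`), the style and capability names, and the round-robin rule are all assumptions made for illustration. It simply cycles over styles and takes the highest remaining score in each until the budget is spent.

```python
from collections import defaultdict


def select_subset(samples, capability, budget):
    """Pick up to `budget` sample ids, cycling over styles and taking the
    highest `capability` score still available in each style.

    Hypothetical illustration only; not the mmSSR sampling rule.
    """
    # Group candidates by their (assumed) interaction-style label.
    by_style = defaultdict(list)
    for sample in samples:
        by_style[sample["style"]].append(sample)

    # Within each style, order candidates from highest to lowest score.
    for group in by_style.values():
        group.sort(key=lambda s: s["scores"][capability], reverse=True)

    selected = []
    styles = sorted(by_style)  # fixed order keeps the result deterministic
    rank = 0
    while len(selected) < budget:
        took_any = False
        for style in styles:
            group = by_style[style]
            if rank < len(group) and len(selected) < budget:
                selected.append(group[rank]["id"])
                took_any = True
        if not took_any:
            break  # every style is exhausted before the budget is reached
        rank += 1
    return selected


# Toy candidates with made-up scores and style labels.
samples = [
    {"id": "a", "style": "multi-choice", "scores": {"ocr": 5}},
    {"id": "b", "style": "multi-choice", "scores": {"ocr": 3}},
    {"id": "c", "style": "free-form", "scores": {"ocr": 4}},
    {"id": "d", "style": "free-form", "scores": {"ocr": 1}},
    {"id": "e", "style": "multi-choice", "scores": {"ocr": 4}},
]

picked = select_subset(samples, capability="ocr", budget=3)
print(picked)
```

With a budget of 3, the toy run takes the best sample from each style first and then returns to the first style, so a low-scoring sample can be chosen ahead of a higher-scoring one from an already represented style. That is the quality-versus-diversity trade-off this sketch is meant to show. At the scale quoted in the abstract, a 30% budget over 2.6M candidates corresponds to roughly 780K selected samples.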

arXiv information

Authors	Mengyao Lyu, Yan Li, Huasong Zhong, Wenhao Yang, Hui Chen, Jungong Han, Guiguang Ding, Zhenheng Yang
Published	2025-03-17 17:11:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
