AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

Summary

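Recently, interest has grown in collecting reasoning-intensive pretraining data to improve the complex reasoning ability of LLMs.
Prior approaches typically rely on supervised classifiers to identify such data; these require labels from humans or LLMs and often introduce domain-specific biases.
Because attention heads are crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method that needs no supervision signal.
Our approach lets a small pretrained language model act as a strong data selector through a simple attention head masking operation.
Specifically, we identify retrieval heads and compute the loss difference when those heads are masked.
We apply AttentionInfluence to a 1.3B-parameter dense model to select data from the 241B-token SmolLM corpus, then mix the SmolLM corpus with the selected 73B-token subset to pretrain a 7B-parameter dense model on 1T training tokens with WSD learning rate scheduling.
Our experiments show substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval).
This demonstrates an effective weak-to-strong scaling property: a small model improves the final performance of a larger one, offering a promising and scalable path for reasoning-centric data selection.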
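The masking-and-loss-difference step described in the summary can be illustrated with a small sketch. This is not the authors' implementation: the toy one-layer model `TinyAttentionLM`, the choice to "mask" a head by replacing its attention weights with a uniform causal average, the relative form of the score `(masked_loss - base_loss) / base_loss`, and the helper names `attention_influence` and `select_top` are all assumptions made for illustration. The indices of the retrieval heads are taken as given here; the summary does not say how they are found.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TinyAttentionLM:
    """Toy one-layer causal attention LM with random weights (hypothetical stand-in
    for the small pretrained model; it is not trained on anything)."""

    def __init__(self, vocab=50, d_model=32, n_heads=4, seed=0):
        rng = np.random.default_rng(seed)
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        scale = d_model ** -0.5
        self.emb = rng.normal(0, 1, (vocab, d_model))
        self.wq = rng.normal(0, scale, (d_model, d_model))
        self.wk = rng.normal(0, scale, (d_model, d_model))
        self.wv = rng.normal(0, scale, (d_model, d_model))
        self.wo = rng.normal(0, scale, (d_model, d_model))
        self.unemb = rng.normal(0, scale, (d_model, vocab))

    def loss(self, tokens, masked_heads=()):
        """Mean next-token cross-entropy; heads in `masked_heads` are disabled."""
        tokens = np.asarray(tokens)
        x = self.emb[tokens[:-1]]                      # (T, d_model)
        T = x.shape[0]

        def split(m):                                  # (T, d_model) -> (H, T, d_head)
            return m.reshape(T, self.n_heads, self.d_head).transpose(1, 0, 2)

        q, k, v = split(x @ self.wq), split(x @ self.wk), split(x @ self.wv)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(self.d_head)
        causal = np.tril(np.ones((T, T), dtype=bool))
        attn = softmax(np.where(causal, scores, -np.inf))
        # Assumption: a masked head attends uniformly over the visible positions.
        uniform = causal / causal.sum(axis=-1, keepdims=True)
        for h in masked_heads:
            attn[h] = uniform
        ctx = (attn @ v).transpose(1, 0, 2).reshape(T, -1)
        logits = (x + ctx @ self.wo) @ self.unemb
        logp = np.log(softmax(logits))
        return float(-logp[np.arange(T), tokens[1:]].mean())

def attention_influence(model, tokens, heads):
    """Relative loss increase when `heads` are masked (assumed form of the score)."""
    base = model.loss(tokens)
    masked = model.loss(tokens, masked_heads=heads)
    return (masked - base) / base

def select_top(model, docs, heads, keep_fraction):
    """Indices of the documents with the highest scores, in ascending index order."""
    scores = np.array([attention_influence(model, d, heads) for d in docs])
    k = max(1, int(len(docs) * keep_fraction))
    return sorted(np.argsort(scores)[::-1][:k].tolist())
```

In this sketch, a document whose loss rises sharply once the chosen heads are masked is treated as relying more on those heads, so it ranks higher for selection. With a real pretrained model the same two forward passes (with and without masking) would replace `TinyAttentionLM.loss`.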

Summary (original)

Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs’ complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs, often introducing domain-specific biases. Due to the attention heads being crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method without supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models, offering a promising and scalable path for reasoning-centric data selection.

arXiv information

Authors	Kai Hua, Steven Wu, Ge Zhang, Ke Shen
Published	2025-05-12 07:25:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
