Weighted-Reward Preference Optimization for Implicit Model Fusion

Summary

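Fusing heterogeneous open-source LLMs with different architectures and sizes could combine the strengths of different models, but existing fusion methods face significant challenges such as vocabulary alignment and merging distribution matrices.
These procedures are not only complex but also prone to introducing noise and errors.
This paper proposes Weighted-Reward Preference Optimization (WRPO), an implicit fusion method that uses preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively.
WRPO removes the need for vocabulary alignment and matrix fusion and scales efficiently to accommodate various LLMs.
To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs.
Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks show that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines.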
With LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard.
The code is available at https://github.com/SLIT-AI/WRPO.
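
The progressive adaptation strategy described above can be illustrated with a small sketch. It is not taken from the paper or its repository. It assumes a DPO-style implicit reward (beta times the log-ratio of policy to reference probability), a fusion weight `alpha` that ramps linearly from 0 to a chosen maximum, and a preferred-side reward that blends the source-LLM response with the target-LLM response. The function names, the linear schedule, and the default values are all hypothetical; the exact objective should be checked against the paper.

```python
import math


def implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Assumed DPO-style implicit reward: beta * log(pi_policy / pi_ref)."""
    return beta * (logp_policy - logp_ref)


def alpha_schedule(step: int, total_steps: int, alpha_max: float = 0.1) -> float:
    """Hypothetical linear ramp of the fusion weight from 0 to alpha_max."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_max * frac


def weighted_reward_loss(r_source: float, r_target: float,
                         r_rejected: float, alpha: float) -> float:
    """Preference loss whose preferred reward mixes source and target responses.

    alpha = 0 relies only on the target LLM's preferred response;
    larger alpha shifts weight toward the source LLM's preferred response.
    """
    chosen = alpha * r_source + (1.0 - alpha) * r_target
    margin = chosen - r_rejected
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Made-up sequence log-probabilities, for illustration only.
    r_src = implicit_reward(-12.0, -15.0)   # preferred response from a source LLM
    r_tgt = implicit_reward(-10.0, -10.5)   # preferred response from the target LLM
    r_rej = implicit_reward(-11.0, -10.0)   # dispreferred response
    for step in (0, 50, 100):
        a = alpha_schedule(step, total_steps=100)
        print(step, a, weighted_reward_loss(r_src, r_tgt, r_rej, a))
```

Starting at `alpha = 0` means early updates are driven by responses the target model already assigns reasonable probability to, which is one way to read the abstract's point about distributional deviations between source and target LLMs.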

Summary (original)

While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. Our code is available at https://github.com/SLIT-AI/WRPO.

arXiv information

Authors	Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
Published	2024-12-04 10:15:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
