Language Imbalance Driven Rewarding for Multilingual Self-improving

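Summary

Large language models (LLMs) have achieved state-of-the-art performance on numerous tasks. However, this progress has mainly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. While this imbalance limits broader application, it also produces a natural preference ranking between languages, which offers an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. We therefore propose $\textit{Language Imbalance Driven Rewarding}$, which uses the inherent imbalance between dominant and non-dominant languages within an LLM as a reward signal. Iterative DPO training shows that this approach improves LLM performance not only in non-dominant languages but also in the dominant language, and thereby yields a reward signal that can be reused across iterations. Fine-tuning Meta-Llama-3-8B-Instruct for two iterations of this approach continuously improved multilingual performance on instruction-following and arithmetic reasoning tasks, with an average win-rate improvement of 7.46% on the X-AlpacaEval leaderboard and an accuracy improvement of 13.9% on the MGSM benchmark. This work is an initial exploration that paves the way for multilingual self-improvement of LLMs.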

Summary (original)

Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited ‘first-class’ languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLM in a self-improving manner. Thus, we propose $\textit{Language Imbalance Driven Rewarding}$, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language’s capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.
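
The abstract does not spell out how the preference pairs are built, so the following is only a minimal sketch under stated assumptions. It assumes that a response produced in the dominant language and translated into the non-dominant language is treated as "chosen", and that the model's direct non-dominant-language response is treated as "rejected". The names `PreferencePair`, `build_pair`, and the `translate` callable are hypothetical and introduced only for illustration. The `dpo_loss` function is the standard per-example DPO objective, computed from summed log-probabilities under the policy and a frozen reference model.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    """One DPO training example (hypothetical container, not from the paper)."""
    prompt: str
    chosen: str
    rejected: str


def build_pair(
    prompt_nondominant: str,
    response_dominant: str,
    response_nondominant: str,
    translate: Callable[[str], str],
) -> PreferencePair:
    """Assumed pairing rule: the dominant-language answer, translated into the
    non-dominant language, is preferred over the direct non-dominant answer."""
    return PreferencePair(
        prompt=prompt_nondominant,
        chosen=translate(response_dominant),
        rejected=response_nondominant,
    )


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)), where
    margin = (policy - reference) log-ratio of chosen minus that of rejected."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

In this reading, an iteration would generate pairs with the current model, minimize `dpo_loss` over them, and then repeat with the updated model; the value of `beta` shown here is an arbitrary placeholder, not a setting taken from the paper.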

arXiv information

Authors	Wen Yang,Junhong Wu,Chen Wang,Chengqing Zong,Jiajun Zhang
Published	2024-11-01 15:53:08+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
