EMORL: Ensemble Multi-Objective Reinforcement Learning for Efficient and Flexible LLM Fine-Tuning

Summary

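Recent advances in reinforcement learning (RL) for fine-tuning large language models (LLMs) show promise on multi-objective tasks, but they still face significant challenges: complex objective balancing, low training efficiency, poor scalability, and limited explainability.
Drawing on ensemble learning principles, we introduce the Ensemble Multi-Objective RL (EMORL) framework, which fine-tunes multiple models on individual objectives and optimizes their aggregation after training to improve efficiency and flexibility.
Our method is the first to aggregate the last hidden states of the individual models, incorporating contextual information from multiple objectives.
This approach is supported by a hierarchical grid search algorithm that identifies the optimal weighted combinations.
We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning.
Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate EMORL's advantages over existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.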
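
The aggregation of last hidden states mentioned in the summary can be pictured with a small sketch. It assumes the aggregation is a plain linear weighted sum of one hidden-state vector per objective-specific model; the summary does not spell out the exact operation, any weight constraints, or how the combined state is decoded. The function name `aggregate_hidden_states` and the toy vectors are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def aggregate_hidden_states(
    hidden_states: Sequence[Vector], weights: Sequence[float]
) -> Vector:
    """Weighted sum of per-model last hidden states (one vector per model).

    Assumption: a linear combination; the source only says the states
    are aggregated with weights.
    """
    if len(hidden_states) != len(weights):
        raise ValueError("need exactly one weight per model")
    dim = len(hidden_states[0])
    if any(len(h) != dim for h in hidden_states):
        raise ValueError("all hidden states must share one dimension")
    return [
        sum(w * h[i] for w, h in zip(weights, hidden_states))
        for i in range(dim)
    ]


# Toy example: three objective-specific models with hidden size 4.
states = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
combined = aggregate_hidden_states(states, weights=[0.5, 0.25, 0.25])
print(combined)
```

In a real setup the vectors would come from the fine-tuned models at each decoding step, and the combined state would still need to be mapped to token logits; both steps are outside this sketch.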

Summary (Original)

Recent advances in reinforcement learning (RL) for large language model (LLM) fine-tuning show promise in addressing multi-objective tasks but still face significant challenges, including complex objective balancing, low training efficiency, poor scalability, and limited explainability. Leveraging ensemble learning principles, we introduce an Ensemble Multi-Objective RL (EMORL) framework that fine-tunes multiple models with individual objectives while optimizing their aggregation after the training to improve efficiency and flexibility. Our method is the first to aggregate the last hidden states of individual models, incorporating contextual information from multiple objectives. This approach is supported by a hierarchical grid search algorithm that identifies optimal weighted combinations. We evaluate EMORL on counselor reflection generation tasks, using text-scoring LLMs to evaluate the generations and provide rewards during RL fine-tuning. Through comprehensive experiments on the PAIR and Psych8k datasets, we demonstrate the advantages of EMORL against existing baselines: significantly lower and more stable training consumption ($17,529\pm 1,650$ data points and $6,573\pm 147.43$ seconds), improved scalability and explainability, and comparable performance across multiple objectives.
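
As a rough illustration of what a hierarchical grid search over aggregation weights might look like, the sketch below runs a generic coarse-to-fine search: it scores a coarse grid over $[0, 1]^n$, then repeatedly halves the search box around the best point found so far. This is not the paper's algorithm, whose details the abstract does not give. The function `hierarchical_grid_search`, its parameters, and the toy scoring function are all hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def hierarchical_grid_search(
    score: Callable[[Sequence[float]], float],
    n_weights: int,
    levels: int = 3,
    points_per_axis: int = 3,
) -> Tuple[List[float], float]:
    """Coarse-to-fine search for weights in [0, 1]^n_weights maximizing score."""
    if points_per_axis < 2:
        raise ValueError("points_per_axis must be at least 2")
    lows = [0.0] * n_weights
    highs = [1.0] * n_weights
    best_w: List[float] = []
    best_s = float("-inf")
    for _ in range(levels):
        # Evenly spaced candidate values per axis inside the current box.
        axes = [
            [lo + (hi - lo) * k / (points_per_axis - 1)
             for k in range(points_per_axis)]
            for lo, hi in zip(lows, highs)
        ]
        for candidate in product(*axes):
            s = score(candidate)
            if s > best_s:
                best_w, best_s = list(candidate), s
        # Halve the box width around the best point, clipped to [0, 1].
        new_lows, new_highs = [], []
        for lo, hi, b in zip(lows, highs, best_w):
            quarter = (hi - lo) / 4
            new_lows.append(max(0.0, b - quarter))
            new_highs.append(min(1.0, b + quarter))
        lows, highs = new_lows, new_highs
    return best_w, best_s


# Toy objective (assumption): best weights lie near (0.7, 0.2).
def toy_score(w: Sequence[float]) -> float:
    return -((w[0] - 0.7) ** 2 + (w[1] - 0.2) ** 2)


best_w, best_s = hierarchical_grid_search(toy_score, n_weights=2)
print(best_w, best_s)
```

In practice `score` would generate text with the weighted ensemble and average the objective scores on held-out prompts, which is far more expensive than the toy function. The appeal of a coarse-to-fine scheme is that it evaluates far fewer weight combinations than a single dense grid at the same final resolution.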

arXiv Information

Authors	Lingxiao Kong, Cong Yang, Susanne Neufang, Oya Deniz Beyan, Zeyd Boukhers
Published	2025-05-06 06:26:11+00:00
arXiv page	arxiv_id(pdf)

Source, Services Used

arxiv.jp, Google
