LongReward: Improving Long-context Large Language Models with AI Feedback

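Abstract

Although significant progress has been made in developing long-context large language models (LLMs), the lower quality of LLM-synthesized data used for supervised fine-tuning (SFT) often degrades the long-context performance of SFT models and imposes inherent limitations.
In principle, reinforcement learning (RL) with appropriate reward signals can further enhance a model's capabilities.
However, how to obtain reliable rewards in long-context scenarios remains unexplored.
To this end, we propose LongReward, a novel method that uses an off-the-shelf LLM to provide rewards for long-context model responses along four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline.
By combining LongReward with the offline RL algorithm DPO, long-context SFT models can be improved effectively.
Our experiments show that LongReward not only significantly improves models' long-context performance but also strengthens their ability to follow short instructions.
We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting the performance of either.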

Abstract (original)

Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models’ capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses from four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward and offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models’ long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one’s performance.
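The abstract's combination of a four-dimension reward with DPO can be illustrated with a minimal Python sketch. The abstract does not say how the four scores are aggregated or how preference pairs are formed, so the unweighted mean, the best-versus-worst pairing rule, the score scale, and all names below (`ScoredResponse`, `overall_reward`, `build_preference_pair`) are assumptions made for illustration only. The `dpo_loss` function is the standard per-pair DPO objective, not code from the paper.

```python
import math
from dataclasses import dataclass

# The four dimensions named in the abstract.
DIMENSIONS = ("helpfulness", "logicality", "faithfulness", "completeness")


@dataclass
class ScoredResponse:
    text: str
    scores: dict  # hypothetical per-dimension scores from a judge LLM (scale assumed)


def overall_reward(resp: ScoredResponse) -> float:
    """Assumed aggregation: unweighted mean over the four dimensions."""
    return sum(resp.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def build_preference_pair(candidates):
    """Assumed pairing rule: highest reward is 'chosen', lowest is 'rejected'."""
    ranked = sorted(candidates, key=overall_reward)
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy and under a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

In this sketch, several sampled responses to the same long-context prompt would each be scored, `build_preference_pair` would select the pair to train on, and `dpo_loss` would be minimized over such pairs. When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy favors the chosen response more than the reference does.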

arXiv information

Authors	Jiajie Zhang,Zhongni Hou,Xin Lv,Shulin Cao,Zhenyu Hou,Yilin Niu,Lei Hou,Yuxiao Dong,Ling Feng,Juanzi Li
Published	2024-10-28 17:50:42+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
