Improving Neutral Point of View Text Generation through Parameter-Efficient Reinforcement Learning and a Small-Scale High-Quality Dataset

Summary

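This paper describes the construction of a dataset and the evaluation of training methods for improving the ability of generative large language models (LLMs) to answer queries on sensitive topics with a Neutral Point of View (NPOV).
The dataset, SHQ-NPOV, consists of 300 high-quality, human-written quadruplets: a query on a sensitive topic, an answer, an NPOV rating, and a set of links to source texts that elaborate the various points of view.
The first key contribution of this paper is a new methodology for creating such datasets through iterative rounds of human peer critique and annotator training, which we release alongside the dataset.
The second key contribution is the identification of a highly effective training regime for parameter-efficient reinforcement learning (PE-RL) to improve NPOV generation.
We compare and evaluate PE-RL against multiple baselines, including LoRA finetuning (a strong baseline), SFT, and RLHF.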
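PE-RL not only improves overall NPOV quality compared to the strongest baseline ($97.06\% \rightarrow 99.08\%$), but also scores much higher on the features that linguists identify as key to separating good answers from the best answers ($60.25\% \rightarrow 85.21\%$ for presence of supportive details, $68.74\% \rightarrow 91.43\%$ for absence of oversimplification).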
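A qualitative analysis corroborates this.
Finally, our evaluation finds no statistical difference between results on topics that appear in the training dataset and results on separate evaluation topics, which is strong evidence that our approach to training PE-RL generalizes very effectively to out-of-topic queries.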
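
To make the size of the reported improvements easier to compare, the short sketch below computes the gain in percentage points for each pair of figures quoted in the abstract. Only the six percentages come from the paper; the dictionary layout and the helper name `gain_points` are illustrative assumptions.

```python
# (strongest baseline %, PE-RL %) pairs as quoted in the abstract.
reported = {
    "overall NPOV quality": (97.06, 99.08),
    "presence of supportive details": (60.25, 85.21),
    "absence of oversimplification": (68.74, 91.43),
}


def gain_points(baseline: float, pe_rl: float) -> float:
    """Absolute gain in percentage points (hypothetical helper)."""
    return round(pe_rl - baseline, 2)


gains = {name: gain_points(b, p) for name, (b, p) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

Read this way, the overall score moves by about two points, while the two fine-grained features move by more than twenty points each.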

Summary (original)

This paper describes the construction of a dataset and the evaluation of training methods to improve generative large language models’ (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e., to provide significantly more informative, diverse and impartial answers. The dataset, the SHQ-NPOV dataset, comprises 300 high-quality, human-written quadruplets: a query on a sensitive topic, an answer, an NPOV rating, and a set of links to source texts elaborating the various points of view. The first key contribution of this paper is a new methodology to create such datasets through iterative rounds of human peer-critique and annotator training, which we release alongside the dataset. The second key contribution is the identification of a highly effective training regime for parameter-efficient reinforcement learning (PE-RL) to improve NPOV generation. We compare and extensively evaluate PE-RL and multiple baselines, including LoRA finetuning (a strong baseline), SFT and RLHF. PE-RL not only improves on overall NPOV quality compared to the strongest baseline ($97.06\%\rightarrow 99.08\%$), but also scores much higher on features linguists identify as key to separating good answers from the best answers ($60.25\%\rightarrow 85.21\%$ for presence of supportive details, $68.74\%\rightarrow 91.43\%$ for absence of oversimplification). A qualitative analysis corroborates this. Finally, our evaluation finds no statistical differences between results on topics that appear in the training dataset and those on separated evaluation topics, which provides strong evidence that our approach to training PE-RL exhibits very effective out-of-topic generalization.
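
As a rough illustration of the quadruplet structure described above, one record could be modeled as follows. This is a minimal sketch, not the released dataset schema: the class name `NPOVExample`, every field name, and the integer type of the rating are assumptions, since the abstract does not specify the file format or the rating scale.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class NPOVExample:
    """One dataset quadruplet; all names here are hypothetical."""

    query: str        # query on a sensitive topic
    answer: str       # human-written answer
    npov_rating: int  # NPOV rating; the scale is not given in the abstract
    source_links: List[str] = field(default_factory=list)  # links to source texts

    def __post_init__(self) -> None:
        # Reject records with an empty query or answer.
        if not self.query.strip() or not self.answer.strip():
            raise ValueError("query and answer must be non-empty")


# Placeholder values only; this is not real data from SHQ-NPOV.
example = NPOVExample(
    query="<query on a sensitive topic>",
    answer="<answer>",
    npov_rating=0,
    source_links=["https://example.org/perspective-a"],
)
```

Keeping the source links as a list on each record mirrors the abstract's description of "a set of links" per answer.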

arXiv information

Authors	Jessica Hoffmann, Christiane Ahlheim, Zac Yu, Aria Walfrand, Jarvis Jin, Marie Tano, Ahmad Beirami, Erin van Liemt, Nithum Thain, Hakim Sidahmed, Lucas Dixon
Published	2025-03-05 16:32:47+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
