Lessons from Training Grounded LLMs with Verifiable Rewards

Summary

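Generating grounded, trustworthy responses remains a key challenge for large language models (LLMs).
Retrieval-augmented generation (RAG) with citation-based grounding holds promise, but instruction-tuned models frequently fail even in straightforward scenarios.
This work explores how reinforcement learning (RL) and internal reasoning can strengthen grounding in LLMs.
Models are trained with the GRPO (Group Relative Policy Optimization) method using verifiable outcome-based rewards.
Comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses.
A two-stage training setup, which first optimizes answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal.
The work also revisits instruction tuning via GPT-4 distillation and finds that combining it with GRPO improves performance on long-form generative QA tasks.
Overall, the findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.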

Summary (original)

Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
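As a rough illustration of the group-relative idea behind GRPO, the sketch below scores several sampled responses to one query and normalizes each reward against its own group. The three boolean checks mirror the reward targets named in the abstract (answer correctness, citation sufficiency, refusal quality), but the equal-weight sum in `combined_reward`, the use of the population standard deviation, and all function names are assumptions made for this example, not the authors' reward design.

```python
from statistics import mean, pstdev


def combined_reward(answer_correct, citations_sufficient, refusal_appropriate):
    # Hypothetical equal-weight sum of three verifiable checks.
    # The paper's actual reward definition is not given in the abstract.
    return (
        float(answer_correct)
        + float(citations_sufficient)
        + float(refusal_appropriate)
    )


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward by the mean and standard deviation of the
    # group of responses sampled for the same query. eps avoids division
    # by zero when every response in the group gets the same reward.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one query, scored by the hypothetical checks.
group_rewards = [
    combined_reward(True, True, True),
    combined_reward(True, False, True),
    combined_reward(False, False, True),
    combined_reward(False, False, False),
]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above their group's mean receive a positive advantage and the rest a negative one, so no separately learned value model is needed to form the training signal.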

arXiv information

Authors	Shang Hong Sim, Tej Deep Pala, Vernon Toh, Hai Leong Chieu, Amir Zadeh, Chuan Li, Navonil Majumder, Soujanya Poria
Published	2025-06-18 14:58:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
