RAIDEN-R1: Improving Role-awareness of LLMs via GRPO with Verifiable Reward

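Abstract

Role-playing conversational agents (RPCAs) face a persistent challenge in keeping their role consistent.
To address this, we propose RAIDEN-R1, a new reinforcement learning framework that integrates a Verifiable Role-Awareness Reward (VRAR).
The method introduces both singular and multi-term mining strategies, which produce quantifiable rewards by assessing role-specific keys.
In addition, we build a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration and run experiments aimed at improving reasoning coherence.
Experiments on the RAIDEN benchmark show that RAIDEN-R1 performs better than the baselines: the 14B-GRPO model reaches 88.04% accuracy on the Script-Based Knowledge metric and 88.65% on the Conversation Memory metric, outperforming baseline models while remaining robust.
Case analyses further show that the model is better at resolving conflicting contextual cues and at sustaining a consistent first-person narrative.
This work closes the non-quantifiability gap in RPCA training, offers insight into role-aware reasoning patterns, and advances the development of RPCAs.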
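
The abstract does not spell out how VRAR is computed, so the following is only a minimal sketch of the general idea: a reward that can be checked mechanically against role-specific keys, followed by the group-relative normalization that GRPO applies to rewards within one sampled group. The function names, the case-insensitive substring matching, the fraction-based multi-term score, and the use of the population standard deviation are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def single_term_reward(response: str, key: str) -> float:
    """Hypothetical singular check: 1.0 if the one key appears, else 0.0."""
    return 1.0 if key.lower() in response.lower() else 0.0


def multi_term_reward(response: str, keys: list[str]) -> float:
    """Hypothetical multi-term check: fraction of keys found in the response."""
    if not keys:
        return 0.0
    text = response.lower()
    return sum(k.lower() in text for k in keys) / len(keys)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style).

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up responses to one prompt; the key "Kyoto" is an invented example.
    group = [
        "I grew up in Kyoto, as I told you before.",
        "I have never mentioned where I am from.",
        "Kyoto is my hometown.",
        "I am from somewhere far away.",
    ]
    rewards = [single_term_reward(r, "Kyoto") for r in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward depends only on whether known keys appear in the response, it can be verified without a learned judge, which is the property the word "verifiable" points to. When every response in a group receives the same reward, all advantages in this sketch are zero and the group contributes no learning signal.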

Abstract (original)

Role-playing conversational agents (RPCAs) face persistent challenges in maintaining role consistency. To address this, we propose RAIDEN-R1, a novel reinforcement learning framework that integrates Verifiable Role-Awareness Reward (VRAR). The method introduces both singular and multi-term mining strategies to generate quantifiable rewards by assessing role-specific keys. Additionally, we construct a high-quality, role-aware Chain-of-Thought dataset through multi-LLM collaboration, and implement experiments to enhance reasoning coherence. Experiments on the RAIDEN benchmark demonstrate RAIDEN-R1’s superiority: our 14B-GRPO model achieves 88.04% and 88.65% accuracy on Script-Based Knowledge and Conversation Memory metrics, respectively, outperforming baseline models while maintaining robustness. Case analyses further reveal the model’s enhanced ability to resolve conflicting contextual cues and sustain first-person narrative consistency. This work bridges the non-quantifiability gap in RPCA training and provides insights into role-aware reasoning patterns, advancing the development of RPCAs.

arXiv information

Authors	Zongsheng Wang, Kaili Sun, Bowen Wu, Qun Yu, Ying Li, Baoxun Wang
Published	2025-05-15 12:22:10+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
