Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

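Summary

The rapid emergence of diverse large language models (LLMs) has driven the development of LLM routers, which assign each user query to the most suitable model.
However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., each query is assigned to a single model in isolation), which limits their ability to handle complex tasks that need the complementary strengths of multiple LLMs.
This paper presents Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process.
Router-R1 instantiates the router itself as a capable LLM and uses its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), integrating each response into its evolving context.
To guide learning, it uses a lightweight rule-based reward made up of format rewards, final outcome rewards, and a novel cost reward, which opens a path toward optimizing the performance-cost trade-off via RL.
Router-R1 conditions only on simple model descriptors such as pricing, latency, and example performance, which enables strong generalization to models unseen during training.
Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/router-r1.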
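
As an illustration of how a rule-based reward with these three components might be combined, here is a minimal sketch. The summary names only the components (format, final outcome, cost); the `<answer>` tag, the exact-match check, the linear cost normalization, the flat format penalty, and the weight `alpha` are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Optional

# Hypothetical output convention: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> Optional[str]:
    """Return the text inside the first <answer> tag, or None if it is missing."""
    match = ANSWER_RE.search(output)
    return match.group(1).strip() if match else None


def rule_based_reward(output: str, gold: str, cost: float,
                      max_cost: float, alpha: float = 0.5) -> float:
    """Combine format, outcome, and cost signals into one scalar (assumed form)."""
    if max_cost <= 0:
        raise ValueError("max_cost must be positive")
    answer = extract_answer(output)
    if answer is None:
        # Format reward (assumed): malformed outputs get a flat penalty.
        return -1.0
    # Outcome reward (assumed): case-insensitive exact match with the gold answer.
    outcome = 1.0 if answer.lower() == gold.strip().lower() else 0.0
    # Cost reward (assumed): cheaper trajectories score closer to 1.
    cost_score = 1.0 - min(max(cost / max_cost, 0.0), 1.0)
    # alpha trades accuracy against cost; alpha = 0 ignores cost entirely.
    return (1.0 - alpha) * outcome + alpha * cost_score
```

Under these assumptions, raising `alpha` pushes the policy toward cheaper models, while `alpha = 0` reduces the reward to format plus outcome only.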

Summary (original)

The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave ‘think’ actions (internal deliberation) with ‘route’ actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for performance and cost trade-off optimization, opening a pathway toward optimizing performance-cost tradeoffs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management. Code is available at https://github.com/ulab-uiuc/Router-R1.
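
The interleaving of "think" and "route" actions described in the abstract can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions: the tag names (`<think>`, `<route>`, `<info>`, `<answer>`), the `model_name: query` syntax, the `policy` and `models` callables, and the `max_rounds` limit are all hypothetical and do not come from the paper or its code.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical action syntax: <route>model_name: query</route> and <answer>...</answer>.
ROUTE_RE = re.compile(r"<route>\s*([^:<]+?)\s*:\s*(.*?)\s*</route>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_router(question: str,
               policy: Callable[[str], str],
               models: Dict[str, Callable[[str], str]],
               max_rounds: int = 4) -> Optional[str]:
    """Alternate router steps with model calls until an answer appears."""
    context = question
    for _ in range(max_rounds):
        # The router LLM (policy) emits its next step given the context so far.
        step = policy(context)
        context += "\n" + step
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip()
        route = ROUTE_RE.search(step)
        if route:
            name, query = route.group(1), route.group(2)
            # Call the chosen model and fold its reply back into the context.
            reply = models[name](query) if name in models else "unknown model"
            context += f"\n<info>{reply}</info>"
    return None  # No answer within the round budget.
```

Here `policy` stands in for the router LLM and each entry of `models` for a candidate LLM; in a real system both would be model calls rather than plain Python functions.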

arXiv information

Authors	Haozhen Zhang, Tao Feng, Jiaxuan You
Published	2025-06-10 17:56:45+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
