Multi-Domain Explainability of Preferences

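Summary

Preference mechanisms such as human preferences, LLM-as-a-Judge (LaaJ), and reward models are central to aligning and evaluating large language models (LLMs).
However, the underlying concepts that drive these preferences are not well understood.
In this work, we propose a fully automated, end-to-end method for generating local and global concept-based explanations of preferences across multiple domains.
Our method uses an LLM to discover concepts that distinguish chosen responses from rejected ones and represents them as concept-based vectors.
To model the relationship between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects.
To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms.
Our method achieves strong preference-prediction performance, outperforming baselines while remaining explainable.
In addition, we assess the explanations in two novel application-driven settings.
First, guiding LLM outputs with concepts taken from LaaJ explanations yields responses that those judges consistently prefer.
Second, prompting LaaJs with the concepts that explain human preferences improves their preference predictions.
Together, our work provides a new paradigm for explainability in the era of LLMs.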

Summary (original)

Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated end-to-end method for generating local and global concept-based explanations of preferences across multiple domains. Our method employs an LLM to discover concepts that differentiate between chosen and rejected responses and represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference prediction performance, outperforming baselines while also being explainable. Additionally, we assess explanations in two novel application-driven settings. First, guiding LLM outputs with concepts from LaaJ explanations yields responses that those judges consistently prefer. Second, prompting LaaJs with concepts explaining humans improves their preference predictions. Together, our work provides a new paradigm for explainability in the era of LLMs.
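
The abstract does not spell out the form of the Hierarchical Multi-Domain Regression model, so the following is only a minimal sketch of one way a white-box model could separate domain-general from domain-specific effects. It assumes a logistic regression whose weight for each concept is the sum of a shared term and a per-domain offset, with an L2 penalty on the offsets only. The function names (`fit`, `predict_proba`), the encoding of a response pair as a vector of -1/0/+1 concept scores, the hyperparameters, and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, domain, shared, specific):
    """Probability that the first response is preferred, for one concept vector."""
    # Effective weight of each concept = shared weight + offset for this domain.
    z = sum((s + u) * xj for s, u, xj in zip(shared, specific[domain], x))
    return sigmoid(z)


def fit(X, y, domains, n_domains, lr=0.5, epochs=200, l2_specific=0.1):
    """Full-batch gradient descent on the logistic loss (illustrative only)."""
    n, k = len(X), len(X[0])
    shared = [0.0] * k
    specific = [[0.0] * k for _ in range(n_domains)]
    for _ in range(epochs):
        g_shared = [0.0] * k
        g_specific = [[0.0] * k for _ in range(n_domains)]
        for x, label, d in zip(X, y, domains):
            err = predict_proba(x, d, shared, specific) - label
            for j in range(k):
                g_shared[j] += err * x[j] / n
                g_specific[d][j] += err * x[j] / n
        for j in range(k):
            shared[j] -= lr * g_shared[j]
            for d in range(n_domains):
                # Only the domain offsets are penalized, so an effect that holds
                # in every domain is pushed into the shared weights.
                specific[d][j] -= lr * (g_specific[d][j] + l2_specific * specific[d][j])
    return shared, specific


# Toy data (assumed): each vector scores two concepts for a response pair,
# +1 if the first response shows the concept more, -1 if the second does.
# Concept 0 helps in both domains; concept 1 helps in domain 0, hurts in domain 1.
X = [(1, 0), (-1, 0), (0, 1), (0, -1)] * 2
y = [1, 0, 1, 0, 1, 0, 0, 1]
domains = [0, 0, 0, 0, 1, 1, 1, 1]

shared, specific = fit(X, y, domains, n_domains=2)
print("shared weights:", shared)
print("domain offsets:", specific)
```

Because every coefficient is attached to a named concept, the shared weights can be read as a global explanation and the per-domain offsets as domain-specific corrections, which is the kind of interpretability a white-box model is meant to give.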

arXiv information

Authors	Nitay Calderon, Liat Ein-Dor, Roi Reichart
Published	2025-05-26 15:01:56+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
