Inverse Constitutional AI: Compressing Preferences into Principles

Summary

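Feedback data is widely used to fine-tune and evaluate state-of-the-art AI models.
Pairwise text preferences, where human or AI annotators select the "better" of two options, are particularly common.
Such preferences are used to train (reward) models or to rank models using aggregate statistics.
For many applications it is desirable to understand annotator preferences in addition to modelling them, not least because extensive prior work has shown various unintended biases in preference datasets.
Yet preference datasets remain challenging to interpret.
Neither black-box reward models nor statistics can answer why one text is preferred over another.
Manual interpretation of the numerous (long) response pairs is usually equally infeasible.
In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task.
In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models.
ICAI inverts this process: given a feedback dataset, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations.
We propose a corresponding ICAI algorithm and validate its generated constitutions quantitatively, based on annotation reconstruction accuracy, on several datasets:
(a) synthetic feedback data with known principles;
(b) AlpacaEval cross-annotated human feedback data;
(c) crowdsourced Chatbot Arena data; and
(d) PRISM data from diverse demographic groups.
As a short and interpretable representation of the original dataset, generated constitutions have many potential use cases: helping to identify undesirable annotator biases, understanding model performance better, scaling feedback to unseen data, or adapting models to individual user or group preferences.
We release the source code at https://github.com/rdnfn/icai.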

Summary (original)

Feedback data is widely used for fine-tuning and evaluating state-of-the-art AI models. Pairwise text preferences, where human or AI annotators select the ‘better’ of two options, are particularly common. Such preferences are used to train (reward) models or to rank models with aggregate statistics. For many applications it is desirable to understand annotator preferences in addition to modelling them – not least because extensive prior work has shown various unintended biases in preference datasets. Yet, preference datasets remain challenging to interpret. Neither black-box reward models nor statistics can answer why one text is preferred over another. Manual interpretation of the numerous (long) response pairs is usually equally infeasible. In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. ICAI inverts this process: given a feedback dataset, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding ICAI algorithm and validate its generated constitutions quantitatively based on annotation reconstruction accuracy on several datasets: (a) synthetic feedback data with known principles; (b) AlpacaEval cross-annotated human feedback data; (c) crowdsourced Chatbot Arena data; and (d) PRISM data from diverse demographic groups. As a short and interpretable representation of the original dataset, generated constitutions have many potential use cases: help identify undesirable annotator biases, understand model performance better, scale feedback to unseen data, or adapt models to individual user or group preferences. We release the source code at https://github.com/rdnfn/icai.
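
The validation metric named in the abstract, annotation reconstruction accuracy, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the pair format, and in particular `toy_annotator`, a rule-based stand-in for the LLM annotator that understands a single hard-coded principle. It is not the released ICAI implementation.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical pair format: (text_a, text_b, preferred), preferred is "a" or "b".
Pair = Tuple[str, str, str]
# Hypothetical annotator signature: (constitution, text_a, text_b) -> "a" or "b".
Annotator = Callable[[Sequence[str], str, str], str]


def reconstruction_accuracy(
    constitution: Sequence[str],
    pairs: Sequence[Pair],
    annotate: Annotator,
) -> float:
    """Fraction of pairs where the constitution-guided choice matches the original."""
    if not pairs:
        raise ValueError("need at least one annotated pair")
    matches = sum(
        annotate(constitution, text_a, text_b) == preferred
        for text_a, text_b, preferred in pairs
    )
    return matches / len(pairs)


def toy_annotator(constitution: Sequence[str], text_a: str, text_b: str) -> str:
    """Toy stand-in for an LLM annotator (assumption, not the paper's method).

    It only understands the principle 'prefer the shorter response'.
    """
    if "prefer the shorter response" in constitution:
        return "a" if len(text_a) <= len(text_b) else "b"
    return "a"  # no usable principle: always pick the first option


# Made-up example annotations in which the shorter response was always preferred.
pairs = [
    ("Yes.", "Yes, absolutely, without any doubt.", "a"),
    ("A long and winding answer.", "Short.", "b"),
    ("Fine.", "That is fine with me.", "a"),
    ("An overly detailed reply.", "OK.", "b"),
]

with_principle = reconstruction_accuracy(
    ["prefer the shorter response"], pairs, toy_annotator
)
without_principle = reconstruction_accuracy([], pairs, toy_annotator)
print(with_principle, without_principle)
```

On this toy data the one-principle constitution reconstructs every annotation, while the empty constitution matches only the pairs where the first option happened to be preferred. In that sense a good constitution acts as a compressed description of the dataset.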

arXiv information

Authors	Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, Robert Mullins
Published	2025-04-21 15:37:15+00:00

Source and services used

arxiv.jp, Google
