PRISM: A Methodology for Auditing Biases in Large Language Models

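Abstract

Auditing large language models (LLMs) to uncover their biases and preferences is an emerging challenge in building responsible artificial intelligence (AI).
Various methods have been proposed to elicit the preferences of such models, but LLM trainers have introduced countermeasures, so that LLMs hide, obfuscate, or flatly refuse to disclose their positions on certain subjects.
This paper introduces PRISM, a flexible, inquiry-based methodology for auditing LLMs that seeks to elicit such positions indirectly, through task-based inquiry prompting, rather than by asking about those preferences directly.
To demonstrate the utility of the methodology, we applied PRISM to the Political Compass Test and assessed the political leanings of 21 LLMs from seven providers.
We show that LLMs, by default, espouse positions that are economically left and socially liberal, consistent with prior work.
We also show the space of positions these models are willing to espouse: some models are more constrained and less compliant than others, while others are more neutral and objective.
In sum, PRISM makes it possible to probe and audit LLMs more reliably in order to understand their preferences, biases, and constraints.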

Abstract (original)

Auditing Large Language Models (LLMs) to discover their biases and preferences is an emerging challenge in creating Responsible Artificial Intelligence (AI). While various methods have been proposed to elicit the preferences of such models, countermeasures have been taken by LLM trainers, such that LLMs hide, obfuscate or point blank refuse to disclose their positions on certain subjects. This paper presents PRISM, a flexible, inquiry-based methodology for auditing LLMs – that seeks to elicit such positions indirectly through task-based inquiry prompting rather than direct inquiry of said preferences. To demonstrate the utility of the methodology, we applied PRISM on the Political Compass Test, where we assessed the political leanings of twenty-one LLMs from seven providers. We show LLMs, by default, espouse positions that are economically left and socially liberal (consistent with prior work). We also show the space of positions that these models are willing to espouse – where some models are more constrained and less compliant than others – while others are more neutral and objective. In sum, PRISM can more reliably probe and audit LLMs to understand their preferences, biases and constraints.

arXiv information

Authors	Leif Azzopardi, Yashar Moshfeghi
Published	2024-10-24 16:57:20+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
