Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability

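Summary

The increasing complexity of AI systems has made understanding their behavior critical.
Numerous interpretability methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components. These methods emerged from explainable AI, data-centric AI, and mechanistic interpretability, respectively.
However, these attribution methods are studied and applied largely independently, resulting in a fragmented landscape of methods and terminology.
This position paper argues that feature, data, and component attribution methods share fundamental similarities, and that a unified view of them benefits both interpretability and broader AI research.
To this end, we first analyze popular methods for these three types of attribution and present a unified view showing that these seemingly distinct methods apply similar techniques (such as perturbations, gradients, and linear approximations) to different aspects.
We then show how this unified view deepens understanding of existing attribution methods, highlights the concepts and evaluation criteria they share, and leads to new research directions: in interpretability, by addressing common challenges and facilitating cross-attribution innovation, and in AI more broadly, through applications in model editing, steering, and regulation.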

Summary (original)

The increasing complexity of AI systems has made understanding their behavior critical. Numerous interpretability methods have been developed to attribute model behavior to three key aspects: input features, training data, and internal model components, which emerged from explainable AI, data-centric AI, and mechanistic interpretability, respectively. However, these attribution methods are studied and applied rather independently, resulting in a fragmented landscape of methods and terminology. This position paper argues that feature, data, and component attribution methods share fundamental similarities, and a unified view of them benefits both interpretability and broader AI research. To this end, we first analyze popular methods for these three types of attributions and present a unified view demonstrating that these seemingly distinct methods employ similar techniques (such as perturbations, gradients, and linear approximations) over different aspects and thus differ primarily in their perspectives rather than techniques. Then, we demonstrate how this unified view enhances understanding of existing attribution methods, highlights shared concepts and evaluation criteria among these methods, and leads to new research directions both in interpretability research, by addressing common challenges and facilitating cross-attribution innovation, and in AI more broadly, with applications in model editing, steering, and regulation.
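As a rough illustration of the claim that one technique can be applied over different aspects, the sketch below uses the same perturbation idea twice: once over input features and once over training examples. It is not taken from the paper. The ridge-regression model, the zero baseline, and the function names `fit_ridge`, `feature_attribution`, and `data_attribution` are all assumptions chosen to keep the example small.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression; a stand-in for 'the model' (assumption)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def feature_attribution(w, x, baseline):
    """Perturb one input feature at a time and record the change in prediction."""
    full = x @ w
    scores = np.empty(x.size)
    for j in range(x.size):
        x_pert = x.copy()
        x_pert[j] = baseline[j]          # replace feature j with its baseline value
        scores[j] = full - x_pert @ w
    return scores

def data_attribution(X, y, x_test, lam=1e-3):
    """Perturb the training set: drop one example, retrain, record the change."""
    full = x_test @ fit_ridge(X, y, lam)
    scores = np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i    # leave training example i out
        scores[i] = full - x_test @ fit_ridge(X[keep], y[keep], lam)
    return scores

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=20)

w = fit_ridge(X, y)
x_test = np.array([1.0, 1.0, 1.0])

feat_scores = feature_attribution(w, x_test, baseline=np.zeros(3))
data_scores = data_attribution(X, y, x_test)
```

Both functions follow the same pattern (perturb, re-evaluate, take the difference) and differ only in what is perturbed. Component attribution could be sketched the same way by ablating internal units, which this linear model does not have.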

arXiv information

Authors	Shichang Zhang, Tessa Han, Usha Bhalla, Himabindu Lakkaraju
Published	2025-05-29 16:49:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
