A Reply to Makelov et al. (2023)’s ‘Interpretability Illusion’ Arguments

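Summary

We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods such as distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods can give rise to ‘interpretability illusions’. We first review the technical notion of an ‘interpretability illusion’ used by Makelov et al. (2023) and show that even intuitive and desirable explanations can count as illusions in this sense. As a result, their method for detecting ‘illusions’ can reject explanations that they themselves consider ‘non-illusory’. We then argue that the illusions Makelov et al. (2023) observe in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, although we disagree with their core characterization, the examples and discussion in Makelov et al. (2023) have undoubtedly pushed the field of interpretability forward.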

Summary (original)

We respond to the recent paper by Makelov et al. (2023), which reviews subspace interchange intervention methods like distributed alignment search (DAS; Geiger et al. 2023) and claims that these methods potentially cause ‘interpretability illusions’. We first review Makelov et al. (2023)’s technical notion of what an ‘interpretability illusion’ is, and then we show that even intuitive and desirable explanations can qualify as illusions in this sense. As a result, their method of discovering ‘illusions’ can reject explanations they consider ‘non-illusory’. We then argue that the illusions Makelov et al. (2023) see in practice are artifacts of their training and evaluation paradigms. We close by emphasizing that, though we disagree with their core characterization, Makelov et al. (2023)’s examples and discussion have undoubtedly pushed the field of interpretability forward.

arXiv information

Authors	Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, Noah D. Goodman
Published	2024-01-23 10:27:42+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
