Circuit Stability Characterizes Language Model Generalization

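Summary

Broadly evaluating the capabilities of (large) language models is difficult.
The rapid development of state-of-the-art models leads to benchmark saturation, while creating more challenging datasets is labor-intensive.
Inspired by recent developments in mechanistic interpretability, we introduce circuit stability as a new way to assess model performance.
Circuit stability refers to a model's ability to apply a consistent reasoning process, that is, its circuit, across various inputs.
We mathematically formalize circuit stability and circuit equivalence.
Then, through three case studies, we empirically show that circuit stability, and the lack thereof, can characterize and predict different aspects of generalization.
Our proposed methods offer a step toward rigorously relating the generality of models to their interpretability.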

Summary (original)

Extensively evaluating the capabilities of (large) language models is difficult. Rapid development of state-of-the-art models induces benchmark saturation, while creating more challenging datasets is labor-intensive. Inspired by the recent developments in mechanistic interpretability, we introduce circuit stability as a new way to assess model performance. Circuit stability refers to a model's ability to apply a consistent reasoning process (its circuit) across various inputs. We mathematically formalize circuit stability and circuit equivalence. Then, through three case studies, we empirically show that circuit stability and the lack thereof can characterize and predict different aspects of generalization. Our proposed methods offer a step towards rigorously relating the generality of models to their interpretability.

arXiv information

Author	Alan Sun
Published	2025-05-30 15:53:56+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
