Reasoning Models Don’t Always Say What They Think

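Summary

Chain-of-thought (CoT) offers a potential benefit for AI safety, because monitoring a model's CoT is a way to try to understand its intentions and reasoning process.
However, the effectiveness of such monitoring depends on CoTs faithfully representing the model's actual reasoning process.
We evaluate the CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal the use of a hint in at least 1% of the examples where the hint is used, but the reveal rate is often below 20%; (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating; and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor.
These results suggest that CoT monitoring is a promising way to notice undesired behaviors during training and evaluation, but that it is not sufficient to rule them out.
They also suggest that in settings like ours, where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.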

Summary (original)

Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model’s CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models’ actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
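The "reveal rate" in finding (1) can be read as a simple ratio: among the examples where the model uses the hint, the fraction whose CoT says so. The sketch below illustrates only that ratio. The `Example` record, its field names, and the `reveal_rate` function are hypothetical and are not taken from the paper, which may define hint usage and verbalization more carefully than two boolean flags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One evaluated prompt (hypothetical record layout, not from the paper)."""
    used_hint: bool        # assumed label: the model's answer relied on the hint
    verbalized_hint: bool  # assumed label: the CoT explicitly mentions the hint


def reveal_rate(examples: List[Example]) -> Optional[float]:
    """Fraction of hint-using examples whose CoT verbalizes the hint.

    Returns None when no example used the hint, since the rate is undefined.
    """
    used = [ex for ex in examples if ex.used_hint]
    if not used:
        return None
    return sum(ex.verbalized_hint for ex in used) / len(used)


# Toy data: four examples use the hint, one of which verbalizes it.
examples = [
    Example(used_hint=True, verbalized_hint=True),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=False, verbalized_hint=False),
]
print(reveal_rate(examples))  # 1 of the 4 hint-using examples reveals the hint
```

Examples that do not use the hint are excluded from the denominator, so a model that rarely uses hints but always mentions them would still score high under this reading.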

arXiv information

Authors	Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez
Published	2025-05-08 16:51:43+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
