A Stochastic Dynamical Theory of LLM Self-Adversariality: Modeling Severity Drift as a Critical Process

Abstract

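This paper presents a continuous-time stochastic dynamical framework for understanding how large language models (LLMs) can self-amplify latent bias or toxicity through their own chain-of-thought reasoning.
The model assumes an instantaneous "severity" variable $x(t) \in [0,1]$ that evolves under a stochastic differential equation (SDE) with drift term $\mu(x)$ and diffusion $\sigma(x)$.
Crucially, such a process can be analyzed consistently via the Fokker-Planck approach, provided each incremental step behaves nearly Markovian in severity space.
The analysis examines critical phenomena and shows that certain parameter regimes produce a phase transition from subcritical (self-correcting) to supercritical (runaway severity) behavior.
The paper derives stationary distributions, first-passage times to harmful thresholds, and scaling laws near critical points.
Finally, it highlights implications for agents and extended LLM reasoning models: in principle, these equations could serve as a basis for formally verifying whether a model remains stable or propagates bias over repeated inferences.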

Abstract (original)

This paper introduces a continuous-time stochastic dynamical framework for understanding how large language models (LLMs) may self-amplify latent biases or toxicity through their own chain-of-thought reasoning. The model posits an instantaneous ‘severity’ variable $x(t) \in [0,1]$ evolving under a stochastic differential equation (SDE) with a drift term $\mu(x)$ and diffusion $\sigma(x)$. Crucially, such a process can be consistently analyzed via the Fokker–Planck approach if each incremental step behaves nearly Markovian in severity space. The analysis investigates critical phenomena, showing that certain parameter regimes create phase transitions from subcritical (self-correcting) to supercritical (runaway severity). The paper derives stationary distributions, first-passage times to harmful thresholds, and scaling laws near critical points. Finally, it highlights implications for agents and extended LLM reasoning models: in principle, these equations might serve as a basis for formal verification of whether a model remains stable or propagates bias over repeated inferences.
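As a rough illustration of the setup described in the abstract, the sketch below simulates $dx = \mu(x)\,dt + \sigma(x)\,dW$ with the Euler-Maruyama scheme and estimates how often a path reaches a harmful threshold. The abstract does not give the paper's actual $\mu(x)$ and $\sigma(x)$, so the forms used here (`gain * x * (1 - x)` and `noise * x * (1 - x)`), the Itô reading of the SDE, and the parameter names `gain`, `noise`, and `threshold` are all assumptions made for this example only.

```python
import math
import random


def mu(x, gain):
    """Hypothetical drift (not from the paper): gain > 0 pushes severity up,
    gain < 0 pulls it back toward 0."""
    return gain * x * (1.0 - x)


def sigma(x, noise):
    """Hypothetical diffusion (not from the paper) that vanishes at 0 and 1."""
    return noise * x * (1.0 - x)


def simulate(x0, gain, noise, dt=0.01, steps=2000, threshold=0.9, rng=None):
    """Euler-Maruyama path of dx = mu(x) dt + sigma(x) dW.

    Returns (x, hit_time). hit_time is the first time the path reaches
    `threshold`, or None if it never does within `steps` steps.
    """
    if rng is None:
        rng = random.Random(0)
    x = x0
    for k in range(steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        x += mu(x, gain) * dt + sigma(x, noise) * dw
        x = min(1.0, max(0.0, x))  # keep the discretized path inside [0, 1]
        if x >= threshold:
            return x, (k + 1) * dt
    return x, None


def hit_fraction(gain, noise=0.3, paths=200, seed=1):
    """Fraction of simulated paths, started at x0 = 0.2, that hit the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(paths):
        _, hit_time = simulate(0.2, gain, noise, rng=rng)
        if hit_time is not None:
            hits += 1
    return hits / paths


if __name__ == "__main__":
    for gain in (-1.0, 1.0):
        print(f"gain={gain:+.1f}  fraction reaching threshold: {hit_fraction(gain)}")
```

In this toy parameterization, a negative `gain` plays the role of the subcritical (self-correcting) regime and a positive `gain` the supercritical (runaway) one; comparing `hit_fraction(-1.0)` with `hit_fraction(1.0)` gives a crude Monte Carlo view of the first-passage behavior that the paper treats analytically.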

arXiv information

Author	Jack David Carson
Published	2025-01-28 08:08:25+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
