Can adversarial attacks by large language models be attributed?

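Summary

Attributing outputs from large language models (LLMs) in adversarial settings such as cyberattacks and disinformation poses significant challenges that are likely to grow in importance.
We investigate this attribution problem using formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin.
By modeling LLM outputs as formal languages, we analyze whether finite text samples can uniquely identify the originating model.
Our results show that, because certain language classes are non-identifiable, under some mild assumptions about overlapping outputs from fine-tuned models it is theoretically impossible to attribute outputs to a specific LLM with certainty.
This holds even when the expressivity limitations of Transformer architectures are taken into account.
Even with direct model access or comprehensive monitoring, significant computational hurdles impede attribution efforts.
These findings highlight the urgent need for proactive measures to mitigate the risks posed by adversarial LLM use as its influence continues to expand.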

Summary (original)

Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation, presents significant challenges that are likely to grow in importance. We investigate this attribution problem using formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin. By modeling LLM outputs as formal languages, we analyze whether finite text samples can uniquely pinpoint the originating model. Our results show that due to the non-identifiability of certain language classes, under some mild assumptions about overlapping outputs from fine-tuned models it is theoretically impossible to attribute outputs to specific LLMs with certainty. This also holds when accounting for expressivity limitations of Transformer architectures. Even with direct model access or comprehensive monitoring, significant computational hurdles impede attribution efforts. These findings highlight an urgent need for proactive measures to mitigate risks posed by adversarial LLM use as their influence continues to expand.
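
The non-identifiability idea in the abstract can be illustrated with a toy sketch. This is not the authors' construction: the model names, the tiny output sets, and the helper `consistent_models` are all hypothetical, and real LLM output languages are far larger. The sketch only assumes that each model is represented by the set of strings it can emit and that fine-tuned models overlap with their base model.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical toy "models": each is represented by the set of strings it can emit.
# Assumption for illustration: fine-tuned variants share outputs with the base model.
MODELS: Dict[str, FrozenSet[str]] = {
    "base": frozenset({"a", "ab", "abb"}),
    "finetuned_1": frozenset({"a", "ab", "abb", "abbb"}),  # superset of "base"
    "finetuned_2": frozenset({"a", "ab", "ba"}),
}


def consistent_models(
    sample: Iterable[str],
    models: Dict[str, FrozenSet[str]] = MODELS,
) -> List[str]:
    """Return, in sorted order, every model whose language contains the whole sample."""
    observed = set(sample)
    return sorted(name for name, language in models.items() if observed <= language)


if __name__ == "__main__":
    # A sample drawn from the overlap is consistent with all three candidates.
    print(consistent_models(["a", "ab"]))
    # Only a string outside the overlap narrows the candidates down.
    print(consistent_models(["abbb"]))
```

In this toy setting, every sample that "base" can produce is also consistent with "finetuned_1", so observing outputs alone never rules out the larger model. That is the flavor of obstacle the abstract describes for learning from positive examples only.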

arXiv information

Authors	Manuel Cebrian, Jan Arne Telle
Published	2024-11-12 18:28:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
