Controlling Language and Diffusion Models by Transporting Activations

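Summary

As large generative models grow more capable and are deployed ever more widely, concerns have arisen about their reliability, safety, and potential for misuse.
To address these issues, recent work has proposed controlling generation by steering a model's activations, so that concepts or behaviors can be effectively induced in, or kept out of, the generated output.
This paper introduces Activation Transport (AcT), a general framework for steering activations that is guided by optimal transport theory and generalizes many previous activation-steering methods.
AcT is modality-agnostic and gives fine-grained control over model behavior with negligible computational overhead, while minimally affecting the model's abilities.
The authors demonstrate the effectiveness and versatility of the approach experimentally by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is).
For LLMs, they show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase truthfulness.
For T2Is, they show how AcT enables fine-grained style control and concept negation.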

Summary (original)

The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.
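
The abstract does not say how the transport is computed, so the following is only a rough sketch of one way a transport-based steering map could look, not the authors' implementation. It assumes each activation feature is handled independently, that the one-dimensional optimal transport coupling is obtained by pairing sorted source and target samples, and that an affine map is fitted to those pairs. The names `fit_affine_transport`, `steer`, and `strength` are hypothetical and chosen for illustration.

```python
import numpy as np


def fit_affine_transport(source, target):
    """Fit a per-feature affine map T(a) = w * a + b (illustrative assumption).

    source, target: arrays of shape (n_samples, n_features) holding activations
    collected on "source" and "target" prompts. For 1-D distributions with equal
    sample counts, optimal transport pairs the i-th smallest source value with
    the i-th smallest target value, so each column is sorted independently.
    """
    src = np.sort(np.asarray(source, dtype=float), axis=0)
    tgt = np.sort(np.asarray(target, dtype=float), axis=0)
    if src.shape != tgt.shape:
        raise ValueError("source and target must have the same shape")

    src_mean = src.mean(axis=0)
    tgt_mean = tgt.mean(axis=0)
    src_centered = src - src_mean

    # Closed-form least-squares slope and intercept for each feature.
    var = (src_centered ** 2).sum(axis=0)
    cov = (src_centered * (tgt - tgt_mean)).sum(axis=0)
    w = cov / np.maximum(var, 1e-12)
    b = tgt_mean - w * src_mean
    return w, b


def steer(activations, w, b, strength=1.0):
    """Move activations toward the target distribution.

    strength = 0 leaves activations unchanged; strength = 1 applies the full
    fitted map; values in between interpolate linearly.
    """
    a = np.asarray(activations, dtype=float)
    return a + strength * ((w * a + b) - a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(500, 4))
    target = 2.0 * rng.normal(size=(500, 4)) + 3.0
    w, b = fit_affine_transport(source, target)
    steered = steer(source, w, b, strength=0.5)
    print(steered.mean(axis=0))
```

In a real model, a map like this would be applied to a layer's activations at inference time, for example through a forward hook. The interpolation parameter is one simple way to expose the fine-grained control the abstract mentions.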

arXiv information

Authors	Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau
Published	2024-11-22 16:04:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
