Concept Bottleneck Language Models For protein design

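Summary

We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer in which each neuron corresponds to an interpretable concept.
Our architecture offers three key benefits. i) Control: we can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in the desired concept values compared to baselines.
ii) Interpretability: a linear mapping between concept values and predicted tokens allows transparent analysis of the model's decision-making process.
iii) Debugging: this transparency makes trained models easy to debug.
Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance.
While the approach is adaptable to any language model, we focus on masked protein language models because of their importance in drug discovery and because the model's capabilities can be validated through real-world experiments and expert knowledge.
We scale CB-pLM from 24 million to 3 billion parameters, making these the largest Concept Bottleneck Models trained and the first capable of generative language modeling.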

Summary (original)

We introduce Concept Bottleneck Protein Language Models (CB-pLM), a generative masked language model with a layer where each neuron corresponds to an interpretable concept. Our architecture offers three key benefits: i) Control: We can intervene on concept values to precisely control the properties of generated proteins, achieving a 3 times larger change in desired concept values compared to baselines. ii) Interpretability: A linear mapping between concept values and predicted tokens allows transparent analysis of the model’s decision-making process. iii) Debugging: This transparency facilitates easy debugging of trained models. Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. While adaptable to any language model, we focus on masked protein language models due to their importance in drug discovery and the ability to validate our model’s capabilities through real-world experiments and expert knowledge. We scale our CB-pLM from 24 million to 3 billion parameters, making them the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
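The abstract's claims about control and interpretability both rest on the linear mapping from concept values to predicted tokens. The following is a toy sketch of that idea only, not the authors' implementation: the function names, the three-token vocabulary, the two concepts, and every weight are hypothetical values chosen for illustration. It shows that when logits are a linear function of concept values, overwriting one concept shifts each token's logit by exactly that concept's weight times the change, and each concept's contribution to a logit can be read off directly.

```python
# Toy illustration only: hypothetical sizes and weights, not the CB-pLM code.

def linear_logits(weights, bias, concepts):
    """Token logits as a linear function of concept values: W @ c + b."""
    return [
        sum(w * c for w, c in zip(row, concepts)) + b
        for row, b in zip(weights, bias)
    ]


def intervene(concepts, index, new_value):
    """Return a copy of the concept vector with one concept overwritten."""
    edited = list(concepts)
    edited[index] = new_value
    return edited


def contributions(weights, concepts, token):
    """Per-concept contribution to one token's logit (bias excluded)."""
    return [w * c for w, c in zip(weights[token], concepts)]


# Hypothetical setup: 3 tokens (rows) and 2 concepts (columns).
WEIGHTS = [[1.0, -0.5], [0.0, 2.0], [-1.0, 0.5]]
BIAS = [0.0, 0.1, -0.1]
concepts = [0.2, 0.4]

before = linear_logits(WEIGHTS, BIAS, concepts)
# Intervention: raise concept 1 from 0.4 to 1.0 and recompute the logits.
after = linear_logits(WEIGHTS, BIAS, intervene(concepts, 1, 1.0))
# Because the map is linear, the shift equals WEIGHTS[t][1] * (1.0 - 0.4).
shift = [a - b for a, b in zip(after, before)]

if __name__ == "__main__":
    print("logit shift per token:", shift)
    print("contributions to token 1:", contributions(WEIGHTS, concepts, 1))
```

In this toy setting the token with the largest weight on the edited concept gains the most logit mass, which is the sense in which a linear head makes an intervention's effect predictable in advance.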

arXiv information

Authors	Aya Abdelsalam Ismail,Tuomas Oikarinen,Amy Wang,Julius Adebayo,Samuel Stanton,Taylor Joren,Joseph Kleinhenz,Allen Goodman,Héctor Corrada Bravo,Kyunghyun Cho,Nathan C. Frey
Published	2024-12-11 18:38:41+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
