Diffusion Language Models Are Versatile Protein Learners

Summary

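This paper introduces the diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences.
First, we pre-train scalable DPLMs on evolutionary-scale protein sequences within a generative, self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way.
After pre-training, DPLM can unconditionally generate structurally plausible, novel, and diverse protein sequences.
We further demonstrate that the proposed diffusion generative pre-training gives DPLM a better understanding of proteins, making it a superior representation learner that can be fine-tuned for various predictive tasks and compares favorably to ESM2 (Lin et al., 2022).
Moreover, DPLM can be tailored to various needs and shows strong conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with a high success rate;
(2) incorporating other modalities as the conditioner, e.g., structure-conditioned generation for inverse folding; and
(3) steering sequence generation toward desired properties, e.g., satisfying specified secondary structures, through plug-and-play classifier guidance.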
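
To make the idea of discrete diffusion over sequences more concrete, here is a minimal toy sketch in Python. It assumes an absorbing-state (masking) noise process, which is one common form of discrete diffusion; the summary does not say which form DPLM uses, so this is an illustration rather than the authors' method. The names `MASK`, `mask_sequence`, `denoise`, and `uniform_predictor` are hypothetical, and the predictor is a random stand-in for a trained network.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol, chosen only for this sketch


def mask_sequence(seq, t, rng):
    """Forward (noising) step: mask each residue independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return "".join(MASK if rng.random() < t else aa for aa in seq)


def uniform_predictor(current, position, rng):
    """Stand-in for a trained model: ignores context and samples a residue uniformly."""
    return rng.choice(AMINO_ACIDS)


def denoise(seq, steps, rng, predictor=uniform_predictor):
    """Reverse process sketch: fill in masked positions over at most `steps` rounds.

    Positions that are already unmasked are never changed, so a partially
    masked input acts as a condition on the generated residues.
    """
    current = list(seq)
    for step in range(steps, 0, -1):
        masked = [i for i, ch in enumerate(current) if ch == MASK]
        if not masked:
            break
        # Reveal about 1/step of the remaining masks; the final round reveals all.
        n_reveal = max(1, len(masked) // step)
        for i in rng.sample(masked, n_reveal):
            current[i] = predictor(current, i, rng)
    return "".join(current)


rng = random.Random(0)
noisy = mask_sequence("MKTAYIAKQR", 0.5, rng)   # partially masked sequence
restored = denoise(noisy, steps=4, rng=rng)      # masks replaced, fixed residues kept
print(noisy, restored)
```

In this toy, starting `denoise` from a fully masked string corresponds to unconditional generation, while starting from a sequence with some residues left in place loosely mirrors conditioning on a partial peptide sequence as in item (1).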

Summary (original)

This paper introduces diffusion protein language model (DPLM), a versatile protein language model that demonstrates strong generative and predictive capabilities for protein sequences. We first pre-train scalable DPLMs from evolutionary-scale protein sequences within a generative self-supervised discrete diffusion probabilistic framework, which generalizes language modeling for proteins in a principled way. After pre-training, DPLM exhibits the ability to generate structurally plausible, novel, and diverse protein sequences for unconditional generation. We further demonstrate the proposed diffusion generative pre-training makes DPLM possess a better understanding of proteins, making it a superior representation learner, which can be fine-tuned for various predictive tasks, comparing favorably to ESM2 (Lin et al., 2022). Moreover, DPLM can be tailored for various needs, which showcases its prowess of conditional generation in several ways: (1) conditioning on partial peptide sequences, e.g., generating scaffolds for functional motifs with high success rate; (2) incorporating other modalities as conditioner, e.g., structure-conditioned generation for inverse folding; and (3) steering sequence generation towards desired properties, e.g., satisfying specified secondary structures, through a plug-and-play classifier guidance.

arXiv information

Authors	Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, Quanquan Gu
Published	2024-02-28 18:57:56+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
