Enhancing DNA Foundation Models to Address Masking Inefficiencies

Summary

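Masked language modelling (MLM) is widely used as a pretraining objective in genomic sequence modelling.
Pretrained models can serve as encoders for a variety of downstream tasks, but a distribution shift between pretraining and inference hurts performance: the pretraining task maps [MASK] tokens to predictions, whereas no [MASK] token appears in downstream applications.
As a result, the encoder does not prioritize its encodings of non-[MASK] tokens, and it spends parameters and compute on work that matters only for the MLM task and is irrelevant at deployment time.
This work proposes a modified encoder-decoder architecture, based on the masked autoencoder framework, that is designed to address this inefficiency within a BERT-based transformer.
The authors show empirically that the resulting mismatch is particularly detrimental in genomic pipelines, where models are often used for feature extraction without fine-tuning.
The approach is evaluated on the BIOSCAN-5M dataset, which comprises over 2 million unique DNA barcodes.
It achieves substantial performance gains on both closed-world and open-world classification tasks compared with causal models and bidirectional architectures pretrained with MLM.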

Abstract (original)

Masked language modelling (MLM) as a pretraining objective has been widely adopted in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference detrimentally impacts performance, as the pretraining task is to map [MASK] tokens to predictions, yet the [MASK] is absent during downstream applications. This means the encoder does not prioritize its encodings of non-[MASK] tokens, and expends parameters and compute on work only relevant to the MLM task, despite this being irrelevant at deployment time. In this work, we propose a modified encoder-decoder architecture based on the masked autoencoder framework, designed to address this inefficiency within a BERT-based transformer. We empirically show that the resulting mismatch is particularly detrimental in genomic pipelines where models are often used for feature extraction without fine-tuning. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes. We achieve substantial performance gains in both closed-world and open-world classification tasks when compared against causal models and bidirectional architectures pretrained with MLM tasks.
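
The pretraining/inference mismatch described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the non-overlapping 4-mer tokenizer, the 50% mask ratio, the example sequence, and the helper names `kmer_tokenize` and `mask_tokens` are all assumptions chosen for illustration. The sketch only shows that the encoder input contains [MASK] tokens during MLM pretraining and none when the model is used as a feature extractor.

```python
import random

MASK = "[MASK]"


def kmer_tokenize(seq, k=4):
    """Split a DNA string into non-overlapping k-mers (hypothetical tokenizer).

    Any trailing remainder shorter than k is dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_ratio=0.5, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked token list and a dict mapping each masked
    position to the original token the model would be asked to predict.
    The mask ratio is an assumed value, not one taken from the paper.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets


# Made-up 20-base sequence, used only as an example input.
seq = "ACGTTGCAAGCTTAGGCCTA"
tokens = kmer_tokenize(seq)

# Pretraining: the encoder sees [MASK] tokens and predicts the originals.
pretrain_input, targets = mask_tokens(tokens)

# Downstream feature extraction: the same encoder sees the full sequence,
# so no [MASK] token ever appears in its input.
inference_input = tokens

print("pretraining input:", pretrain_input)
print("prediction targets:", targets)
print("inference input:  ", inference_input)
```

Under these assumptions, every position the MLM loss is computed on holds a token type that never occurs at deployment, which is the distribution shift the abstract refers to.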

arXiv information

Authors	Monireh Safari, Pablo Millan Arias, Scott C. Lowe, Lila Kari, Angel X. Chang, Graham W. Taylor
Published	2025-02-25 17:56:25+00:00
arXiv page	arxiv_id(pdf)
