Improving large language models with concept-aware fine-tuning

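Summary

Large language models (LLMs) have become the foundation of modern AI.
However, the prevailing next-token prediction paradigm fundamentally limits their ability to form coherent, high-level concepts, which makes it a major barrier to human-like understanding and reasoning.
Take the phrase ‘ribonucleic acid’ as an example: an LLM first breaks it into tokens, that is, artificial text fragments (‘rib’, ‘on’, …), and then learns each token in sequence instead of grasping the phrase as a single, coherent semantic entity.
This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems.
In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned.
By enabling the model to learn sequences that span multiple tokens, the method fosters stronger concept-aware learning.
Our experiments show significant improvements over conventional next-token fine-tuning across diverse tasks, including traditional applications such as text summarization and domain-specific ones such as de novo protein design.
Multi-token prediction was previously possible only in the prohibitively expensive pretraining phase.
To our knowledge, CAFT is the first method to bring the multi-token setting to the post-training phase, effectively democratizing its benefits for the broader community of practitioners and researchers.
Finally, the unexpected effectiveness of the proposed method suggests wider implications for the machine learning research community.
All code and data are available at https://github.com/michaelchen-lab/caft-llm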
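
The contrast between next-token and multi-token targets can be illustrated with a small sketch. This is not the CAFT implementation, and the abstract does not describe how the training loss is computed; the function name `multi_token_targets` and the window size `k` are assumptions made for illustration. The sketch only shows how each position in a token sequence can be paired with one future token (`k=1`) or with a span of several future tokens (`k>1`).

```python
from typing import List, Sequence, Tuple


def multi_token_targets(tokens: Sequence[str], k: int) -> List[Tuple[str, List[str]]]:
    """Pair each position with the next k tokens as prediction targets.

    k=1 reproduces ordinary next-token targets; k>1 gives each position
    a window spanning several future tokens. Windows near the end of the
    sequence are shorter because fewer tokens remain.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    pairs = []
    for t in range(len(tokens) - 1):
        # Targets for position t are the tokens at t+1 .. t+k (clipped at the end).
        pairs.append((tokens[t], list(tokens[t + 1 : t + 1 + k])))
    return pairs


# 'rib' and 'on' come from the abstract; the remaining fragments are
# elided there, so hypothetical placeholders stand in for them.
tokens = ["rib", "on", "<t3>", "<t4>"]

next_token = multi_token_targets(tokens, k=1)   # one target per position
multi_token = multi_token_targets(tokens, k=3)  # up to three targets per position

for current, targets in multi_token:
    print(current, "->", targets)
```

With `k=1` every position sees only its immediate successor, which is the fragmented view the abstract criticizes. With a larger `k`, the first position is paired with the rest of the phrase at once, which is the intuition behind learning sequences that span multiple tokens.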

Summary (original)

Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase ‘ribonucleic acid’ as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments (‘rib’, ‘on’, …), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm

arXiv information

Authors	Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao
Published	2025-06-09 14:55:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
