HIGHT: Hierarchical Graph Tokenization for Molecule-Language Alignment

Summary

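Recently, interest has grown in extending the success of large language models (LLMs) from text to molecules.
Most existing approaches use a graph neural network to represent a molecule as a sequence of node tokens for molecule-language alignment, but they overlook the inherent hierarchical structure of molecules.
In particular, higher-order molecular structures carry the rich semantics of functional groups, which encode the key biochemical functions of a molecule.
We show that ignoring hierarchical information during tokenization leads to subpar molecule-language alignment and severe hallucination.
To address this limitation, we propose HIerarchical GrapH Tokenization (HIGHT).
HIGHT uses a hierarchical graph tokenizer that encodes a hierarchy of informative tokens at the atom, motif, and molecule levels to improve the molecular perception of LLMs.
HIGHT also uses an augmented instruction-tuning dataset, enriched with hierarchical graph information, to further strengthen molecule-language alignment.
Extensive experiments on 14 real-world benchmarks verify that HIGHT reduces hallucination by 40% and yields significant improvements on a range of molecule-language downstream tasks.
The project is available at https://higraphllm.github.io/.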

Summary (original)

Recently, there has been a surge of interest in extending the success of large language models (LLMs) from texts to molecules. Most existing approaches adopt a graph neural network to represent a molecule as a series of node tokens for molecule-language alignment, which, however, have overlooked the inherent hierarchical structures in molecules. Notably, higher-order molecular structures contain rich semantics of functional groups, which encode crucial biochemical functionalities of the molecules. We show that neglecting the hierarchical information in tokenization will lead to subpar molecule-language alignment and severe hallucination. To address this limitation, we propose HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that encodes the hierarchy of atom, motif, and molecular levels of informative tokens to improve the molecular perception of LLMs. HIGHT also adopts an augmented instruction tuning dataset, enriched with the hierarchical graph information, to further enhance the molecule-language alignment. Extensive experiments on 14 real-world benchmarks verify the effectiveness of HIGHT in reducing hallucination by 40%, and significant improvements in various molecule-language downstream tasks. The project is available at https://higraphllm.github.io/.
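
To make the idea of atom, motif, and molecule levels concrete, here is a minimal sketch of what a three-level token list could look like. It is not the authors' implementation: the `Token` class, the `hierarchical_tokens` function, and the caller-supplied `motifs` mapping are all hypothetical names introduced for illustration, and the sketch produces plain symbolic tokens rather than learned embeddings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    """One token in the hierarchy (hypothetical structure, for illustration)."""
    level: str                # "atom", "motif", or "molecule"
    label: str                # element symbol, motif name, or "mol"
    members: Tuple[int, ...]  # indices of the atoms this token covers


def hierarchical_tokens(atoms: List[str], motifs: Dict[str, List[int]]) -> List[Token]:
    """Build atom-level, then motif-level, then one molecule-level token.

    `motifs` maps a motif name to the atom indices it covers. How motifs are
    detected is outside this sketch; the caller is assumed to supply them.
    """
    # Level 1: one token per atom.
    tokens = [Token("atom", symbol, (i,)) for i, symbol in enumerate(atoms)]

    # Level 2: one token per motif (for example, a functional group).
    for name, indices in motifs.items():
        if any(i < 0 or i >= len(atoms) for i in indices):
            raise ValueError(f"motif {name!r} refers to an atom index out of range")
        tokens.append(Token("motif", name, tuple(sorted(indices))))

    # Level 3: a single token covering the whole molecule.
    tokens.append(Token("molecule", "mol", tuple(range(len(atoms)))))
    return tokens


# Ethanol heavy atoms C-C-O, with a hydroxyl motif on the oxygen (index 2).
ethanol = hierarchical_tokens(["C", "C", "O"], {"hydroxyl": [2]})
for token in ethanol:
    print(token.level, token.label, token.members)
```

A node-only tokenization would stop after the first level; the motif and molecule tokens are what carry the functional-group information that the abstract says is otherwise lost.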

arXiv information

Authors	Yongqiang Chen, Quanming Yao, Juzheng Zhang, James Cheng, Yatao Bian
Published	2025-06-06 13:09:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
