Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching

Abstract

Traditional similarity-based schema matching methods are incapable of resolving semantic ambiguities and conflicts in domain-specific complex mapping scenarios due to missing commonsense and domain-specific knowledge. The hallucination problem of large language models (LLMs) also makes it challenging for LLM-based schema matching to address the above issues. Therefore, we propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema Matching, referred to as the KG-RAG4SM. In particular, KG-RAG4SM introduces novel vector-based, graph traversal-based, and query-based graph retrievals, as well as a hybrid approach and ranking schemes that identify the most relevant subgraphs from external large knowledge graphs (KGs). We showcase that KG-based retrieval-augmented LLMs are capable of generating more accurate results for complex matching cases without any re-training. Our experimental results show that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g., Jellyfish-8B) by 35.89% and 30.50% in terms of precision and F1 score on the MIMIC dataset, respectively; KG-RAG4SM with GPT-4o-mini outperforms the pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and 21.97% in terms of precision and F1 score on the Synthea dataset, respectively. The results also demonstrate that our approach is more efficient in end-to-end schema matching, and scales to retrieve from large KGs. Our case studies on the dataset from the real-world schema matching scenario exhibit that the hallucination problem of LLMs for schema matching is well mitigated by our solution.
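
The abstract names vector-based graph retrieval and a ranking scheme but gives no implementation details. As a rough illustration of the general idea only, and not the authors' method, the sketch below ranks the triples of a tiny knowledge graph by cosine similarity to a pair of schema attributes and places the top-ranked facts in an LLM prompt. Everything in it is an assumption made for illustration: the toy triples, the bag-of-words `embed` function (a real system would use a learned text encoder), the helper names `retrieve` and `build_prompt`, and the prompt wording.

```python
import math
from collections import Counter

# Hypothetical toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("admission time", "is a", "point in time"),
    ("hospital admission", "has property", "admission time"),
    ("date of birth", "is a", "point in time"),
    ("patient", "has property", "date of birth"),
    ("drug", "subclass of", "chemical substance"),
]


def embed(text):
    """Toy bag-of-words vector; stands in for a learned text encoder."""
    return Counter(text.lower().replace("_", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, triples, k=2):
    """Return up to k triples ranked by similarity to the question."""
    q = embed(question)
    scored = [(cosine(q, embed(" ".join(t))), t) for t in triples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop triples that share no vocabulary with the question.
    return [t for score, t in scored[:k] if score > 0]


def build_prompt(source_attr, target_attr, triples, k=2):
    """Assemble a matching question preceded by the retrieved facts."""
    facts = retrieve(f"{source_attr} {target_attr}", triples, k)
    lines = ["Background facts:"]
    lines += [f"- {s} {p} {o}" for s, p, o in facts]
    lines.append(
        f"Do '{source_attr}' and '{target_attr}' refer to the same concept? "
        "Answer yes or no."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("admission_time", "visit_start_time", TRIPLES))
```

The retrieved facts are supplied to the model at inference time, which is consistent with the abstract's point that no re-training is needed. The graph traversal-based and query-based retrievals mentioned in the abstract are not covered by this sketch.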

arXiv information

Authors	Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, Bálint Molnár
Published	2025-01-15 09:32:37+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
