Schema Matching with Large Language Models: an Experimental Study

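Summary

Large language models (LLMs) have proven useful in a variety of tasks, including data wrangling.
This paper investigates how an off-the-shelf LLM can be used for schema matching.
Our aim is to identify semantic correspondences between the elements of two relational schemas using only their names and descriptions.
Using a newly created benchmark from the health domain, we propose several so-called task scopes.
These are methods of prompting the LLM to perform schema matching, and they differ in the amount of context information included in the prompt.
With these task scopes, we compare LLM-based schema matching against a string similarity baseline and examine matching quality, verification effort, decisiveness, and the complementarity of the approaches.
We find that matching quality degrades not only when context information is lacking but also when too much of it is provided.
In general, newer LLM versions are more decisive.
We identify task scopes whose verification effort is acceptable and that still find a significant number of true semantic matches.
Our study shows that LLMs have the potential to bootstrap the schema matching process and can help data engineers speed up this task based solely on the names and descriptions of schema elements, without requiring data instances.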

Abstract (original)

Large Language Models (LLMs) have shown useful applications in a variety of tasks, including data wrangling. In this paper, we investigate the use of an off-the-shelf LLM for schema matching. Our objective is to identify semantic correspondences between elements of two relational schemas using only names and descriptions. Using a newly created benchmark from the health domain, we propose different so-called task scopes. These are methods for prompting the LLM to do schema matching, which vary in the amount of context information contained in the prompt. Using these task scopes we compare LLM-based schema matching against a string similarity baseline, investigating matching quality, verification effort, decisiveness, and complementarity of the approaches. We find that matching quality suffers from a lack of context information, but also from providing too much context information. In general, using newer LLM versions increases decisiveness. We identify task scopes that have acceptable verification effort and succeed in identifying a significant number of true semantic matches. Our study shows that LLMs have potential in bootstrapping the schema matching process and are able to assist data engineers in speeding up this task solely based on schema element names and descriptions without the need for data instances.
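The string similarity baseline mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' baseline: the similarity measure (Python's `difflib.SequenceMatcher`), the threshold of 0.6, the function names, and the attribute names are all assumptions made for illustration, and only attribute names (not descriptions) are compared.

```python
from difflib import SequenceMatcher
from itertools import product


def name_similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score in [0, 1] for two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(source, target, threshold=0.6):
    """Return candidate (source_attr, target_attr, score) triples, best first.

    The threshold is an arbitrary illustrative value, not one from the paper.
    """
    candidates = [
        (s, t, name_similarity(s, t))
        for s, t in product(source, target)
    ]
    kept = [c for c in candidates if c[2] >= threshold]
    return sorted(kept, key=lambda c: c[2], reverse=True)


# Hypothetical attribute names, invented for illustration only.
source_schema = ["patient_id", "birth_date", "gender"]
target_schema = ["person_id", "date_of_birth", "gender_code"]

matches = match_schemas(source_schema, target_schema)
for s, t, score in matches:
    print(f"{s} -> {t}: {score:.2f}")
```

A purely lexical comparison like this can only find pairs whose names look alike, which is one reason a study might compare it against an LLM that also reads the element descriptions.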

arXiv information

Authors	Marcel Parciak, Brecht Vandevoort, Frank Neven, Liesbet M. Peeters, Stijn Vansummeren
Published	2024-07-16 15:33:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
