Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes

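Summary

A long-standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain-specific customization.
Given the sheer variety of potential documents, state-of-the-art systems make simplifying assumptions and rely on domain-specific training.
In this work, we ask whether large language models (LLMs) make it possible to maintain generality.
LLMs, which are pretrained on broad data, can perform diverse downstream tasks when simply conditioned on natural language task descriptions.
We propose and evaluate EVAPORATE, a simple prototype system powered by LLMs.
We identify two fundamentally different strategies for implementing this system: prompting the LLM to extract values directly from documents, or prompting the LLM to synthesize code that performs the extraction.
Our evaluations show a cost-quality tradeoff between these two approaches.
Code synthesis is cheap, but far less accurate than processing each document directly with the LLM.
To improve quality while keeping cost low, we propose EVAPORATE-CODE+, an extended code synthesis implementation that achieves better quality than direct extraction.
Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision.
EVAPORATE-CODE+ not only outperforms state-of-the-art systems, but does so using a sublinear pass over the documents with the LLM.
This equates to a 110x reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.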
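
The idea of generating many candidate functions and ensembling their extractions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the three extractor functions are hand-written stand-ins for code an LLM might synthesize, the `Date:` field and sample documents are hypothetical, and a plain majority vote stands in for the weak-supervision aggregation, whose details the abstract does not give.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

# An extractor maps a raw document to a value, or None if it finds nothing.
Extractor = Callable[[str], Optional[str]]


# Hypothetical stand-ins for LLM-synthesized candidate functions.
def by_label(doc: str) -> Optional[str]:
    # Looks for an ISO date that follows an explicit "Date:" label.
    m = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", doc)
    return m.group(1) if m else None


def first_iso_date(doc: str) -> Optional[str]:
    # Returns the first ISO-formatted date anywhere in the document.
    m = re.search(r"\d{4}-\d{2}-\d{2}", doc)
    return m.group(0) if m else None


def last_token(doc: str) -> Optional[str]:
    # A deliberately weak candidate: the last whitespace-separated token.
    tokens = doc.split()
    return tokens[-1] if tokens else None


CANDIDATES: List[Extractor] = [by_label, first_iso_date, last_token]


def ensemble_extract(doc: str, candidates: List[Extractor]) -> Optional[str]:
    # Simplified aggregation (assumption): majority vote over the
    # non-empty outputs of all candidate functions.
    votes: Counter = Counter()
    for extract in candidates:
        value = extract(doc)
        if value is not None:
            votes[value] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical documents; each one becomes a row of the output table.
docs = [
    "Report filed. Date: 2021-03-04 Status: approved",
    "Date: 2019-12-31",
]
table = [{"date": ensemble_extract(d, CANDIDATES)} for d in docs]
```

In this sketch the noisy `last_token` candidate is outvoted on the first document, so one unreliable function does not corrupt the row. Once candidate functions exist they run on every document without further LLM calls, which is consistent with the sublinear LLM pass described above.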

Abstract (original)

A long standing goal of the data management community is to develop general, automated systems that ingest semi-structured documents and output queryable tables without human effort or domain specific customization. Given the sheer variety of potential documents, state-of-the art systems make simplifying assumptions and use domain specific training. In this work, we ask whether we can maintain generality by using large language models (LLMs). LLMs, which are pretrained on broad data, can perform diverse downstream tasks simply conditioned on natural language task descriptions. We propose and evaluate EVAPORATE, a simple, prototype system powered by LLMs. We identify two fundamentally different strategies for implementing this system: prompt the LLM to directly extract values from documents or prompt the LLM to synthesize code that performs the extraction. Our evaluations show a cost-quality tradeoff between these two approaches. Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. To improve quality while maintaining low cost, we propose an extended code synthesis implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction. Our key insight is to generate many candidate functions and ensemble their extractions using weak supervision. EVAPORATE-CODE+ not only outperforms the state-of-the art systems, but does so using a sublinear pass over the documents with the LLM. This equates to a 110x reduction in the number of tokens the LLM needs to process, averaged across 16 real-world evaluation settings of 10k documents each.

arXiv information

Authors	Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, Christopher Ré
Published	2025-03-07 17:33:50+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
