Multimodal Table Understanding

Summary

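Previous table understanding methods, including recent approaches based on large language models (LLMs), have made great progress, but they rely heavily on the premise that a given table must first be converted into a text sequence (such as Markdown or HTML) to serve as model input.
In some real-world scenarios, however, such high-quality textual table representations are hard to obtain, while table images are much more accessible.
How to understand tables directly from intuitive visual information is therefore a crucial and urgent challenge for developing more practical applications.
In this paper, we propose a new problem, multimodal table understanding, in which the model must generate correct responses to various table-related requests based on a given table image.
To facilitate both model training and evaluation, we construct a large-scale dataset named MMTab, which covers a wide range of table images, instructions, and tasks.
On this basis, we develop Table-LLaVA, a generalist tabular multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks under held-in and held-out settings.
The code and data are available at https://github.com/SpursGoZmy/Table-LLaVA.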

Abstract (original)

Although great progress has been made by previous table understanding methods including recent approaches based on large language models (LLMs), they rely heavily on the premise that given tables must be converted into a certain text sequence (such as Markdown or HTML) to serve as model input. However, it is difficult to access such high-quality textual table representations in some real-world scenarios, and table images are much more accessible. Therefore, how to directly understand tables using intuitive visual information is a crucial and urgent challenge for developing more practical applications. In this paper, we propose a new problem, multimodal table understanding, where the model needs to generate correct responses to various table-related requests based on the given table image. To facilitate both the model training and evaluation, we construct a large-scale dataset named MMTab, which covers a wide spectrum of table images, instructions and tasks. On this basis, we develop Table-LLaVA, a generalist tabular multimodal large language model (MLLM), which significantly outperforms recent open-source MLLM baselines on 23 benchmarks under held-in and held-out settings. The code and data are available at https://github.com/SpursGoZmy/Table-LLaVA
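
To make the text-based premise concrete, here is a minimal sketch of what converting a table into a Markdown or HTML text sequence can look like before it is handed to a language model. It is an illustration only, not code from the Table-LLaVA repository: the function names `to_markdown` and `to_html` and the sample table are hypothetical.

```python
from html import escape


def to_markdown(header, rows):
    """Serialize a table as a Markdown pipe table (cells are not escaped)."""
    lines = [
        "| " + " | ".join(str(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def to_html(header, rows):
    """Serialize a table as a compact HTML <table> string with escaped cells."""
    head = "".join(f"<th>{escape(str(h))}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


if __name__ == "__main__":
    # Made-up sample table, used only to show the two text formats.
    header = ["item", "qty"]
    rows = [["apple", 3], ["pear", 5]]
    print(to_markdown(header, rows))
    print(to_html(header, rows))
```

This kind of serialization needs the cell contents and structure to be available as clean data. When only a screenshot or scan of the table exists, that data is not available, which is the gap the multimodal setting described above addresses.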

arXiv information

Authors	Mingyu Zheng, Xinwei Feng, Qingyi Si, Qiaoqiao She, Zheng Lin, Wenbin Jiang, Weiping Wang
Published	2024-06-12 11:27:03+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
