Multimodal Tabular Reasoning with Privileged Structured Information

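Summary

Tabular reasoning involves multi-step information extraction and logical inference over tabular data.
Recent work has leveraged large language models (LLMs) for reasoning over structured tables, but such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images.
In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs).
The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap.
To address these, we introduce TabUlar Reasoning with Bridged infOrmation ({\sc Turbo}), a new framework for multimodal tabular reasoning with privileged structured tables.
{\sc Turbo} benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, which contributes high-quality modality-bridged data.
On this basis, {\sc Turbo} repeatedly generates and selects advantageous reasoning paths, further enhancing the model's tabular reasoning ability.
Experimental results show that, with limited ($9$k) data, {\sc Turbo} achieves state-of-the-art performance ($+7.2\%$ vs. previous SOTA) across multiple datasets.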
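
The abstract does not specify how reasoning paths are generated or scored, so the following is only a minimal sketch of one common way to "generate and select advantageous reasoning paths": sample several candidate traces, score each with a reward, and keep those that beat the group's mean reward. The names `select_advantageous` and `exact_match_reward`, the exact-match reward, and the mean-reward baseline are all assumptions made for illustration, not the paper's method.

```python
from statistics import mean
from typing import Callable, List, Tuple


def select_advantageous(
    traces: List[str],
    reward_fn: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Keep traces whose reward exceeds the mean reward of the sampled group.

    Returns (trace, advantage) pairs, where advantage = reward - group mean.
    The mean baseline is an assumption; the source does not define it.
    """
    rewards = [reward_fn(t) for t in traces]
    baseline = mean(rewards)
    return [(t, r - baseline) for t, r in zip(traces, rewards) if r > baseline]


def exact_match_reward(trace: str, gold: str = "42") -> float:
    """Hypothetical reward: 1.0 if the last line of the trace equals the gold answer."""
    lines = trace.strip().splitlines()
    return 1.0 if lines and lines[-1].strip() == gold else 0.0


# Hand-written stand-ins for traces a model might sample for one table question.
traces = [
    "row 2, col 3 is 40\nadd 2\n42",
    "row 1, col 3 is 17\n17",
    "sum of column is 42\n42",
    "no value found\n0",
]
selected = select_advantageous(traces, exact_match_reward)
```

In an iterative setup, the selected traces would be fed back as training data and the sampling step repeated; that loop is omitted here because the abstract gives no details about it.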

Abstract (original)

Tabular reasoning involves multi-step information extraction and logical inference over tabular data. While recent advances have leveraged large language models (LLMs) for reasoning over structured tables, such high-quality textual representations are often unavailable in real-world settings, where tables typically appear as images. In this paper, we tackle the task of tabular reasoning from table images, leveraging privileged structured information available during training to enhance multimodal large language models (MLLMs). The key challenges lie in the complexity of accurately aligning structured information with visual representations, and in effectively transferring structured reasoning skills to MLLMs despite the input modality gap. To address these, we introduce TabUlar Reasoning with Bridged infOrmation ({\sc Turbo}), a new framework for multimodal tabular reasoning with privileged structured tables. {\sc Turbo} benefits from a structure-aware reasoning trace generator based on DeepSeek-R1, contributing to high-quality modality-bridged data. On this basis, {\sc Turbo} repeatedly generates and selects the advantageous reasoning paths, further enhancing the model’s tabular reasoning ability. Experimental results demonstrate that, with limited ($9$k) data, {\sc Turbo} achieves state-of-the-art performance ($+7.2\%$ vs. previous SOTA) across multiple datasets.

arXiv information

Authors	Jun-Peng Jiang, Yu Xia, Hai-Long Sun, Shiyin Lu, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye
Published	2025-06-04 15:46:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
