RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning

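Summary

Vision Language Models (VLMs) have recently placed growing emphasis on visual grounding in documents, aiming for better human-computer interaction, accessibility, and fine-grained understanding.
However, applying visual grounding to visualizations such as charts remains under-explored, because chart images interleave visual and numerical relationships in inherently complex ways.
Existing chart understanding methods focus mainly on answering questions and do not explicitly identify the visual elements that support their predictions.
To bridge this gap, we introduce RefChartQA, a new benchmark that integrates chart question answering (ChartQA) with visual grounding so that models can refer to elements at multiple granularities within chart images.
We also conduct a comprehensive evaluation by instruction-tuning five state-of-the-art VLMs from different categories.
Our experiments show that incorporating spatial awareness through grounding improves response accuracy by more than 15%, reduces hallucinations, and improves model reliability.
In addition, we identify key factors that influence text-spatial alignment, such as the architectural improvements in TinyChart, which uses a token-merging module for enhanced feature fusion.
Our dataset is open-sourced for community development and further progress.
All models and code will be made publicly available at https://github.com/moured/RefChartQA.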

Summary (original)

Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, its application to visualizations such as charts remains under-explored due to the inherent complexity of interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning 5 state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%, reducing hallucinations, and improving model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced for community development and further advancements. All models and code will be publicly available at https://github.com/moured/RefChartQA.
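The abstract does not say how a grounded answer is scored, so the following is only a minimal sketch of one common approach: count a prediction as correct when the answer text matches and the predicted box overlaps the reference box by intersection-over-union (IoU). The `Box` type, the `iou` and `grounded_correct` helpers, the pixel-coordinate box format, and the 0.5 threshold are all assumptions made for illustration, not details taken from RefChartQA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box as (x1, y1, x2, y2) in pixels. Hypothetical format."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # A box with x2 <= x1 or y2 <= y1 is treated as empty.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes, in [0, 1]."""
    inter = Box(
        max(a.x1, b.x1), max(a.y1, b.y1),
        min(a.x2, b.x2), min(a.y2, b.y2),
    ).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(pred_answer: str, gold_answer: str,
                     pred_box: Box, gold_box: Box,
                     iou_threshold: float = 0.5) -> bool:
    """Assumed scoring rule: answer must match AND box must overlap enough."""
    same_answer = pred_answer.strip().lower() == gold_answer.strip().lower()
    return same_answer and iou(pred_box, gold_box) >= iou_threshold
```

Under this rule a model that gives the right number while pointing at the wrong bar is marked wrong, which is the kind of unsupported answer that grounding is meant to expose.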

arXiv information

Authors	Alexander Vogel, Omar Moured, Yufan Chen, Jiaming Zhang, Rainer Stiefelhagen
Published	2025-06-18 13:17:37+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
