Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark

Summary

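This paper evaluates the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs.
The Vision Language Models (VLMs) are evaluated on five tasks tied to three foundational network science concepts: identifying the nodes of maximal degree in a rendered graph, deciding whether a signed triad is balanced or unbalanced, and counting components.
The tasks are designed to be easy for a human who understands the underlying graph-theoretic concepts, and every one of them can be solved by counting the appropriate elements of the graph.
GPT-4 consistently outperforms LLaVa, but both models struggle with every visual network analysis task the authors propose.
The authors publicly release the first benchmark for evaluating VLMs on foundational VNA tasks.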
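
To make concrete why each task reduces to counting, here is a minimal Python sketch of how ground-truth answers for the three concepts could be computed from an edge list. It is not the authors' benchmark code: the function names and the edge-list input format are hypothetical, and the balance check assumes the standard structural-balance definition, under which a triad is balanced when it contains an even number of negative edges.

```python
from collections import defaultdict


def max_degree_nodes(edges):
    """Return the sorted nodes that have the highest degree (hypothetical helper)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    if not degree:
        return []
    top = max(degree.values())
    return sorted(node for node, d in degree.items() if d == top)


def is_balanced_triad(signs):
    """Assumed definition: balanced means an even number of negative edges."""
    if len(signs) != 3:
        raise ValueError("a triad has exactly three signed edges")
    negatives = sum(1 for s in signs if s < 0)
    return negatives % 2 == 0


def count_components(nodes, edges):
    """Count connected components with an iterative depth-first search."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1  # an unseen node starts a new component
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return components
```

For example, with edges `a-b`, `a-c`, `a-d`, `e-f` and an isolated node `g`, the maximal-degree node is `a` and the graph has three components. The VLMs in the paper have to read the same quantities off a rendered image rather than an edge list.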

Abstract (original)

We evaluate the zero-shot ability of GPT-4 and LLaVa to perform simple Visual Network Analysis (VNA) tasks on small-scale graphs. We evaluate the Vision Language Models (VLMs) on 5 tasks related to three foundational network science concepts: identifying nodes of maximal degree on a rendered graph, identifying whether signed triads are balanced or unbalanced, and counting components. The tasks are structured to be easy for a human who understands the underlying graph theoretic concepts, and can all be solved by counting the appropriate elements in graphs. We find that while GPT-4 consistently outperforms LLaVa, both models struggle with every visual network analysis task we propose. We publicly release the first benchmark for the evaluation of VLMs on foundational VNA tasks.

arXiv information

Authors	Evan M. Williams, Kathleen M. Carley
Published	2024-05-10 17:51:35+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
