Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

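Summary

The emergence of large language models such as ChatGPT and Gemini has underscored the importance of evaluating their diverse capabilities, from natural language understanding to code generation. Their performance on spatial tasks, however, has not been adequately assessed. This study addresses that gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset covers twelve distinct task types, including spatial understanding and simple route planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, and gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach. First, we ran zero-shot tests. Next, we categorized the dataset by difficulty and ran prompt-tuning tests. In the first phase, gpt-4o achieved the highest overall accuracy, averaging 71.3%. Although moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o on place name recognition tasks. The study also shows how prompt strategies affect model performance on specific tasks. For example, the Chain-of-Thought (CoT) strategy raised gpt-4o's accuracy on simple route planning from 12.4% to 87.5%, and a one-shot strategy raised moonshot-v1-8k's accuracy on mapping tasks from 10.1% to 76.3%.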

Summary (original)

The emergence of large language models such as ChatGPT, Gemini, and others highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses this gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI’s gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI’s glm-4, Anthropic’s claude-3-sonnet-20240229, and MoonShot’s moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o’s accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k’s accuracy in mapping tasks from 10.1% to 76.3%.
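The two-phase procedure described in the abstract (zero-shot first, then prompt strategies such as one-shot and CoT, scored by accuracy per task type) might be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt templates, the function names `build_prompt`, `accuracy_by_task`, and `macro_average`, and the task labels in the usage note are all hypothetical, and treating the reported overall average as an unweighted mean across task types is an assumption.

```python
from collections import defaultdict


def build_prompt(question, strategy="zero_shot", example=None):
    """Build a prompt for one benchmark question (hypothetical templates)."""
    if strategy == "zero_shot":
        # Phase 1: the question alone, with no examples or reasoning hints.
        return f"Question: {question}\nAnswer:"
    if strategy == "one_shot":
        # Phase 2 option: prepend a single worked (question, answer) pair.
        if example is None:
            raise ValueError("one_shot requires a (question, answer) example")
        ex_question, ex_answer = example
        return (
            f"Question: {ex_question}\nAnswer: {ex_answer}\n\n"
            f"Question: {question}\nAnswer:"
        )
    if strategy == "cot":
        # Phase 2 option: ask the model to reason before answering.
        return (
            f"Question: {question}\n"
            "Let's think step by step, then give the final answer.\nAnswer:"
        )
    raise ValueError(f"unknown strategy: {strategy}")


def accuracy_by_task(records):
    """Map each task type to its accuracy in percent.

    records is an iterable of (task_type, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for task_type, is_correct in records:
        totals[task_type] += 1
        correct[task_type] += int(bool(is_correct))
    return {task: 100.0 * correct[task] / totals[task] for task in totals}


def macro_average(per_task):
    """Unweighted mean of per-task accuracies (an assumed aggregation)."""
    return sum(per_task.values()) / len(per_task)
```

For instance, scoring the made-up records `[("route_planning", True), ("route_planning", False), ("place_name", True)]` gives 50% on the first task, 100% on the second, and an unweighted average of 75%.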

arXiv information

Authors	Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du
Published	2025-01-03 03:03:32+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
