Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

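Summary

Growing interest in multimodal large language models (MLLMs), such as OpenAI's GPT-4V(ision), has had a major impact on both academia and industry.
These models augment large language models (LLMs) with advanced visual understanding, making them easier to apply to a wide range of multimodal tasks.
Google recently introduced Gemini, a state-of-the-art MLLM designed specifically for multimodal integration.
Despite these advances, preliminary benchmarks suggest that Gemini lags behind GPT models on commonsense reasoning tasks.
However, that assessment rests on a limited dataset (namely HellaSWAG) and does not fully capture Gemini's true commonsense reasoning potential.
To address this gap, our study thoroughly evaluates Gemini's performance on complex reasoning tasks that require integrating commonsense knowledge across modalities.
We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks.
These comprise 11 language-only datasets and one dataset that incorporates multimodal elements.
Experiments across four LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning ability.
We also identify challenges common to current LLMs and MLLMs on commonsense problems, underscoring the need for further progress in strengthening these models' commonsense reasoning.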

Abstract (original)

The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI’s GPT-4V(ision), has significantly impacted both academic and industrial realms. These models enhance Large Language Models (LLMs) with advanced visual understanding capabilities, facilitating their application in a variety of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks. However, this assessment, based on a limited dataset (i.e., HellaSWAG), does not fully capture Gemini’s authentic commonsense reasoning potential. To address this gap, our study undertakes a thorough evaluation of Gemini’s performance in complex reasoning tasks that necessitate the integration of commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks. This includes 11 datasets focused solely on language, as well as one that incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini’s competitive commonsense reasoning capabilities. Additionally, we identify common challenges faced by current LLMs and MLLMs in addressing commonsense problems, underscoring the need for further advancements in enhancing the commonsense reasoning abilities of these models.

arXiv information

Authors	Yuqing Wang, Yun Zhao
Published	2023-12-29 15:57:49+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
