Measuring Vision-Language STEM Skills of Neural Models

Summary

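We introduce a new challenge that tests the STEM skills of neural models.
Real-world problems often require solutions that combine knowledge from STEM (science, technology, engineering, and math).
Unlike existing datasets, ours requires understanding multimodal vision-language information in STEM.
It is one of the largest and most comprehensive datasets for this challenge.
It includes 448 skills and 1,073,146 questions spanning all STEM subjects.
Whereas existing datasets often focus on examining expert-level ability, ours covers fundamental skills and questions designed around the K-12 curriculum.
We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark.
The results show that recent model advances help master only a very limited number of lower grade-level skills in our dataset (2.5% in the third grade).
In fact, these models average 54.7%, which is still well below the performance of elementary students, let alone near expert-level performance.
To understand and improve performance on our dataset, we train the models on its training split.
Although performance improves, it remains relatively low compared with that of average elementary students.
Solving STEM problems will require novel algorithmic innovations from the community.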

Summary (original)

We introduce a new challenge to test the STEM skills of neural models. The problems in the real world often require solutions, combining knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, our dataset requires the understanding of multimodal vision-language information of STEM. Our dataset features one of the largest and most comprehensive datasets for the challenge. It includes 448 skills and 1,073,146 questions spanning all STEM subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark. Results show that the recent model advances only help master a very limited number of lower grade-level skills (2.5% in the third grade) in our dataset. In fact, these models are still well below (averaging 54.7%) the performance of elementary students, not to mention near expert-level performance. To understand and increase the performance on our dataset, we teach the models on a training split of our dataset. Even though we observe improved performance, the model performance remains relatively low compared to average elementary students. To solve STEM problems, we will need novel algorithmic innovations from the community.

arXiv information

Authors	Jianhao Shen, Ye Yuan, Srbuhi Mirzoyan, Ming Zhang, Chenguang Wang
Published	2024-04-19 03:10:31+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
