WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models

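Summary

Vision-and-language models perform well on tasks such as visual question answering, but they struggle with basic human commonsense reasoning skills.
In this work, we introduce WinoGAViL, an online game of vision-and-language associations (for example, between werewolves and a full moon) that serves as a dynamic evaluation benchmark.
Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them.
Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players.
We use the game to collect 3.5K instances and find that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models: the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient.
Our analysis, together with the feedback we collect from players, indicates that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more.
We release the dataset, the code, and the interactive game, allowing future data collection that can be used to develop models with better association abilities.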

Summary (original)

While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code and the interactive game, allowing future data collection that can be used to develop models with better association abilities.
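
The scores quoted above are Jaccard indices, which compare the set of candidates a solver selects with the set the spymaster associated with the cue. The following is a minimal sketch of that measure, not the authors' evaluation code: the function name, the example cue, and the candidate labels are all hypothetical, and the handling of two empty selections is an assumption.

```python
def jaccard_index(predicted, gold):
    """Return |A ∩ B| / |A ∪ B| for two collections of candidate labels."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    if not union:
        # Assumption: two empty selections are treated as a perfect match.
        return 1.0
    return len(predicted & gold) / len(union)


# Hypothetical instance: the cue is "werewolf" and the spymaster
# associated it with two of the image candidates.
gold = {"full moon", "wolf"}
predicted = {"full moon", "dog"}  # what a solver (human or model) selected

score = jaccard_index(predicted, gold)
print(f"Jaccard index: {score:.2f}")
```

Here one of the three distinct labels is shared, so the solver gets partial credit rather than a flat zero, which is why a set-overlap measure suits a task where several candidates can be correct at once.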

arXiv information

Authors	Yonatan Bitton,Nitzan Bitton Guetta,Ron Yosef,Yuval Elovici,Mohit Bansal,Gabriel Stanovsky,Roy Schwartz
Published	2022-10-11 13:59:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
