TraVLR: Now You See It, Now You Don’t! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning

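Abstract

Many visio-linguistic (V+L) representation learning methods have been developed, but existing datasets do not adequately evaluate how far these methods represent visual and linguistic concepts in a unified space. We propose several new evaluation settings for V+L models, including cross-modal transfer. In addition, existing V+L benchmarks often report a global accuracy score over the entire dataset, which makes it difficult to identify the specific reasoning tasks on which a model fails or succeeds. We present TraVLR, a synthetic dataset consisting of four V+L reasoning tasks. Because TraVLR is synthetic, its training and test distributions can be constrained along task-relevant dimensions, which makes it possible to evaluate out-of-distribution generalisation. Each example in TraVLR redundantly encodes the scene in two modalities, so either modality can be dropped or added during training or testing without losing relevant information. Comparing the performance of four state-of-the-art V+L models, we find that although they perform well on test examples from the same modality, they fail at cross-modal transfer and have only limited success in accommodating the addition or removal of one modality. We release TraVLR as an open challenge for the research community.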

Abstract (original)

Numerous visio-linguistic (V+L) representation learning methods have been developed, yet existing datasets do not adequately evaluate the extent to which they represent visual and linguistic concepts in a unified space. We propose several novel evaluation settings for V+L models, including cross-modal transfer. Furthermore, existing V+L benchmarks often report global accuracy scores on the entire dataset, making it difficult to pinpoint the specific reasoning tasks that models fail and succeed at. We present TraVLR, a synthetic dataset comprising four V+L reasoning tasks. TraVLR’s synthetic nature allows us to constrain its training and testing distributions along task-relevant dimensions, enabling the evaluation of out-of-distribution generalisation. Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information. We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer and have limited success accommodating the addition or deletion of one modality. We release TraVLR as an open challenge for the research community.

arXiv information

Authors	Keng Ji Chow, Samson Tan, Min-Yen Kan
Published	2023-03-04 12:57:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
