Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

Summary

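Referring expression comprehension (REC) is the task of localizing a target instance in an image from a textual description.
Recent progress in REC has been driven by large multimodal models (LMMs) such as CogVLM, which reaches 92.44% accuracy on RefCOCO.
This study asks whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg actually capture the full capabilities of LMMs.
A manual examination of these benchmarks reveals high labeling error rates (14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg), which undermines the reliability of evaluations on them.
The authors address this by excluding the problematic instances and re-evaluating several LMMs capable of handling the REC task; accuracy improves significantly, which highlights the impact of benchmark noise.
In response, they introduce Ref-L4, a comprehensive REC benchmark designed specifically to evaluate modern REC models.
Ref-L4 is distinguished by four key features: 1) a substantial sample size of 45,341 annotations;
2) a diverse range of object categories, with 365 distinct types and instance scales varying from 30 to 3,767;
3) lengthy referring expressions averaging 24.2 words;
and 4) an extensive vocabulary of 22,813 unique words.
A total of 24 large models are evaluated on Ref-L4, and the results are analyzed.
The cleaned versions of RefCOCO, RefCOCO+, and RefCOCOg, together with the Ref-L4 benchmark and evaluation code, are available at https://github.com/JierunChen/Ref-L4.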
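
The re-evaluation step in the summary (drop instances flagged as mislabeled, then recompute accuracy) can be illustrated with a small sketch. It is not the authors' evaluation code. It assumes the conventional REC criterion that a prediction counts as correct when its IoU with the ground-truth box is at least 0.5, boxes in a hypothetical (x1, y1, x2, y2) format, and made-up instance IDs and coordinates.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(
    preds: Dict[str, Box],
    gts: Dict[str, Box],
    exclude: Iterable[str] = (),
    threshold: float = 0.5,  # assumed IoU threshold, not taken from the source
) -> float:
    """Fraction of kept instances whose predicted box meets the IoU threshold."""
    excluded = set(exclude)
    kept = [k for k in gts if k not in excluded]
    if not kept:
        return 0.0
    hits = sum(
        1 for k in kept if k in preds and iou(preds[k], gts[k]) >= threshold
    )
    return hits / len(kept)


# Toy data: instance "d" stands in for an annotation flagged as mislabeled.
gts = {
    "a": (0, 0, 10, 10),
    "b": (0, 0, 10, 10),
    "c": (20, 20, 30, 30),
    "d": (0, 0, 4, 4),
}
preds = {
    "a": (0, 0, 10, 10),
    "b": (1, 1, 10, 10),
    "c": (20, 20, 30, 30),
    "d": (10, 10, 14, 14),
}

noisy_acc = rec_accuracy(preds, gts)
clean_acc = rec_accuracy(preds, gts, exclude={"d"})
print(noisy_acc, clean_acc)
```

In this toy setup the only miss is the flagged instance, so excluding it raises the measured accuracy. This is the kind of effect the summary attributes to benchmark noise, although the real improvements depend on the model and the benchmark.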

Abstract (original)

Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg, capture LMMs’ comprehensive capabilities. We begin with a manual examination of these benchmarks, revealing high labeling error rates: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg, which undermines the authenticity of evaluations. We address this by excluding problematic instances and reevaluating several LMMs capable of handling the REC task, showing significant accuracy improvements, thus highlighting the impact of benchmark noise. In response, we introduce Ref-L4, a comprehensive REC benchmark, specifically designed to evaluate modern REC models. Ref-L4 is distinguished by four key features: 1) a substantial sample size with 45,341 annotations; 2) a diverse range of object categories with 365 distinct types and varying instance scales from 30 to 3,767; 3) lengthy referring expressions averaging 24.2 words; and 4) an extensive vocabulary comprising 22,813 unique words. We evaluate a total of 24 large models on Ref-L4 and provide valuable insights. The cleaned versions of RefCOCO, RefCOCO+, and RefCOCOg, as well as our Ref-L4 benchmark and evaluation code, are available at https://github.com/JierunChen/Ref-L4.

arXiv information

Authors	Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, Hongyang Zhang
Published	2024-06-24 17:59:58+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
