What’s in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

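Abstract

Although automated video description has advanced considerably, the limited generalization of description models to new domains remains a major barrier to using these systems in the real world.
Most visual description methods are known to capture and exploit patterns in the training data that raise evaluation metric scores, but what are those patterns?
In this work, we examine several popular visual description datasets and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but that do not generalize to new domains.
At the token, sample, and dataset levels, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions.
We further show that state-of-the-art models even outperform held-out ground-truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in the datasets.
Understanding this linguistic diversity is key to building strong captioning models. We recommend several methods and approaches for maintaining diversity when collecting new data and for dealing with the consequences of limited diversity when using current models and metrics.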

Abstract (original)

While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in datasets. Understanding this linguistic diversity is key to building strong captioning models, and we recommend several methods and approaches for maintaining diversity in the collection of new data, and dealing with the consequences of limited diversity when using current models and metrics.

arXiv information

Authors	David M. Chan,Austin Myers,Sudheendra Vijayanarasimhan,David A. Ross,Bryan Seybold,John F. Canny
Published	2023-01-12 19:24:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
