Benchmarking Robustness of Multimodal Image-Text Models under Distribution Shift

Abstract

Multimodal image-text models have shown remarkable performance in the past few years. However, evaluating robustness against distribution shifts is crucial before adopting them in real-world applications. In this work, we investigate the robustness of 12 popular open-sourced image-text models under common perturbations on five tasks (image-text retrieval, visual reasoning, visual entailment, image captioning, and text-to-image generation). In particular, we propose several new multimodal robustness benchmarks by applying 17 image perturbation and 16 text perturbation techniques on top of existing datasets. We observe that multimodal models are not robust to image and text perturbations, especially to image perturbations. Among the tested perturbation methods, character-level perturbations constitute the most severe distribution shift for text, and zoom blur is the most severe shift for image data. We also introduce two new robustness metrics (MMI for MultiModal Impact score and MOR for Missing Object Rate) for proper evaluations of multimodal models. We hope our extensive study sheds light on new directions for the development of robust multimodal models. More details can be found on the project webpage: https://MMRobustness.github.io.
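
The abstract does not say which character-level perturbations the benchmark applies, so the following is only a minimal sketch, assuming adjacent-letter swaps as one plausible instance of the idea. The function name `swap_adjacent_chars` and its `rate` and `seed` parameters are hypothetical and are not taken from the paper or its code.

```python
import random


def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Hypothetical character-level perturbation: swap neighbouring letters.

    This is an illustrative assumption, not the benchmark's actual method.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        # Only swap two neighbouring letters, so spaces and punctuation stay put.
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


caption = "a dog runs across the grassy field"
perturbed = swap_adjacent_chars(caption, rate=0.3, seed=0)
print(perturbed)
```

Feeding a model the perturbed caption in place of the original and comparing task scores gives a rough feel for how text-side distribution shift is measured. The swap keeps the string length and word boundaries fixed, so any score change comes from the typos alone.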

arXiv information

Authors	Jielin Qiu, Yi Zhu, Xingjian Shi, Florian Wenzel, Zhiqiang Tang, Ding Zhao, Bo Li, Mu Li
Published	2024-01-19 15:29:34+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
