Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning

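Summary

Generating detailed captions that capture the text-rich visual content of images has attracted growing attention for Large Vision-Language Models (LVLMs).
However, few studies have developed benchmarks designed specifically for detailed captions that measure their accuracy and comprehensiveness.
This paper introduces CompreCap, a detailed caption benchmark that evaluates visual context from the viewpoint of a directed scene graph.
Concretely, the authors first manually segment each image into semantically meaningful regions (i.e., semantic segmentation masks) according to a common-object vocabulary, and also annotate the attributes of the objects within all of those regions.
Next, directional relation labels between these objects are annotated, forming a directed scene graph that encodes the rich compositional information of the image.
Based on this directed scene graph, they develop a pipeline that evaluates the detailed captions generated by LVLMs at multiple levels, including object-level coverage, the accuracy of attribute descriptions, and the score of key relationships.
Experimental results on the CompreCap dataset confirm that the evaluation method aligns closely with human evaluation scores across LVLMs.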

Abstract (original)

Generating detailed captions comprehending text-rich visual content in images has received growing attention for Large Vision-Language Models (LVLMs). However, few studies have developed benchmarks specifically tailored for detailed captions to measure their accuracy and comprehensiveness. In this paper, we introduce a detailed caption benchmark, termed as CompreCap, to evaluate the visual context from a directed scene graph view. Concretely, we first manually segment the image into semantically meaningful regions (i.e., semantic segmentation mask) according to common-object vocabulary, while also distinguishing attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode rich compositional information of the image. Based on our directed scene graph, we develop a pipeline to assess the generated detailed captions from LVLMs on multiple levels, including the object-level coverage, the accuracy of attribute descriptions, the score of key relationships, etc. Experimental results on the CompreCap dataset confirm that our evaluation method aligns closely with human evaluation scores across LVLMs.
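
To make the multi-level idea concrete, here is a minimal Python sketch of how a caption could be scored against a directed scene graph. It is not the CompreCap pipeline: the `SceneGraph` schema, the function names, and the whole-word matching are assumptions made for illustration, and the paper's actual matching and scoring rules are not given in the abstract.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical annotation schema, not the paper's data format."""
    # object name -> annotated attribute words
    objects: dict[str, list[str]] = field(default_factory=dict)
    # directed edges as (subject, relation, object)
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def _mentions(caption: str, phrase: str) -> bool:
    """Case-insensitive whole-word match of `phrase` in `caption`."""
    return re.search(rf"\b{re.escape(phrase)}\b", caption, re.IGNORECASE) is not None


def object_coverage(caption: str, graph: SceneGraph) -> float:
    """Fraction of annotated objects that the caption names."""
    if not graph.objects:
        return 0.0
    hits = sum(_mentions(caption, name) for name in graph.objects)
    return hits / len(graph.objects)


def attribute_accuracy(caption: str, graph: SceneGraph) -> float:
    """Among objects the caption names, fraction of their attributes it also states."""
    total = hit = 0
    for name, attrs in graph.objects.items():
        if not _mentions(caption, name):
            continue
        for attr in attrs:
            total += 1
            hit += _mentions(caption, attr)
    return hit / total if total else 0.0


def relation_score(caption: str, graph: SceneGraph) -> float:
    """Fraction of directed edges whose subject, relation, and object appear in that order."""
    if not graph.relations:
        return 0.0
    hits = 0
    for subj, rel, obj in graph.relations:
        pattern = rf"\b{re.escape(subj)}\b.*\b{re.escape(rel)}\b.*\b{re.escape(obj)}\b"
        hits += re.search(pattern, caption, re.IGNORECASE | re.DOTALL) is not None
    return hits / len(graph.relations)


# Toy example with made-up annotations.
graph = SceneGraph(
    objects={"dog": ["brown"], "ball": ["red"], "tree": ["tall"]},
    relations=[("dog", "chasing", "ball")],
)
caption = "A brown dog is chasing a ball on the grass."

print(object_coverage(caption, graph))     # 2 of 3 objects are named
print(attribute_accuracy(caption, graph))  # 1 of 2 attributes of named objects
print(relation_score(caption, graph))      # the single directed edge is matched
```

Keeping the three scores separate mirrors the abstract's distinction between object coverage, attribute accuracy, and relationships. A real evaluator would need far more robust matching (synonyms, paraphrase, passive voice) than this string search.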

arXiv information

Authors	Fan Lu, Wei Wu, Kecheng Zheng, Shuailei Ma, Biao Gong, Jiawei Liu, Wei Zhai, Yang Cao, Yujun Shen, Zheng-Jun Zha
Published	2024-12-12 06:33:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
