RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Summary

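Image recaptioning is widely used to generate higher-quality training datasets for a variety of multimodal tasks.
Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enrich textual descriptions, but they often suffer from inaccuracies caused by hallucinations and from incompleteness caused by missing fine-grained details.
To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction.
Specifically, we use a text-to-image model to reconstruct a caption into a reference image, then prompt an MLLM to identify discrepancies between the original image and the reconstructed image and to refine the caption accordingly.
This process is repeated iteratively, progressively yielding more faithful and comprehensive descriptions.
To mitigate the additional computational cost of the iterative process, we introduce RICO-Flash, which uses DPO to learn to generate captions like those produced by RICO.
Extensive experiments show that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap.
Code is released at https://github.com/wangyuchi369/RICO.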

Abstract (original)

Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, further progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforms most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
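
The iterative loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `refine_caption`, `text_to_image`, `revise_caption`, and `num_iterations` are hypothetical names, and the two toy stand-ins replace a real text-to-image model and a real MLLM with simple set operations so the example runs on its own.

```python
from typing import Any, Callable, List


def refine_caption(
    image: Any,
    initial_caption: str,
    text_to_image: Callable[[str], Any],
    revise_caption: Callable[[Any, Any, str], str],
    num_iterations: int = 2,
) -> List[str]:
    """Return the caption history across reconstruction-and-refinement rounds.

    All parameter names are assumptions made for this sketch.
    """
    history = [initial_caption]
    caption = initial_caption
    for _ in range(num_iterations):
        # Step 1: reconstruct a reference image from the current caption.
        reconstructed = text_to_image(caption)
        # Step 2: compare original and reconstruction, then revise the caption.
        caption = revise_caption(image, reconstructed, caption)
        history.append(caption)
    return history


# Toy stand-ins (assumptions): an "image" is just a set of attribute words.
def toy_text_to_image(caption: str) -> set:
    return set(caption.split())


def toy_revise(original: set, reconstructed: set, caption: str) -> str:
    missing = sorted(original - reconstructed)   # details the caption omitted
    hallucinated = reconstructed - original      # details not in the image
    kept = [word for word in caption.split() if word not in hallucinated]
    return " ".join(kept + missing[:1])          # repair one omission per round


history = refine_caption(
    image={"red", "car", "street"},
    initial_caption="blue car",
    text_to_image=toy_text_to_image,
    revise_caption=toy_revise,
)
```

In this toy run the hallucinated word is dropped in the first round and one missing detail is added per round, which mirrors the idea that each reconstruction exposes discrepancies the next caption can correct. In a real setting the two callables would wrap a text-to-image model and an MLLM prompt.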

arXiv information

Authors	Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
Published	2025-05-28 17:29:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
