Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Abstract

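We present a self-supervised method to improve an agent's ability to describe arbitrary objects while actively exploring a generic environment.
This is a challenging problem, because current models struggle to produce consistent image captions under varying camera viewpoints and clutter.
We propose a three-phase framework for fine-tuning existing captioning models that improves caption accuracy and consistency across views through a consensus mechanism.
First, an agent explores the environment and collects noisy image-caption pairs.
Next, a consistent pseudo-caption for each object instance is distilled by consensus using a large language model.
Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning.
We analyse the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on a manually labeled test set.
The results show that a policy can be trained to mine samples with higher disagreement than classical baselines.
In combination with all policies, our pseudo-captioning method achieves higher semantic similarity than other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.
Code and test set annotations are available at https://hsp-iit.github.io/embodied-captioning/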

Abstract (original)

We present a self-supervised method to improve an agent’s abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations available at https://hsp-iit.github.io/embodied-captioning/
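The consensus step described above, in which one pseudo-caption is chosen per object instance from noisy captions gathered across views, can be illustrated with a much simpler stand-in. The sketch below is not the authors' method: the abstract states that consensus is distilled with a large language model, whereas this example simply picks the caption that agrees most with the others under token-level Jaccard similarity. The function names (`tokenize`, `jaccard`, `consensus_caption`) and the example captions are hypothetical.

```python
def tokenize(caption):
    """Lowercase a caption, drop basic punctuation, and return its set of words."""
    cleaned = caption.lower().replace(".", "").replace(",", "")
    return set(cleaned.split())


def jaccard(a, b):
    """Token-level Jaccard similarity between two captions, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consensus_caption(captions):
    """Return the caption with the highest total similarity to all the others.

    Assumption: this medoid-style selection stands in for the LLM-based
    consensus described in the abstract; it is an illustration only.
    """
    if not captions:
        raise ValueError("at least one caption is required")
    best, best_score = None, -1.0
    for i, candidate in enumerate(captions):
        score = sum(
            jaccard(candidate, other)
            for j, other in enumerate(captions)
            if j != i
        )
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical captions of one object instance seen from four viewpoints.
views = [
    "a red mug on a table",
    "a red mug on the table",
    "a blue chair",
    "a red mug",
]
pseudo_caption = consensus_caption(views)
```

The outlier caption ("a blue chair") shares almost no words with the rest, so it is never selected. A language model can additionally merge complementary details from several captions, which a selection rule like this one cannot do.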

arXiv information

Authors	Tommaso Galliena, Tommaso Apicella, Stefano Rosa, Pietro Morerio, Alessio Del Bue, Lorenzo Natale
Published	2025-04-11 13:41:17+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
