Improving Image Captioning by Mimicking Human Reformulation Feedback at Inference-time

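Summary

Incorporating automatically predicted human feedback into the training of generative models has recently attracted considerable interest, whereas feedback at inference time has received less attention.
The typical training-time feedback, a preference between two samples, does not transfer naturally to the inference phase.
We introduce a new type of feedback, caption reformulation, and train models to mimic reformulation feedback based on human annotations.
Because our method does not require training the image captioning model itself, it needs substantially less computation.
We experiment with two types of reformulation feedback. First, we collect a dataset of human reformulations that correct errors in generated captions.
Incorporating reformulation models trained on this data into the inference phase of existing image captioning models improves the captions, especially when the original captions are of low quality.
We apply the method to non-English image captioning, a domain where robust models are less prevalent, and obtain substantial improvements.
Second, we apply reformulation to style transfer.
Quantitative evaluation shows state-of-the-art performance on German image captioning and English style transfer, and human validation with a detailed comparative framework reveals the specific axes of improvement.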

Summary (original)

Incorporating automatically predicted human feedback into the process of training generative models has attracted substantial recent interest, while feedback at inference time has received less attention. The typical feedback at training time, i.e., preferences of choice given two samples, does not naturally transfer to the inference phase. We introduce a novel type of feedback — caption reformulations — and train models to mimic reformulation feedback based on human annotations. Our method does not require training the image captioning model itself, thereby demanding substantially less computational effort. We experiment with two types of reformulation feedback: first, we collect a dataset of human reformulations that correct errors in the generated captions. We find that incorporating reformulation models trained on this data into the inference phase of existing image captioning models results in improved captions, especially when the original captions are of low quality. We apply our method to non-English image captioning, a domain where robust models are less prevalent, and gain substantial improvement. Second, we apply reformulations to style transfer. Quantitative evaluations reveal state-of-the-art performance on German image captioning and English style transfer, while human validation with a detailed comparative framework exposes the specific axes of improvement.
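
The inference-time setup described in the abstract can be pictured as a two-stage pipeline: a frozen captioning model produces a draft, and a separately trained reformulation model rewrites it. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `caption_with_reformulation`, `toy_captioner`, and `toy_reformulator` are hypothetical, images are represented by plain file-name strings, and the assumption that the reformulator receives both the image and the draft caption is ours (the abstract does not specify its inputs).

```python
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
# a captioner maps an image identifier to a draft caption, and a
# reformulator maps (image identifier, draft caption) to a revised caption.
Captioner = Callable[[str], str]
Reformulator = Callable[[str, str], str]


def caption_with_reformulation(
    image: str, captioner: Captioner, reformulator: Reformulator
) -> str:
    """Run the frozen captioner, then apply the reformulation model once."""
    draft = captioner(image)  # the captioner itself is never retrained
    return reformulator(image, draft)


def toy_captioner(image: str) -> str:
    # Stand-in for an existing captioning model; always emits the same draft.
    return "a cat sitting on a sofa"


def toy_reformulator(image: str, draft: str) -> str:
    # Stand-in for a trained reformulation model: a lookup table of
    # corrections keyed by image, mimicking a human fixing an error.
    corrections = {"dog.jpg": ("cat", "dog")}
    if image in corrections:
        wrong, right = corrections[image]
        return draft.replace(wrong, right)
    return draft  # nothing to correct, so the draft is kept


if __name__ == "__main__":
    print(caption_with_reformulation("dog.jpg", toy_captioner, toy_reformulator))
    print(caption_with_reformulation("cat.jpg", toy_captioner, toy_reformulator))
```

Keeping the two stages behind separate callables mirrors the point made in the abstract: the captioning model is left untouched, so only the (presumably smaller) reformulation step needs training.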

arXiv information

Authors	Uri Berger, Omri Abend, Lea Frermann, Gabriel Stanovsky
Published	2025-01-08 14:00:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
