Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning

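Summary

Accuracy and diversity are two essential, measurable properties of natural and semantically correct captions.
Because of the trade-off between them, many efforts have improved one at the expense of the other.
In this work, we show that the weaker accuracy standard derived from human annotations (leave-one-out) is not appropriate for machine-generated captions.
To improve diversity while maintaining solid accuracy, we use a novel Variational Transformer framework.
By introducing the ‘Invisible Information Prior’ and the ‘Auto-selectable GMM’, we instruct the encoder to learn precise language information and object relations in different scenes, which ensures accuracy.
By introducing the ‘Range-Median Reward’ baseline, we retain more diverse candidates with higher rewards during RL-based training, which ensures diversity.
Experiments show that our method improves accuracy (CIDEr) and diversity (self-CIDEr) simultaneously, by up to 1.1 and 4.8 percent, respectively.
Our method also achieves the semantic retrieval performance closest to that of human annotations, with an R@1 (i2t) of 50.3 versus 50.6 for humans.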

Summary (original)

Accuracy and Diversity are two essential metrizable manifestations in generating natural and semantically correct captions. Many efforts have been made to enhance one of them with another decayed due to the trade-off gap. In this work, we will show that the inferior standard of accuracy draws from human annotations (leave-one-out) are not appropriate for machine-generated captions. To improve diversity with a solid accuracy performance, we exploited a novel Variational Transformer framework. By introducing the ‘Invisible Information Prior’ and the ‘Auto-selectable GMM’, we instruct the encoder to learn the precise language information and object relation in different scenes for accuracy assurance. By introducing the ‘Range-Median Reward’ baseline, we retain more diverse candidates with higher rewards during the RL-based training process for diversity assurance. Experiments show that our method achieves the simultaneous promotion of accuracy (CIDEr) and diversity (self-CIDEr), up to 1.1 and 4.8 percent. Also, our method got the most similar performance of the semantic retrieval compared to human annotations, with 50.3 (50.6 of human) for R@1(i2t).

arXiv information

Authors	Longzhen Yang, Yihang Liu, Yitao Peng, Lianghua He
Published	2022-09-21 12:21:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
