Chain-of-Talkers (CoTalk): Fast Human Annotation of Dense Image Captions

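Summary

Densely annotated image captions significantly facilitate the learning of robust vision-language alignment, yet methodologies for systematically optimizing human annotation effort remain underexplored.
We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under a fixed budget constraint (e.g., total human annotation time).
The framework is built on two key insights.
First, sequential annotation reduces redundant workload compared with conventional parallel annotation, because subsequent annotators only need to annotate the "residual": the visual information that earlier annotations have not covered.
Second, humans take in text faster by reading, but produce annotations with much higher throughput by talking.
A multimodal interface therefore enables optimized efficiency.
We evaluate the framework from two aspects. Intrinsic evaluation assesses the comprehensiveness of semantic units, which are obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections.
Extrinsic evaluation measures the practical usefulness of the annotated captions for facilitating vision-language alignment.
In experiments with eight participants, CoTalk improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13% vs. 40.52%) over the parallel method.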
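
The first insight, that later annotators only need to cover the residual, can be illustrated with a toy model. The sketch below is not the paper's implementation: the helper names `parallel_annotate` and `sequential_annotate`, the example semantic units, and the choice to measure workload as "units written" are all assumptions made purely for illustration.

```python
from typing import List, Set, Tuple

def parallel_annotate(views: List[Set[str]]) -> Tuple[Set[str], int]:
    """Each annotator independently writes down every unit they notice."""
    covered: Set[str] = set()
    units_written = 0
    for seen in views:
        units_written += len(seen)   # overlapping units are written again
        covered |= seen
    return covered, units_written

def sequential_annotate(views: List[Set[str]]) -> Tuple[Set[str], int]:
    """Each annotator sees earlier annotations and adds only the residual."""
    covered: Set[str] = set()
    units_written = 0
    for seen in views:
        residual = seen - covered    # units not yet covered by anyone
        units_written += len(residual)
        covered |= residual
    return covered, units_written

# Hypothetical semantic units noticed by three annotators for one image.
views = [
    {"dog", "dog:brown", "grass"},
    {"dog", "grass", "ball", "ball:red"},
    {"dog:brown", "ball", "fence"},
]

par_cov, par_work = parallel_annotate(views)
seq_cov, seq_work = sequential_annotate(views)
print(f"parallel:   {len(par_cov)} units covered, {par_work} units written")
print(f"sequential: {len(seq_cov)} units covered, {seq_work} units written")
```

In this toy setting both schemes reach the same coverage, but the sequential one avoids rewriting units that an earlier annotator already recorded. Real annotators also spend time reading the existing annotations, which this sketch ignores.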

Abstract (original)

While densely annotated image captions significantly facilitate the learning of robust vision-language alignment, methodologies for systematically optimizing human annotation efforts remain underexplored. We introduce Chain-of-Talkers (CoTalk), an AI-in-the-loop methodology designed to maximize the number of annotated samples and improve their comprehensiveness under fixed budget constraints (e.g., total human annotation time). The framework is built upon two key insights. First, sequential annotation reduces redundant workload compared to conventional parallel annotation, as subsequent annotators only need to annotate the “residual”, i.e., the missing visual information that previous annotations have not covered. Second, humans process textual input faster by reading while outputting annotations with much higher throughput via talking; thus a multimodal interface enables optimized efficiency. We evaluate our framework from two aspects: intrinsic evaluations that assess the comprehensiveness of semantic units, obtained by parsing detailed captions into object-attribute trees and analyzing their effective connections; extrinsic evaluation measures the practical usage of the annotated captions in facilitating vision-language alignment. Experiments with eight participants show our Chain-of-Talkers (CoTalk) improves annotation speed (0.42 vs. 0.30 units/sec) and retrieval performance (41.13% vs. 40.52%) over the parallel method.
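
To put the reported figures in perspective, the short calculation below derives the relative speedup and the retrieval gain from the numbers in the abstract. The one-hour annotation budget is a hypothetical value chosen for illustration, and treating throughput as constant over that hour is a simplifying assumption, not a claim from the paper.

```python
# Figures reported in the abstract.
cotalk_speed = 0.42      # semantic units per second (CoTalk)
parallel_speed = 0.30    # semantic units per second (parallel baseline)
cotalk_retrieval = 41.13    # retrieval performance, percent
parallel_retrieval = 40.52  # retrieval performance, percent

# Relative annotation speedup and absolute retrieval gain.
speedup = cotalk_speed / parallel_speed
retrieval_gain_points = cotalk_retrieval - parallel_retrieval

# Hypothetical fixed budget: one hour of total human annotation time.
budget_seconds = 3600
units_cotalk = cotalk_speed * budget_seconds
units_parallel = parallel_speed * budget_seconds

print(f"speedup: {speedup:.2f}x")
print(f"retrieval gain: {retrieval_gain_points:.2f} percentage points")
print(f"units in one hour: {units_cotalk:.0f} (CoTalk) vs {units_parallel:.0f} (parallel)")
```

Under these assumptions the speed difference amounts to roughly 40% more semantic units for the same annotation time, while the retrieval improvement is a little over half a percentage point.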

arXiv information

Authors	Yijun Shen, Delong Chen, Fan Liu, Xingyu Wang, Chuanyi Zhang, Liang Yao, Yuhui Zheng
Published	2025-05-28 17:45:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
