StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation

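Summary

Visual storytelling systems struggle to maintain character identity across frames and to link actions to the appropriate subjects, which often leads to referential hallucinations.
These problems can be addressed by grounding characters, objects, and other entities in the visual elements.
We propose StoryReasoning, a dataset of 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories.
Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations.
Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames.
We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story.
Evaluation shows that hallucinations fall from 4.06 to 3.56 per story on average (-12.3%) compared with a non-fine-tuned model.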

Summary (original)

Visual storytelling systems struggle to maintain character identity across frames and link actions to appropriate subjects, frequently leading to referential hallucinations. These issues can be addressed through grounding of characters, objects, and other entities on the visual elements. We propose StoryReasoning, a dataset containing 4,178 stories derived from 52,016 movie images, with both structured scene analyses and grounded stories. Each story maintains character and object consistency across frames while explicitly modeling multi-frame relationships through structured tabular representations. Our approach features cross-frame object re-identification using visual similarity and face recognition, chain-of-thought reasoning for explicit narrative modeling, and a grounding scheme that links textual elements to visual entities across multiple frames. We establish baseline performance by fine-tuning Qwen2.5-VL 7B, creating Qwen Storyteller, which performs end-to-end object detection, re-identification, and landmark detection while maintaining consistent object references throughout the story. Evaluation demonstrates a reduction from 4.06 to 3.56 (-12.3%) hallucinations on average per story when compared to a non-fine-tuned model.
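
The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `relative_change` and the variable names are hypothetical, and the images-per-story value is a simple mean derived here from the two dataset counts, not a number reported in the abstract.

```python
def relative_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the abstract.
baseline_hallucinations = 4.06    # per story, non-fine-tuned model
fine_tuned_hallucinations = 3.56  # per story, Qwen Storyteller
num_images = 52_016
num_stories = 4_178

change_pct = relative_change(baseline_hallucinations, fine_tuned_hallucinations)

# Derived here (assumption: a plain mean; the abstract does not say how
# frames are distributed across stories).
mean_images_per_story = num_images / num_stories

print(f"Relative change in hallucinations: {change_pct:.1f}%")    # -12.3%
print(f"Mean images per story: {mean_images_per_story:.2f}")      # 12.45
```

The -12.3% in the abstract is therefore a relative reduction; the absolute reduction is 0.50 hallucinations per story.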

arXiv information

Authors	Daniel A. P. Oliveira, David Martins de Matos
Published	2025-05-15 13:42:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
