Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Summary

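Recognizing traffic accidents is an essential part of autonomous driving and road monitoring systems.
Accidents can occur in many different forms, and knowing which type of accident is taking place can help prevent it from recurring.
This work focuses on classifying traffic scenes into specific accident types.
We approach the problem by representing a traffic scene as a graph, in which objects such as cars are nodes and the relative distances and directions between objects are edges.
This representation of a traffic scene is called a scene graph, and it can be used as input to an accident classifier.
A classifier that fuses the scene graph input with visual and textual representations yields better results.
This work introduces a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with the vision and language modalities before performing the classification task.
When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, an increase of close to 5 percentage points over the case where scene graph information is not taken into account.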
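
The scene graph idea described above (objects as nodes, relative distances and directions as edges) can be illustrated with a minimal sketch. This is not the authors' implementation: the `SceneObject` type, the ground-plane coordinates, the four coarse direction buckets, and the `max_distance` cutoff are all assumptions made here for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """A detected object in one frame (hypothetical structure, not from the paper)."""
    obj_id: int
    label: str
    x: float  # assumed ground-plane position, in metres
    y: float


def direction_label(dx: float, dy: float) -> str:
    """Bucket a displacement into a coarse direction; +y is assumed to mean 'ahead'."""
    if abs(dy) >= abs(dx):
        return "front" if dy >= 0 else "behind"
    return "right" if dx > 0 else "left"


def build_scene_graph(objects, max_distance=50.0):
    """Return (nodes, edges): nodes map id -> label, edges are directed triples.

    Each edge is (source id, target id, attributes), where the attributes hold
    the relative distance and direction of the target as seen from the source.
    The max_distance cutoff is an illustrative assumption.
    """
    nodes = {obj.obj_id: obj.label for obj in objects}
    edges = []
    for src in objects:
        for dst in objects:
            if src.obj_id == dst.obj_id:
                continue
            dx, dy = dst.x - src.x, dst.y - src.y
            distance = math.hypot(dx, dy)
            if distance <= max_distance:
                attrs = {"distance": distance, "direction": direction_label(dx, dy)}
                edges.append((src.obj_id, dst.obj_id, attrs))
    return nodes, edges


# Toy frame with made-up positions.
frame = [
    SceneObject(0, "ego_car", 0.0, 0.0),
    SceneObject(1, "car", 3.0, 4.0),
    SceneObject(2, "pedestrian", -10.0, 0.0),
]
nodes, edges = build_scene_graph(frame)
for src, dst, attrs in edges:
    print(nodes[src], "->", nodes[dst], attrs)
```

In a video setting, one such graph per frame would presumably be built from detector and tracker output; how the paper encodes and aligns these graphs with the vision and language modalities is not specified in the abstract.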

Summary (original)

Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from reoccurring. This work focuses on classification of traffic scenes into specific accident types. We approach the problem by representing a traffic scene as a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of a traffic scene is referred to as a scene graph, and can be used as input for an accident classifier. Better results are obtained with a classifier that fuses the scene graph input with visual and textual representations. This work introduces a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with vision and language modalities before executing the classification task. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.
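
Because the reported metric is balanced accuracy on an unbalanced subset, a small worked example may help show why it differs from plain accuracy. The sketch below assumes the common definition of balanced accuracy as the mean of per-class recall; the abstract does not state which definition or library the authors used, and the class names and labels are made up.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (assumed definition of balanced accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[cls] / total[cls] for cls in total]
    return sum(recalls) / len(recalls)


# Toy, unbalanced labels (hypothetical classes, not DoTA categories).
y_true = ["class_a"] * 8 + ["class_b"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["class_a"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                      # 0.8
print(balanced_accuracy(y_true, y_pred))   # 0.5
```

Under this definition, always predicting the majority class scores no better than chance on two classes, which is why the metric is a natural choice for an unbalanced evaluation set.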

arXiv information

Authors	Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari
Published	2024-12-17 20:14:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
