Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

Summary

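Recognizing traffic accidents is an essential part of autonomous driving and road monitoring systems.
Accidents can take many forms, and knowing which type of accident is occurring can help prevent it from happening again.
This work focuses on the task of classifying a traffic scene as a specific type of accident.
We approach the problem by treating a traffic scene as a graph, in which objects such as cars are nodes and the relative distances and directions between objects are edges.
This representation of an accident is called a scene graph and serves as the input to an accident classifier.
A classifier that fuses the scene graph input with representations from vision and language yields better results.
This work introduces a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with the vision and language modalities for accident classification.
When trained on 4 classes, our method achieves a balanced accuracy of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, an increase of close to 5 percentage points over the case where scene graph information is not taken into account.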
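
To make the scene graph idea concrete, here is a minimal sketch of how one frame could be encoded with objects as nodes and relative distance and direction as edges. It is an illustration under assumptions, not the authors' pipeline: the `SceneObject` fields, the ground-plane coordinates in metres, and the choice to express direction as an angle in degrees are all hypothetical.

```python
import math
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class SceneObject:
    """A detected object; these fields are illustrative, not from the paper."""
    name: str
    category: str
    x: float  # assumed ground-plane position in metres
    y: float


def build_scene_graph(objects):
    """Return (nodes, edges) for one frame.

    nodes maps an object name to its category. edges maps an ordered pair
    (source, target) to the distance and direction of the target as seen
    from the source.
    """
    nodes = {obj.name: obj.category for obj in objects}
    edges = {}
    for src, dst in permutations(objects, 2):
        dx, dy = dst.x - src.x, dst.y - src.y
        edges[(src.name, dst.name)] = {
            "distance": math.hypot(dx, dy),
            # Angle in degrees, counter-clockwise from the +x axis, in [0, 360).
            "direction_deg": math.degrees(math.atan2(dy, dx)) % 360.0,
        }
    return nodes, edges


# Hypothetical frame with three objects.
frame = [
    SceneObject("ego", "car", 0.0, 0.0),
    SceneObject("car_1", "car", 3.0, 4.0),
    SceneObject("ped_1", "pedestrian", 0.0, -2.0),
]
nodes, edges = build_scene_graph(frame)
```

Each ordered pair gets its own edge because direction is not symmetric: the angle from `ego` to `car_1` differs from the angle from `car_1` to `ego`, even though the distance is the same.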

Abstract (original)

Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from reoccurring. The task of being able to classify a traffic scene as a specific type of accident is the focus of this work. We approach the problem by likening a traffic scene to a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of an accident can be referred to as a scene graph, and is used as input for an accident classifier. Better results can be obtained with a classifier that fuses the scene graph input with representations from vision and language. This work introduces a multi-stage, multimodal pipeline to pre-process videos of traffic accidents, encode them as scene graphs, and align this representation with vision and language modalities for accident classification. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.
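
The reported metric is balanced accuracy, which is commonly defined as the mean of per-class recall and is often preferred over plain accuracy when class sizes differ. The sketch below shows how the two metrics can diverge on an unbalanced 4-class set. The labels are made up for illustration and are not data or results from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical labels: class 0 dominates, classes 2 and 3 have one sample each.
y_true = [0] * 6 + [1, 1] + [2] + [3]
y_pred = [0] * 6 + [1, 0] + [0] + [3]

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

In this toy case the per-class recalls are 6/6, 1/2, 0/1, and 1/1, so balanced accuracy is 0.625 while plain accuracy is 0.8: the dominant class hides the errors on the rare classes.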

arXiv information

Authors	Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari
Published	2024-07-08 13:15:11+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
