From Pixels to Titles: Video Game Identification by Screenshots using Convolutional Neural Networks

Abstract

This paper investigates video game identification through single screenshots, utilizing ten convolutional neural network (CNN) architectures (VGG16, ResNet50, ResNet152, MobileNet, DenseNet169, DenseNet201, EfficientNetB0, EfficientNetB2, EfficientNetB3, and EfficientNetV2S) and three transformer architectures (ViT-B16, ViT-L32, and SwinT) across 22 home console systems, spanning from Atari 2600 to PlayStation 5, totalling 8,796 games and 170,881 screenshots. Except for VGG16, all CNNs outperformed the transformers in this task. Using ImageNet pre-trained weights as initial weights, EfficientNetV2S achieves the highest average accuracy (77.44%) and the highest accuracy in 16 of the 22 systems. DenseNet201 is the best in four systems and EfficientNetB3 is the best in the remaining two. Employing alternative initial weights fine-tuned on an arcade screenshot dataset boosts accuracy for the EfficientNet architectures, with EfficientNetV2S reaching a peak accuracy of 77.63% and the average number of epochs to convergence falling from 26.9 to 24.5. Overall, the combination of the optimal architecture and weights for each system attains 78.79% accuracy, primarily led by EfficientNetV2S, which is the best in 15 systems. These findings underscore the efficacy of CNNs in video game identification through screenshots.
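The "combination of optimal architecture and weights" figure can be read as follows: for each system, pick whichever (architecture, initial weights) pair scored highest, then average those per-system maxima. The sketch below illustrates that reading only. It assumes the overall figure is an unweighted mean over systems, which the abstract does not state explicitly, and every number, system name, and function name in it is a hypothetical placeholder rather than a result or identifier from the paper.

```python
from statistics import mean

# HYPOTHETICAL accuracies in percent, chosen for illustration only.
# These are NOT results from the paper.
# Layout: system -> {(architecture, initial_weights): accuracy}
accuracies = {
    "System A": {
        ("EfficientNetV2S", "imagenet"): 80.0,
        ("EfficientNetV2S", "arcade"): 81.0,
        ("DenseNet201", "imagenet"): 79.0,
    },
    "System B": {
        ("EfficientNetV2S", "imagenet"): 70.0,
        ("EfficientNetV2S", "arcade"): 69.0,
        ("DenseNet201", "imagenet"): 72.0,
    },
    "System C": {
        ("EfficientNetV2S", "imagenet"): 75.0,
        ("EfficientNetV2S", "arcade"): 74.0,
        ("DenseNet201", "imagenet"): 73.0,
    },
}


def best_per_system(table):
    """Map each system to its top ((architecture, weights), accuracy) entry."""
    return {
        system: max(results.items(), key=lambda item: item[1])
        for system, results in table.items()
    }


def mean_best_accuracy(table):
    """Unweighted mean of the per-system best accuracies (an assumption)."""
    return mean(acc for _config, acc in best_per_system(table).values())


def wins_by_architecture(table):
    """Count how many systems each architecture wins, ignoring the weights."""
    counts = {}
    for (arch, _weights), _acc in best_per_system(table).values():
        counts[arch] = counts.get(arch, 0) + 1
    return counts


def mean_for_config(table, config):
    """Unweighted mean accuracy of one fixed (architecture, weights) pair."""
    return mean(results[config] for results in table.values())
```

With these placeholder values, `mean_for_config(accuracies, ("EfficientNetV2S", "imagenet"))` is 75.0 while `mean_best_accuracy(accuracies)` is 76.0. Choosing the best pair per system can never score below any single fixed pair, which is consistent with the abstract reporting 78.79% for the combination against 77.44% and 77.63% for EfficientNetV2S alone.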

arXiv information

Author	Fabricio Breve
Published	2025-01-08 13:45:15+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
