Data Contamination Issues in Brain-to-Text Decoding

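Summary

Decoding non-invasive cognitive signals into natural language has long been a goal in building practical brain-computer interfaces (BCIs).
Recent milestone studies have decoded cognitive signals such as functional magnetic resonance imaging (fMRI) and electroencephalography (EEG) into text in an open-vocabulary setting.
However, how to split datasets into training, validation, and test sets for cognitive signal decoding tasks remains contested.
This paper systematically analyzes current dataset splitting methods and finds that data contamination substantially inflates reported model performance.
First, leakage of test subjects' cognitive signals undermines the training of a robust encoder.
Second, the paper shows that leakage of text stimuli causes the auto-regressive decoder to memorize information from the test set.
The decoder therefore produces highly accurate text without truly understanding the cognitive signals.
To eliminate the effect of data contamination and fairly evaluate how well different models generalize, the authors propose a new splitting method for different types of cognitive datasets (e.g., fMRI, EEG).
They also evaluate state-of-the-art (SOTA) brain-to-text decoding models under the proposed splitting paradigm, providing baselines for further research.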
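
The two leakage paths named above (test subjects' signals and test text stimuli reaching the training set) can be made concrete with a small sketch. This is not the paper's splitting method, which the abstract does not spell out. It is one strict variant, assumed here for illustration, that holds out both a set of subjects and a set of stimuli and discards any sample that shares only one of the two with the test set. The `Sample` record and the `contamination_free_split` function are hypothetical names.

```python
import random
from collections import namedtuple

# Hypothetical record: one recording of one subject exposed to one text stimulus.
Sample = namedtuple("Sample", ["subject", "stimulus", "signal"])


def contamination_free_split(samples, test_frac=0.2, seed=0):
    """Split so that no test subject and no test stimulus occurs in training.

    Illustrative assumption, not the method proposed in the paper.
    """
    rng = random.Random(seed)
    subjects = sorted({s.subject for s in samples})
    stimuli = sorted({s.stimulus for s in samples})
    rng.shuffle(subjects)
    rng.shuffle(stimuli)

    # Hold out at least one subject and at least one stimulus.
    test_subjects = set(subjects[: max(1, int(len(subjects) * test_frac))])
    test_stimuli = set(stimuli[: max(1, int(len(stimuli) * test_frac))])

    train, test, dropped = [], [], []
    for s in samples:
        held_subject = s.subject in test_subjects
        held_stimulus = s.stimulus in test_stimuli
        if held_subject and held_stimulus:
            test.append(s)
        elif not held_subject and not held_stimulus:
            train.append(s)
        else:
            # Shares either a subject or a stimulus with the test set,
            # so keeping it in training would leak information.
            dropped.append(s)
    return train, test, dropped


# Toy data: 5 subjects x 10 stimuli, with placeholder signals.
samples = [
    Sample(f"subj{i}", f"text{j}", signal=None)
    for i in range(5)
    for j in range(10)
]
train, test, dropped = contamination_free_split(samples)
```

The price of this strictness is the discarded samples. A random per-sample split keeps all of them, but it lets the same sentence and the same subject appear on both sides, which is the contamination the summary describes.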

Summary (original)

Decoding non-invasive cognitive signals to natural language has long been the goal of building practical brain-computer interfaces (BCIs). Recent major milestones have successfully decoded cognitive signals like functional Magnetic Resonance Imaging (fMRI) and electroencephalogram (EEG) into text under open vocabulary setting. However, how to split the datasets for training, validating, and testing in cognitive signal decoding task still remains controversial. In this paper, we conduct systematic analysis on current dataset splitting methods and find the existence of data contamination largely exaggerates model performance. Specifically, first we find the leakage of test subjects’ cognitive signals corrupts the training of a robust encoder. Second, we prove the leakage of text stimuli causes the auto-regressive decoder to memorize information in test set. The decoder generates highly accurate text not because it truly understands cognitive signals. To eliminate the influence of data contamination and fairly evaluate different models’ generalization ability, we propose a new splitting method for different types of cognitive datasets (e.g. fMRI, EEG). We also test the performance of SOTA Brain-to-Text decoding models under the proposed dataset splitting paradigm as baselines for further research.

arXiv information

Authors	Congchi Yin,Qian Yu,Zhiwei Fang,Jie He,Changping Peng,Zhangang Lin,Jingping Shao,Piji Li
Published	2023-12-26 13:29:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
