Token-free Models for Sarcasm Detection

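Summary

Tokenization is a basic step in most natural language processing (NLP) pipelines, but it brings problems such as vocabulary mismatch and out-of-vocabulary words. Recent work shows that models operating directly on raw text at the byte or character level can ease these limitations. This paper evaluates two token-free models, ByT5 and CANINE, on sarcasm detection in both a social media domain (Twitter) and a non-social-media domain (news headlines). The models are fine-tuned and benchmarked against token-based baselines and state-of-the-art approaches. ByT5-small and CANINE outperform their token-based counterparts and reach state-of-the-art performance, improving accuracy by 0.77% and 0.49% on the News Headlines and Twitter Sarcasm datasets, respectively. These results highlight the potential of token-free models for robust NLP in noisy, informal domains such as social media.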

Abstract (original)

Tokenization is a foundational step in most natural language processing (NLP) pipelines, yet it introduces challenges such as vocabulary mismatch and out-of-vocabulary issues. Recent work has shown that models operating directly on raw text at the byte or character level can mitigate these limitations. In this paper, we evaluate two token-free models, ByT5 and CANINE, on the task of sarcasm detection in both social media (Twitter) and non-social media (news headlines) domains. We fine-tune and benchmark these models against token-based baselines and state-of-the-art approaches. Our results show that ByT5-small and CANINE outperform token-based counterparts and achieve new state-of-the-art performance, improving accuracy by 0.77% and 0.49% on the News Headlines and Twitter Sarcasm datasets, respectively. These findings underscore the potential of token-free models for robust NLP in noisy and informal domains such as social media.
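To make "operating directly on raw text at the byte or character level" concrete, here is a minimal Python sketch of the two input views. It is an illustration under stated assumptions, not the paper's code and not the models' actual tokenizers: the function names are hypothetical, and the default offset of 3 is an assumption modeled on the ByT5 convention of reserving the lowest ids for special tokens. The character-level view simply uses Unicode code points, which is the idea CANINE builds on.

```python
from typing import List


def byte_level_ids(text: str, offset: int = 3) -> List[int]:
    """Hypothetical helper: map text to UTF-8 byte values shifted by `offset`.

    The offset leaves room for special tokens (assumed here to be 3 ids,
    following the ByT5 convention). No vocabulary lookup is involved.
    """
    return [b + offset for b in text.encode("utf-8")]


def char_level_ids(text: str) -> List[int]:
    """Hypothetical helper: map text to Unicode code points, one id per character."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    # Noisy, informal input of the kind found on social media.
    sample = "yeah riiight \U0001F644"
    print(len(byte_level_ids(sample)), "byte-level ids")
    print(len(char_level_ids(sample)), "character-level ids")
```

Neither view can produce an out-of-vocabulary item: elongated spellings and emoji are just longer sequences of bytes or code points. The trade-off is sequence length, since a single emoji takes four UTF-8 bytes but only one code point.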

arXiv information

Authors	Sumit Mamtani, Maitreya Sonawane, Kanika Agarwal, Nishanth Sanjeev
Published	2025-05-02 05:04:41+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
