Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages

要約

言語識別（LI）は、さまざまな自然言語処理タスクにとって重要であり、感情分析、機械翻訳、情報検索などのアプリケーションの基本的なステップとして機能します。
インドのような多言語社会、特にソーシャルメディアに参加する若者の間では、テキストがコード混合を示すことが多く、地元の言語と英語を異なる言語レベルでブレンドします。
この現象は、特に言語が単一の単語内で混ざり合う場合、LIシステムに手ごわい課題を提示します。
インド南部で流行しているドラヴィダ語は、豊富な形態学的構造を持っているが、デジタルプラットフォームでの過小評価に苦しんでおり、コミュニケーションのためにローマまたはハイブリッドスクリプトの採用につながります。
このペーパーでは、Dravidian言語での単語レベルのLIの課題に対処することを目的とした共有タスクの迅速な方法を紹介します。
この作業では、GPT-3.5ターボを活用して、大規模な言語モデルが単語を正しいカテゴリに正しく分類できるかどうかを理解しました。
私たちの調査結果は、カンナダモデルがほとんどのメトリックでタミルモデルを常に上回っており、カンナダ語のインスタンスを特定して分類する際の精度と信頼性が高いことを示していることを示しています。
対照的に、タミル語モデルは中程度のパフォーマンスを示し、特に精度とリコールの改善が必要です。

要約(オリジナル)

Language Identification (LI) is crucial for various natural language processing tasks, serving as a foundational step in applications such as sentiment analysis, machine translation, and information retrieval. In multilingual societies like India, particularly among the youth engaging on social media, text often exhibits code-mixing, blending local languages with English at different linguistic levels. This phenomenon presents formidable challenges for LI systems, especially when languages intermingle within single words. Dravidian languages, prevalent in southern India, possess rich morphological structures yet suffer from under-representation in digital platforms, leading to the adoption of Roman or hybrid scripts for communication. This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages. In this work, we leveraged GPT-3.5 Turbo to understand whether the large language models is able to correctly classify words into correct categories. Our findings show that the Kannada model consistently outperformed the Tamil model across most metrics, indicating a higher accuracy and reliability in identifying and categorizing Kannada language instances. In contrast, the Tamil model showed moderate performance, particularly needing improvement in precision and recall.

arxiv情報

著者	Aniket Deroy,Subhankar Maity
発行日	2025-03-12 16:57:27+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー