Speakers Fill Lexical Semantic Gaps with Context




Lexical ambiguity is widespread in language, allowing for the reuse of economical word forms and therefore making language more efficient. If ambiguous words cannot be disambiguated from context, however, this gain in efficiency might make language less clear — resulting in frequent miscommunication. For a language to be clear and efficiently encoded, we posit that the lexical ambiguity of a word type should correlate with how much information context provides about it, on average. To investigate whether this is the case, we operationalise the lexical ambiguity of a word as the entropy of meanings it can take, and provide two ways to estimate this — one which requires human annotation (using WordNet), and one which does not (using BERT), making it readily applicable to a large number of languages. We validate these measures by showing that, on six high-resource languages, there are significant Pearson correlations between our BERT-based estimate of ambiguity and the number of synonyms a word has in WordNet (e.g. $\rho = 0.40$ in English). We then test our main hypothesis — that a word’s lexical ambiguity should negatively correlate with its contextual uncertainty — and find significant correlations on all 18 typologically diverse languages we analyse. This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
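As a rough illustration of the quantities named in the abstract, the sketch below computes the entropy of a word's meanings from a list of sense labels and the Pearson correlation between two lists of per-word scores. It is a minimal sketch and not the authors' estimator: the function names `sense_entropy` and `pearson` are hypothetical, the sense labels and scores are made-up toy values, and the WordNet-based and BERT-based estimation procedures from the paper are not reproduced here.

```python
import math
from collections import Counter


def sense_entropy(sense_labels):
    """Shannon entropy, in bits, of the empirical distribution over sense labels.

    `sense_labels` is a hypothetical list with one sense label per observed
    use of a word (for example, from a sense-annotated corpus).
    """
    counts = Counter(sense_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (invented for illustration only, not taken from the paper).
toy_senses = {
    "bank": ["finance", "finance", "river", "finance"],
    "run": ["move", "operate", "flow", "campaign"],
    "oxygen": ["element", "element", "element", "element"],
}
ambiguity = {word: sense_entropy(labels) for word, labels in toy_senses.items()}

# Hypothetical second measurement per word, in the same word order.
other_score = [0.9, 2.1, 0.1]
r = pearson(list(ambiguity.values()), other_score)
print(ambiguity, r)
```

A word whose uses all share one sense gets entropy 0, while a word whose uses are spread evenly over four senses gets 2 bits, which matches the intuition that more evenly used meanings make a word more ambiguous.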


Authors: Tiago Pimentel, Rowan Hall Maudslay, Damián Blasi, Ryan Cotterell
Published: 2024-05-28 16:00:24+00:00
