Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks


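Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in text.
This paper proposes Wikipedia Hyperlink-based Location Linking (WHLL), a new method for constructing a large-scale geoparsing corpus from Wikipedia articles.
WHLL uses Wikipedia hyperlinks to annotate multiple location expressions with coordinates.
The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions.
45.6% of the location expressions are ambiguous, meaning the same notation refers to more than one location.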


Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. A geoparser must handle ambiguous expressions, where the same notation can indicate multiple locations. Several corpora have been proposed in previous work for evaluating geoparsing systems. However, these corpora are small and offer limited coverage of location expressions in general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, the location expression in the article title and the location expressions hyperlinked to other articles are assigned coordinates. By utilizing hyperlinks, we can accurately assign coordinates to location expressions even when they are ambiguous in the text. Experimental results show that there remains room for improvement in disambiguating location expressions.
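
The hyperlink-based annotation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `annotate` and `ambiguous_surfaces` functions, the `[[Target|surface]]` wiki-link input format, and the `coords_by_title` lookup table are all assumptions made for illustration, and the coordinates are approximate. The sketch covers only hyperlinked expressions, not the article-title case.

```python
import re
from collections import defaultdict

# Matches [[Target]] or [[Target|surface text]] wiki links (assumed input format).
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext, coords_by_title):
    """Strip wiki links and return (plain_text, annotations).

    An annotation is emitted only when the link target has known coordinates
    in coords_by_title (a hypothetical title -> (lat, lon) table).
    """
    parts, annotations = [], []
    pos = 0      # read position in the wikitext
    out_len = 0  # length of the plain text produced so far
    for m in WIKILINK.finditer(wikitext):
        before = wikitext[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        target = m.group(1).strip()
        surface = m.group(2) or target  # displayed text defaults to the target
        parts.append(surface)
        if target in coords_by_title:
            annotations.append({
                "start": out_len,
                "end": out_len + len(surface),
                "surface": surface,
                "target": target,
                "coords": coords_by_title[target],
            })
        out_len += len(surface)
        pos = m.end()
    parts.append(wikitext[pos:])
    return "".join(parts), annotations


def ambiguous_surfaces(annotations):
    """Return surface forms that link to more than one distinct target."""
    targets = defaultdict(set)
    for a in annotations:
        targets[a["surface"]].add(a["target"])
    return {s for s, t in targets.items() if len(t) > 1}


# Approximate coordinates, for illustration only.
COORDS = {"Paris": (48.86, 2.35), "Paris, Texas": (33.66, -95.55)}
text = "She moved from [[Paris]] to [[Paris, Texas|Paris]]."
plain, anns = annotate(text, COORDS)
print(plain)
print(ambiguous_surfaces(anns))
```

Here both mentions read "Paris" in the plain text, but the link targets differ, so each mention receives the coordinates of its own target. This is the sense in which hyperlinks can resolve an expression that is ambiguous on the surface.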


Authors: Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai, Taichi Nishimura, Shinsuke Mori
Published: 2024-03-25 07:08:13+00:00

Category: cs.CL