Wikipedia CJK Corpora

Name: Wikipedia CJK Corpora
Creator: Queensland University of Technology
License: No description available

Source: Research Data Australia (indexed 2024-12-21)

Download link:

https://researchdata.edu.au/wikipedia-cjk-corpora/14198

Description:

Wikipedia pages in different languages are rarely linked to one another, except through the cross-lingual links between pages about the same subject. Collected in June 2010, this data collection consists of 10 GB of tagged Chinese, Japanese and Korean articles, converted from Wikipedia to an XML structure by a multi-lingual adaptation of the YAWN system (see Related Information). The data were collected as part of the NII Test Collection for IR Systems (NTCIR) Project, which aims to enhance research in Information Access (IA) technologies, including information retrieval. Within that project, the collection supports work on cross-lingual link discovery, a way of automatically finding potential links between documents written in different languages. Through cross-lingual link discovery, users can find documents in languages they are familiar with, or in languages that offer a richer set of documents than their language of choice.

Provider:

Queensland University of Technology
