Wanca 2017 corpus
Source: arXiv | Updated: 2020-08-27 | Indexed: 2024-06-21
Download link:
https://www.kielipankki.fi/language-bank/
Description:
The Wanca 2017 corpus is a dataset created by the Department of Digital Humanities at the University of Helsinki to provide text resources for rare Uralic languages. It was built from 1,515,068 lines of text crawled from the Internet; after processing, 447,927 sentences in the relevant languages remained. Its creation involved downloading the texts, identifying their language, and removing duplicates. The corpus was used primarily in the Uralic Language Identification (ULI) 2020 shared task, which addresses the identification of Uralic languages in computational linguistics.
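The language identification and deduplication steps mentioned in the description can be illustrated with a minimal sketch. This is not the corpus authors' actual pipeline: the function names `filter_and_dedupe` and `toy_identify`, the labels `lang_a` and `lang_b`, and the sample strings are all hypothetical, and the toy identifier is only a stand-in for a real language identifier.

```python
from typing import Callable, Iterable, Iterator, Set


def filter_and_dedupe(
    lines: Iterable[str],
    identify: Callable[[str], str],
    target_langs: Set[str],
) -> Iterator[str]:
    """Yield each distinct line whose identified language is in target_langs.

    Hypothetical helper: whitespace is normalized before comparison, so
    lines that differ only in spacing count as duplicates.
    """
    seen: Set[str] = set()
    for raw in lines:
        text = " ".join(raw.split())  # collapse runs of whitespace
        if not text or text in seen:
            continue  # skip empty lines and exact repeats
        seen.add(text)
        if identify(text) in target_langs:
            yield text


def toy_identify(text: str) -> str:
    """Placeholder identifier (assumption): labels text by one marker character."""
    return "lang_a" if "đ" in text else "lang_b"


# Synthetic sample lines, not taken from the corpus.
sample = ["đ sample  one", "đ sample one", "plain sample two", "", "đ sample three"]
kept = list(filter_and_dedupe(sample, toy_identify, {"lang_a"}))
print(kept)  # ['đ sample one', 'đ sample three']
```

In practice `identify` would be replaced by a trained language identifier; passing it as a parameter keeps the filtering logic independent of that choice.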
Provider:
University of Helsinki
Created:
2020-08-27



