WiLI-2018

Name: WiLI-2018
Creator: Foundation for Gifted Informatics Students in Karlsruhe
Published: 2018-01-24 05:40:53
License: No description available

Source: arXiv. Updated: 2018-01-24. Indexed: 2024-06-21.

Download link:

https://doi.org/10.5281/zenodo.841984

Resource overview:

WiLI-2018 is a benchmark dataset for written language identification, created by Martin Thoma with support from the Foundation for Gifted Informatics Students in Karlsruhe. It contains 1,000 paragraphs taken from Wikipedia for each of 235 languages, 235,000 paragraphs in total. WiLI-2018 aims to advance research and development in language identification by providing a free, publicly available dataset. Each paragraph contains at least 140 Unicode code points and is assigned to exactly one of the 235 languages. The data can be used to train language identification models, to benchmark them, and to identify the language of text whose language is unknown. WiLI-2018 also emphasizes linguistic diversity: it ranges from widely used languages to minority languages, and it includes constructed languages such as Esperanto, Ido, and Interlingua as well as dead languages such as Latin. It excludes artificial languages such as HTML, XML, LaTeX, JSON, and Markdown.
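The properties stated above (a fixed number of paragraphs per language and a minimum length in Unicode code points) can be checked mechanically. The following is a minimal sketch, not part of the dataset's own tooling: the function names `summarize` and `is_balanced` are hypothetical, and it assumes the paragraphs and their language labels have already been read into two parallel Python lists.

```python
from collections import Counter


def summarize(paragraphs, labels, min_code_points=140):
    """Count paragraphs per label and list indices of too-short paragraphs.

    `paragraphs` and `labels` are parallel lists (an assumption of this
    sketch); `min_code_points` defaults to the 140 stated in the description.
    """
    if len(paragraphs) != len(labels):
        raise ValueError("paragraphs and labels must have the same length")
    counts = Counter(labels)
    # len() on a Python str counts Unicode code points, not bytes.
    too_short = [i for i, p in enumerate(paragraphs) if len(p) < min_code_points]
    return counts, too_short


def is_balanced(counts, n_languages=235, per_language=1000):
    """True if there are exactly n_languages labels with per_language items each."""
    return len(counts) == n_languages and all(
        c == per_language for c in counts.values()
    )


# Toy data for illustration only; these are not WiLI-2018 paragraphs.
toy_paragraphs = ["a" * 140, "é" * 139, "日" * 200]
toy_labels = ["lang_a", "lang_b", "lang_c"]
toy_counts, toy_short = summarize(toy_paragraphs, toy_labels)
```

On the toy data, only the second paragraph falls below the 140 code point threshold. For the real dataset, `is_balanced` with its default arguments corresponds to the stated 235 languages with 1,000 paragraphs each, that is 235 × 1,000 = 235,000 paragraphs.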

Provider:

Foundation for Gifted Informatics Students in Karlsruhe

Created:

2018-01-24

Collection summary

Dataset introduction