WiLI-2018
Source: arXiv. Updated: 2018-01-24. Indexed: 2024-06-21.
Download link:
https://doi.org/10.5281/zenodo.841984
Resource description:
WiLI-2018 is a benchmark dataset for written language identification, created by Martin Thoma with support from the Foundation for Gifted Informatics Students in Karlsruhe. It contains 1,000 paragraphs drawn from Wikipedia for each of 235 languages, 235,000 paragraphs in total. Each paragraph contains at least 140 Unicode code points and is labeled with exactly one of the 235 languages. The dataset is publicly and freely available and is intended to support research on language identification: it can be used to train language identification models, to benchmark them, and to identify unknown languages. WiLI-2018 also emphasizes linguistic diversity. It ranges from widely used languages to minority languages and includes constructed languages such as Esperanto, Ido, and Interlingua, as well as dead languages such as Latin. Artificial languages such as HTML, XML, LaTeX, JSON, and Markdown are excluded.
Provided by:
Foundation for Gifted Informatics Students in Karlsruhe
Created:
2018-01-24
Collected summary
Dataset introduction

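Background and challenges
Background overview
WiLI-2018 is a benchmark dataset for language identification containing 235,000 paragraphs in 235 languages. It has a balanced class distribution and a predefined train-test split, and is suited to research on multilingual text classification.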
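
The two structural properties described for WiLI-2018, a balanced number of paragraphs per language and a minimum of 140 Unicode code points per paragraph, can be checked with a few lines of Python. The following is a minimal sketch, not part of the dataset's tooling: the function names, the sample paragraphs, and the label codes are all hypothetical, and loading the real paragraphs and labels from the downloaded archive is left out because the file layout is not described here.

```python
from collections import Counter

MIN_CODE_POINTS = 140  # minimum paragraph length stated in the description


def check_split(paragraphs, labels, min_len=MIN_CODE_POINTS):
    """Return (per-label counts, indices of paragraphs shorter than min_len)."""
    if len(paragraphs) != len(labels):
        raise ValueError("paragraphs and labels must have the same length")
    counts = Counter(labels)
    # len() on a Python str counts Unicode code points, not bytes.
    too_short = [i for i, p in enumerate(paragraphs) if len(p) < min_len]
    return counts, too_short


def is_balanced(counts):
    """True when every label occurs the same number of times."""
    return len(set(counts.values())) <= 1


# Tiny hypothetical sample standing in for real paragraphs and labels.
sample_paragraphs = ["a" * 140, "ä" * 150, "b" * 139, "日" * 140]
sample_labels = ["eng", "deu", "eng", "jpn"]  # hypothetical label codes
counts, too_short = check_split(sample_paragraphs, sample_labels)
```

Run on a full split of the real data, `is_balanced(counts)` would be expected to hold and `too_short` to be empty if the description above is accurate. Note that a code-point count is not a count of user-perceived characters: a base letter followed by a combining mark counts as two code points.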
The above content was collected and summarized by 遇见数据集.



