1000-Language Web Text Corpus

Name: 1000-Language Web Text Corpus
Creator: Google Research
Published: 2020-10-29 23:18:35
License: No description available

Source: arXiv, updated 2020-10-29, indexed 2024-08-06

Download link:

http://arxiv.org/abs/2010.14571v2

Description:

The 1000-Language Web Text Corpus is a dataset created by a Google Research team to collect and curate text in many languages from the web. It contains at least 100,000 relatively clean sentences in each of more than 500 languages and is intended as a resource for technologies such as machine translation and automatic speech recognition. In building it, the team faced challenges in language identification accuracy, noisy data, and similarity between closely related languages. They addressed these with tunable-precision wordlist filtering and a semi-supervised language identification model, which improved the quality of the dataset. The dataset is primarily intended to support multilingual technologies, particularly for low-resource languages.
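
As an illustration of what wordlist filtering with adjustable precision could look like, here is a minimal Python sketch. It is not the authors' implementation: the function names, the toy English wordlist, the whitespace tokenization, and the threshold values are all assumptions made for this example. The idea shown is simply that a sentence is kept only if a large enough fraction of its tokens appears in a wordlist for the target language, and that raising the threshold trades recall for precision.

```python
# Hypothetical toy wordlist for the target language (illustration only).
WORDLIST = {"the", "cat", "sat", "on", "mat"}

# Hypothetical candidate sentences that a language identifier labeled as English.
SENTENCES = [
    "the cat sat on the mat",
    "le chat est sur le tapis",
    "the chat sat",
]


def wordlist_hit_rate(sentence, wordlist):
    """Return the fraction of whitespace-separated tokens found in the wordlist."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in wordlist)
    return hits / len(tokens)


def filter_sentences(sentences, wordlist, threshold=0.5):
    """Keep sentences whose wordlist hit rate is at least `threshold`.

    A higher threshold keeps fewer sentences but with higher precision.
    """
    return [s for s in sentences if wordlist_hit_rate(s, wordlist) >= threshold]


if __name__ == "__main__":
    for threshold in (0.5, 0.9):
        kept = filter_sentences(SENTENCES, WORDLIST, threshold)
        print(threshold, kept)
```

A real pipeline would need language-appropriate tokenization and much larger wordlists; the threshold here stands in for whatever tuning knob controls the precision of the filter.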

Provider:

Google Research

Created:

2020-10-28
