Dakshina dataset

Name: Dakshina dataset
Creator: Google Research
Published: 2020-07-02 22:57:28
License: No description available

Source: arXiv; updated 2020-07-02; indexed 2024-06-21

Download link:

https://github.com/google-research-datasets/dakshina

Resource overview:

The Dakshina dataset is a resource created by Google Research that contains text for 12 South Asian languages. For each language it provides native-script Wikipedia text, a romanization lexicon, and parallel full-sentence data in both the native script and the basic Latin alphabet. The dataset was built in three steps: preparing and selecting Wikipedia text for each language, collecting romanizations for a sampled lexicon, and manually romanizing sentences held out from the native-script collections. It also reports baseline results for several tasks that the data supports: single-word romanization, full-sentence romanization, and language modeling of native-script and romanized text. The dataset is intended as publicly available data for training and validating models across the region's diverse languages.
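As an illustration of how a romanization lexicon of this kind might be consumed, the sketch below groups romanizations by native-script word and picks the most frequently attested one. The three-column tab-separated layout (native word, romanization, count) is an assumption made for this example and should be checked against the files in the repository; the sample rows are invented, and the function names `parse_lexicon` and `most_attested` are hypothetical.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Group romanizations by native-script word.

    Assumes each line is `native<TAB>romanization<TAB>count`.
    This layout is an assumption for illustration only.
    """
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        native, roman, count = line.split("\t")
        lexicon[native].append((roman, int(count)))
    return dict(lexicon)


def most_attested(lexicon):
    """Return the romanization with the highest count for each word."""
    return {
        native: max(entries, key=lambda entry: entry[1])[0]
        for native, entries in lexicon.items()
    }


# Invented sample rows, not taken from the dataset.
sample = [
    "नमस्ते\tnamaste\t3",
    "नमस्ते\tnamastey\t1",
    "भारत\tbharat\t2",
]

lexicon = parse_lexicon(sample)
best = most_attested(lexicon)
```

To read a real file, the same function could be passed an open file handle, for example `parse_lexicon(open(path, encoding="utf-8"))`, where `path` points to a lexicon file downloaded from the repository linked above.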

Provider:

Google Research

Created:

2020-07-02
