udmurtNLP/flores-250-rus-udm

Name: udmurtNLP/flores-250-rus-udm
Creator: udmurtNLP
Published: 2023-09-28 16:31:33
License: not specified

Source: Hugging Face
Updated: 2023-09-28
Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/udmurtNLP/flores-250-rus-udm


Overview:

Russian and Udmurt sentences from the FLORES-250 dataset. Compared with the original FLORES-250, one sentence in the Russian version has been changed. The dataset consists of Russian-Udmurt sentence pairs intended for natural language processing tasks such as machine translation.

Provider:

udmurtNLP

Summary of original information

FLORES-250, Russian and Udmurt sentences

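Dataset overview

Config name: default
Data file path: data/sentences-*
Features:
- Name: rus
  - Data type: string
- Name: udm
  - Data type: string
Splits:
- Name: sentences
  - Size: 129728 bytes
  - Number of examples: 250
Download size: 72479 bytes
Dataset size: 129728 bytes
Language: udm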
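
The metadata above can double as a quick sanity check after download. The following is a minimal sketch, not part of the dataset or of any library: `check_rows` is a hypothetical helper, it assumes each row is a dict keyed by feature name (which is how the `datasets` library yields rows), and the non-empty-string check is an extra assumption beyond what the metadata states.

```python
EXPECTED_COLUMNS = {"rus", "udm"}  # feature names listed in the overview
EXPECTED_ROWS = 250                # example count of the "sentences" split


def check_rows(rows, expected_rows=EXPECTED_ROWS):
    """Return a list of problems found in `rows`; an empty list means OK."""
    problems = []
    rows = list(rows)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            problems.append(f"row {i}: unexpected columns {sorted(row)}")
            continue
        for name in sorted(EXPECTED_COLUMNS):
            # Assumption: every sentence should be a non-empty string.
            if not isinstance(row[name], str) or not row[name].strip():
                problems.append(f"row {i}: {name!r} is not a non-empty string")
    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to see every mismatch in one pass.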

Usage

from datasets import load_dataset

# Returns a DatasetDict whose only split is named "sentences"
dataset = load_dataset("udmurtNLP/flores-250-rus-udm")
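
For machine translation evaluation it is often convenient to hold the source and reference sentences as two aligned lists. The sketch below shows one way to do that; `to_parallel_lists` and `load_parallel_lists` are hypothetical helper names, and treating Russian as the source side is an assumption that can be swapped. The `split="sentences"` argument matches the split name given in the dataset overview.

```python
def to_parallel_lists(rows):
    """Split rows with "rus" and "udm" fields into two aligned lists."""
    sources, references = [], []
    for row in rows:
        sources.append(row["rus"])
        references.append(row["udm"])
    return sources, references


def load_parallel_lists():
    """Download the dataset and return (Russian, Udmurt) sentence lists."""
    # Imported lazily so that to_parallel_lists works without the library.
    from datasets import load_dataset  # requires network access on first use

    dataset = load_dataset("udmurtNLP/flores-250-rus-udm", split="sentences")
    return to_parallel_lists(dataset)
```

Calling `load_parallel_lists()` should give two lists of equal length, where the sentences at the same index form a translation pair.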

Citation

@inproceedings{yankovskaya-etal-2023-machine,
    title = "Machine Translation for Low-resource {F}inno-{U}gric Languages",
    author = {Yankovskaya, Lisa and Tars, Maali and T{\"a}ttar, Andre and Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.77",
    pages = "762--771",
    abstract = "This paper focuses on neural machine translation (NMT) for low-resource Finno-Ugric languages. Our contributions are three-fold: (1) we extend existing and collect new parallel and monolingual corpora for 20 languages, (2) we expand the 200-language translation benchmark FLORES-200 with manual translations into nine new languages, and (3) we present experiments using the collected data to create NMT systems for the included languages and investigate the impact of back-translation data on the NMT performance for low-resource languages. Experimental results show that carefully selected limited amounts of back-translation directions yield the best results in terms of translation scores, for both high-resource and low-resource output languages.",
}
