
udmurtNLP/flores-250-rus-udm

Source: Hugging Face. Updated: 2023-09-28. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/udmurtNLP/flores-250-rus-udm
Description:
Russian and Udmurt sentences from the FLORES-250 dataset. One sentence in the Russian version differs from the original FLORES-250. The dataset consists of Russian-Udmurt sentence pairs intended for natural language processing tasks such as machine translation.
Provider:
udmurtNLP
Summary of original information

FLORES-250, Russian and Udmurt sentences

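Dataset overview

  • Config name: default
  • Data file path: data/sentences-*
  • Features:
    • Name: rus
      • Data type: string
    • Name: udm
      • Data type: string
  • Splits:
    • Name: sentences
      • Size: 129728 bytes
      • Number of examples: 250
  • Download size: 72479 bytes
  • Dataset size: 129728 bytes
  • Language: udm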

Usage

from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub (requires the `datasets` package)
dataset = load_dataset("udmurtNLP/flores-250-rus-udm")
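
As a minimal sketch of how the loaded data might be turned into parallel sentence pairs: the split name `sentences` and the field names `rus` and `udm` come from the dataset overview above, while the helper names `to_pairs` and `load_pairs` are hypothetical and not part of the dataset or the `datasets` library.

```python
from typing import Iterable, Mapping


def to_pairs(rows: Iterable[Mapping[str, str]]) -> list[tuple[str, str]]:
    """Collect (Russian, Udmurt) sentence pairs from rows with `rus` and `udm` fields."""
    return [(row["rus"], row["udm"]) for row in rows]


def load_pairs() -> list[tuple[str, str]]:
    """Hypothetical convenience wrapper: download the split and return its pairs."""
    # Imported lazily so `to_pairs` can be used without the `datasets` package.
    from datasets import load_dataset

    # "sentences" is the only split listed in the dataset overview.
    split = load_dataset("udmurtNLP/flores-250-rus-udm", split="sentences")
    return to_pairs(split)
```

Calling `load_pairs()` needs network access and should yield one pair per example in the `sentences` split; `to_pairs` also works on any iterable of dictionaries with the same two keys, which can be handy for offline testing.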

Citation

@inproceedings{yankovskaya-etal-2023-machine,
    title = "Machine Translation for Low-resource {F}inno-{U}gric Languages",
    author = {Yankovskaya, Lisa and Tars, Maali and T{\"a}ttar, Andre and Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.77",
    pages = "762--771",
    abstract = "This paper focuses on neural machine translation (NMT) for low-resource Finno-Ugric languages. Our contributions are three-fold: (1) we extend existing and collect new parallel and monolingual corpora for 20 languages, (2) we expand the 200-language translation benchmark FLORES-200 with manual translations into nine new languages, and (3) we present experiments using the collected data to create NMT systems for the included languages and investigate the impact of back-translation data on the NMT performance for low-resource languages. Experimental results show that carefully selected limited amounts of back-translation directions yield the best results in terms of translation scores, for both high-resource and low-resource output languages.",
}
