udmurtNLP/flores-250-rus-udm
FLORES-250, Russian and Udmurt sentences
Dataset overview
- Config name: default
- Data files path: data/sentences-*
- Features:
  - Name: rus
    Data type: string
  - Name: udm
    Data type: string
- Splits:
  - Name: sentences
    Size in bytes: 129728
    Number of examples: 250
- Download size: 72479 bytes
- Dataset size: 129728 bytes
- Language: udm
Usage
from datasets import load_dataset
dataset = load_dataset("udmurtNLP/flores-250-rus-udm")
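
Since the card lists a single `sentences` split with two string columns, `rus` and `udm`, aligned sentence pairs can be read row by row. The following is a minimal sketch under that assumption; the helper names `iter_pairs` and `load_pairs` are hypothetical and not part of the dataset or of the `datasets` library.

```python
from typing import Iterable, Iterator, Mapping, Tuple


def iter_pairs(rows: Iterable[Mapping[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (Russian, Udmurt) pairs from rows that carry `rus` and `udm` fields."""
    for row in rows:
        yield row["rus"], row["udm"]


def load_pairs(repo_id: str = "udmurtNLP/flores-250-rus-udm"):
    """Download the `sentences` split and return its pairs as a list.

    Hypothetical convenience wrapper; requires the `datasets` package
    and network access to the Hugging Face Hub.
    """
    from datasets import load_dataset  # imported lazily so iter_pairs works offline

    split = load_dataset(repo_id, split="sentences")
    return list(iter_pairs(split))
```

Calling `load_pairs()` should return one tuple per row; going by the metadata above, that would be 250 pairs. `iter_pairs` also accepts any iterable of dictionaries with the same two keys, which makes it easy to try out without downloading anything.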
Citation
@inproceedings{yankovskaya-etal-2023-machine,
    title = "Machine Translation for Low-resource {F}inno-{U}gric Languages",
    author = {Yankovskaya, Lisa and Tars, Maali and T{\"a}ttar, Andre and Fishel, Mark},
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.77",
    pages = "762--771",
    abstract = "This paper focuses on neural machine translation (NMT) for low-resource Finno-Ugric languages. Our contributions are three-fold: (1) we extend existing and collect new parallel and monolingual corpora for 20 languages, (2) we expand the 200-language translation benchmark FLORES-200 with manual translations into nine new languages, and (3) we present experiments using the collected data to create NMT systems for the included languages and investigate the impact of back-translation data on the NMT performance for low-resource languages. Experimental results show that carefully selected limited amounts of back-translation directions yield the best results in terms of translation scores, for both high-resource and low-resource output languages.",
}



