prajdabre/KreolMorisienMT

Name: prajdabre/KreolMorisienMT
Creator: prajdabre
Published: 2022-06-02 01:25:14
License: No description available

Hugging Face | Updated 2022-06-02 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/prajdabre/KreolMorisienMT

Resource overview:

License: cc

MorisienMT is a dataset for Mauritian Creole machine translation. It consists of training, development and test splits for English--Creole as well as French--Creole translation. The data comes from a variety of sources and can therefore be considered general-domain. The development and test sets consist of 500 and 1000 sentences respectively. Both evaluation sets are trilingual.

The training set for English--Creole contains 21,810 lines. The training set for French--Creole contains 15,239 lines. Additionally, one can extract a trilingual English-French-Creole training set of 13,861 lines using Creole as a pivot. Finally, we also provide a Creole monolingual corpus of 45,364 lines. Note that a significant portion of the dataset is a dictionary of word pairs/triplets; nevertheless, it is a start.

Usage: (TODO: beautify)
1. Using huggingface datasets: load_dataset("prajdabre/MorisienMT", "en-cr", split="train")
2. Convert to moses format: load the dataset as in step 1. Each item is a JSON object, so iterate over the loaded dataset object and use the keys "input" and "target" to get the source and target sides of each translation pair.

Feel free to use the dataset for your research, but don't forget to attribute our upcoming paper, which will be uploaded to arXiv shortly.

Note: MorisienMT was originally partly developed by Dr Aneerav Sukhoo from the University of Mauritius in 2014, when he was a visiting researcher at IIT Bombay. Dr Sukhoo and I worked on the MT experiments together, but never publicly released the dataset back then. Furthermore, the dataset splits and experiments were not done in the highly principled manner that is required in the present day. Therefore, we improve the quality of the splits and officially release the data for people to use.
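The two usage steps can be sketched roughly as follows. This is a minimal sketch, not the authors' script: it assumes that "moses format" means two line-aligned plain-text files, that "input" holds the English side and "target" the Creole side for the "en-cr" configuration, and that the output file names train.en and train.cr are acceptable (they are hypothetical). The dataset id is copied from the card text; the listing itself is named prajdabre/KreolMorisienMT, so the id may need adjusting.

```python
from typing import Iterable, Iterator, Mapping, Tuple


def to_moses_lines(items: Iterable[Mapping[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs taken from the "input" and "target" keys.

    Internal whitespace (including newlines) is collapsed so that every
    pair occupies exactly one line in each output file.
    """
    for item in items:
        src = " ".join(str(item["input"]).split())
        tgt = " ".join(str(item["target"]).split())
        yield src, tgt


def write_moses(items: Iterable[Mapping[str, str]], src_path: str, tgt_path: str) -> int:
    """Write two line-aligned files and return the number of pairs written."""
    count = 0
    with open(src_path, "w", encoding="utf-8") as f_src, \
         open(tgt_path, "w", encoding="utf-8") as f_tgt:
        for src, tgt in to_moses_lines(items):
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")
            count += 1
    return count


def export_en_cr_train() -> int:
    """Step 1 + step 2 for the English--Creole training split.

    Requires the `datasets` package and network access. The dataset id is
    the one quoted in the card; the output file names are hypothetical.
    """
    from datasets import load_dataset

    train = load_dataset("prajdabre/MorisienMT", "en-cr", split="train")
    return write_moses(train, "train.en", "train.cr")
```

Calling export_en_cr_train() should leave train.en and train.cr side by side, with line i of one file corresponding to line i of the other. Collapsing whitespace is a design choice made here to protect line alignment; it is not something the card prescribes.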

Provided by:

prajdabre

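Summary of original information

MorisienMT dataset overview

Dataset type

MorisienMT is a dataset for Mauritian Creole machine translation, containing English--Creole and French--Creole translation data.

Dataset composition

Training set:
- English--Creole: 21,810 lines
- French--Creole: 15,239 lines
- Trilingual (English-French-Creole): 13,861 lines
Development set: 500 sentences
Test set: 1000 sentences
Creole monolingual corpus: 45,364 lines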
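The trilingual training set is described as being extracted with Creole as a pivot. The exact procedure is not documented here, so the following is only a sketch of one plausible approach: join the English--Creole and French--Creole pairs on an exact match of the (whitespace-stripped) Creole side. The function name pivot_triples and the tuple layouts are hypothetical, and this join is not guaranteed to reproduce the 13,861-line figure.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def pivot_triples(
    en_cr_pairs: Iterable[Tuple[str, str]],
    fr_cr_pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Build (english, french, creole) triples by joining on the Creole side.

    en_cr_pairs: (english, creole) pairs.
    fr_cr_pairs: (french, creole) pairs.
    Matching is an exact string match after stripping surrounding
    whitespace; this is an assumption, not the authors' documented method.
    """
    # Index every French sentence by the Creole sentence it is paired with.
    french_by_creole = defaultdict(list)
    for fr, cr in fr_cr_pairs:
        french_by_creole[cr.strip()].append(fr.strip())

    triples: List[Tuple[str, str, str]] = []
    seen = set()
    for en, cr in en_cr_pairs:
        key = cr.strip()
        for fr in french_by_creole.get(key, []):
            triple = (en.strip(), fr, key)
            if triple not in seen:  # drop exact duplicate triples
                seen.add(triple)
                triples.append(triple)
    return triples
```

With pairs read from the "input" and "target" fields of the two bilingual training sets, the result is a list of triples in the order of the English--Creole data. A stricter or looser match (for example, case folding or punctuation normalization) would change the number of triples found.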

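Data sources

The data comes from a variety of sources and belongs to the general domain.

Data usage

The dataset can be loaded with huggingface datasets.
It can be converted to moses format for use.

Notes

A significant portion of the dataset is a dictionary of word pairs/triplets; it is nevertheless a starting point. When using the dataset, please cite the paper that is to be uploaded to arXiv.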
