
prajdabre/KreolMorisienMT

Source: Hugging Face. Updated 2022-06-02. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/prajdabre/KreolMorisienMT
Resource description:
License: cc

MorisienMT is a dataset for Mauritian Creole machine translation. It consists of training, development and test splits for English--Creole and French--Creole translation. The data comes from a variety of sources and can therefore be considered general-domain.

The development and test sets consist of 500 and 1,000 sentences respectively, and both evaluation sets are trilingual. The English--Creole training set contains 21,810 lines and the French--Creole training set contains 15,239 lines. Additionally, one can extract a trilingual English-French-Creole training set of 13,861 lines using Creole as a pivot. Finally, we also provide a Creole monolingual corpus of 45,364 lines. Note that a significant portion of the dataset is a dictionary of word pairs/triplets; nevertheless, it is a start.

Usage:
1. Using Hugging Face datasets: load_dataset("prajdabre/MorisienMT", "en-cr", split="train")
2. Convert to Moses format: load the dataset as in step 1. Each item is a JSON object, so iterate over the loaded dataset object and read the "input" and "target" fields of each item to get the translation pairs.

Feel free to use the dataset for your research, but don't forget to attribute our upcoming paper, which will be uploaded to arXiv shortly.

Note: MorisienMT was originally partly developed by Dr Aneerav Sukhoo from the University of Mauritius in 2014, when he was a visiting researcher at IIT Bombay. Dr Sukhoo and I worked on the MT experiments together but never publicly released the dataset back then. Furthermore, the dataset splits and experiments were not done in the highly principled manner that is required today. Therefore, we improved the quality of the splits and are officially releasing the data for people to use.
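Provided by:
prajdabre
Summary of original information

MorisienMT dataset overview

Dataset type

MorisienMT is a dataset for Mauritian Creole machine translation, containing English--Creole and French--Creole translation data.

Dataset composition

  • Training sets:
    • English--Creole: 21,810 lines
    • French--Creole: 15,239 lines
    • Trilingual (English-French-Creole): 13,861 lines
  • Development set: 500 sentences
  • Test set: 1,000 sentences
  • Creole monolingual corpus: 45,364 lines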
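
The trilingual set is described as being extracted with Creole as a pivot, that is, by matching English--Creole and French--Creole pairs that share the same Creole side. The sketch below shows one way such a join could look. It assumes exact matching on the whitespace-stripped Creole string; the matching procedure actually used for the release is not described here, so this will not necessarily reproduce the 13,861-line figure. The function name pivot_triples and the toy word pairs are illustrative and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def pivot_triples(
    en_cr: Iterable[Tuple[str, str]],
    fr_cr: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Join (English, Creole) and (French, Creole) pairs on the Creole side.

    Assumption: two pairs match when their Creole strings are identical
    after stripping surrounding whitespace.
    """
    # Index the French sides by their Creole counterpart.
    fr_by_cr = defaultdict(list)
    for fr, cr in fr_cr:
        fr_by_cr[cr.strip()].append(fr)

    triples: List[Tuple[str, str, str]] = []
    seen = set()
    for en, cr in en_cr:
        key = cr.strip()
        for fr in fr_by_cr.get(key, []):
            triple = (en, fr, key)
            if triple not in seen:  # skip exact duplicate triples
                seen.add(triple)
                triples.append(triple)
    return triples


# Toy pairs for illustration only (not taken from the dataset).
en_cr = [("hello", "bonzour"), ("thank you", "mersi"), ("water", "dilo")]
fr_cr = [("bonjour", "bonzour"), ("merci", "mersi"), ("chien", "lisien")]
triples = pivot_triples(en_cr, fr_cr)
print(triples)
```

Only Creole entries present in both inputs yield a triple, which is consistent with the trilingual set being smaller than either bilingual training set.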

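Data sources

The data comes from a variety of sources and can be considered general-domain.

Data usage

  • The dataset can be loaded with Hugging Face datasets.
  • It can be converted to Moses format.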
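
The Moses conversion mentioned above amounts to writing the two sides of each pair to two line-aligned plain-text files. The following is a minimal sketch, assuming each item exposes "input" and "target" string fields as the resource description states. The helper name write_moses and the output file names are hypothetical, and which language sits in "input" for a given configuration is not stated on this page.

```python
import os
import tempfile
from typing import Dict, Iterable


def write_moses(items: Iterable[Dict[str, str]], src_path: str, tgt_path: str) -> int:
    """Write the "input"/"target" fields to two line-aligned text files.

    Returns the number of pairs written.
    """
    count = 0
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for item in items:
            # Collapse internal whitespace (including newlines) so that
            # each pair occupies exactly one line in each file.
            src.write(" ".join(item["input"].split()) + "\n")
            tgt.write(" ".join(item["target"].split()) + "\n")
            count += 1
    return count


# Toy stand-in for the loaded dataset (illustrative, not real data).
toy_items = [
    {"input": "hello", "target": "bonzour"},
    {"input": "thank you", "target": "mersi"},
]
out_dir = tempfile.mkdtemp()
src_file = os.path.join(out_dir, "train.src")
tgt_file = os.path.join(out_dir, "train.tgt")
n_written = write_moses(toy_items, src_file, tgt_file)
print(n_written)
```

With the real data, toy_items would be replaced by the object returned from the load_dataset call given in the resource description, which requires the datasets package and network access. Note that this page lists the repository as prajdabre/KreolMorisienMT while the usage text writes prajdabre/MorisienMT, so the identifier that actually resolves should be checked before running.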

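Notes

A significant portion of the dataset is a dictionary of word pairs/triplets; nevertheless, it is a start. When using the dataset, please cite the paper that is to be uploaded to arXiv.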
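
Because much of the data consists of dictionary entries rather than full sentences, it may be useful to separate the two before training or evaluation. The sketch below uses a simple token-count heuristic, which is an assumption and not part of the dataset or its documentation: an item is treated as a dictionary entry when both sides have at most max_tokens whitespace-separated tokens. The function name, the threshold and the toy items are all illustrative.

```python
from typing import Dict, Iterable, List, Tuple

Item = Dict[str, str]


def split_dictionary_entries(
    items: Iterable[Item], max_tokens: int = 2
) -> Tuple[List[Item], List[Item]]:
    """Split items into (sentences, dictionary_entries).

    Heuristic (an assumption): an item counts as a dictionary entry when
    both its "input" and "target" have at most max_tokens tokens.
    """
    sentences: List[Item] = []
    entries: List[Item] = []
    for item in items:
        n_input = len(item["input"].split())
        n_target = len(item["target"].split())
        if n_input <= max_tokens and n_target <= max_tokens:
            entries.append(item)
        else:
            sentences.append(item)
    return sentences, entries


# Toy items for illustration only (not taken from the dataset).
toy_items = [
    {"input": "water", "target": "dilo"},
    {"input": "I am going to the market", "target": "mo pe al bazar"},
]
sentences, entries = split_dictionary_entries(toy_items)
print(len(sentences), len(entries))
```

A short genuine sentence would be misclassified as a dictionary entry under this rule, so the threshold is worth checking against a sample of the real data.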
