kreyol-mt
ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-09-13.
Download link: https://modelscope.cn/datasets/jhu-clsp/kreyol-mt

# Kreyòl-MT


Welcome to our public data repository!
Please download data for any language pair via the command `load_dataset("jhu-clsp/kreyol-mt", "<language-pair-name>")`. For example:
```
from datasets import load_dataset
data = load_dataset("jhu-clsp/kreyol-mt", "acf-eng")
```
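
If you need more than one pair, a small wrapper can help. The following is a minimal sketch, not part of the official release: the helper names `pair_config`, `list_pairs`, and `load_pair` are hypothetical, and the `<src>-<tgt>` naming pattern is an assumption inferred from the `acf-eng` example above. It relies on `get_dataset_config_names` from the `datasets` library to list the configurations that are actually published, which is the safer way to discover valid pair names.

```python
def pair_config(src: str, tgt: str = "eng") -> str:
    """Build a config name such as "acf-eng".

    Assumption: configs follow a "<src>-<tgt>" pattern, as in the
    "acf-eng" example; check list_pairs() for the names that exist.
    """
    return f"{src.strip().lower()}-{tgt.strip().lower()}"


def list_pairs(repo: str = "jhu-clsp/kreyol-mt") -> list:
    """Return the config (language pair) names published for the repo.

    Needs network access and the `datasets` package.
    """
    # Imported lazily so pair_config() stays usable without `datasets`.
    from datasets import get_dataset_config_names
    return get_dataset_config_names(repo)


def load_pair(src: str, tgt: str = "eng", repo: str = "jhu-clsp/kreyol-mt"):
    """Load one language pair; needs network access and `datasets`."""
    from datasets import load_dataset
    return load_dataset(repo, pair_config(src, tgt))
```

With these helpers, `load_pair("acf")` would issue the same call as the example above, and `list_pairs()` would show which pair names can be passed to `load_dataset`.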
## Dataset info
Unfortunately, the full dataset we intend to release is not available yet. We are still waiting on the LDC release of a
portion of it, and we want to release the rest together with that portion.
What is hosted here now is the exact dataset we used to train the models in our published work,
"Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages"
(to be presented at [NAACL 2024](https://2024.naacl.org/)), with the sentences from the Church of Jesus Christ of
Latter-day Saints (CJCLDS) removed from the train and dev sets. This is a temporary arrangement until the CJCLDS data is
released on LDC.
In the coming weeks and months we will add:
- The CJCLDS data from LDC, upon its release
- NLLB data that we excluded from our model training but decided to include in our public data release
- All releasable monolingual data
- Any additional data that we or others come across and incorporate: we intend this to be a living dataset!
Additional upcoming updates:
- Metadata indicating which aligned sentences came from which sources prior to our data splitting
Since we are still awaiting the public release of CJCLDS data, please contact Nate Robinson at
[n8rrobinson@gmail.com](mailto:n8rrobinson@gmail.com) for the full dataset if needed.
## Documentation
Documentation of all our data, including license and release information for data from individual sources, is available at our
GitHub repo [here](https://github.com/JHU-CLSP/Kreyol-MT/tree/main/data-documentation).
## Cleaning
All dev and test sets are cleaned already. For information on cleaning for train sets, see our GitHub repo [here](https://github.com/JHU-CLSP/Kreyol-MT/tree/main/scripts/cleaning).
For uncleaned or additional sets, please contact the [authors](mailto:n8rrobinson@gmail.com).
## Paper and citation information
Please see our paper: 📄 ["Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages"](https://arxiv.org/abs/2405.05376)
And cite our work:
```
@article{robinson2024krey,
title={Krey{\`o}l-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages},
author={Robinson, Nathaniel R and Dabre, Raj and Shurtz, Ammon and Dent, Rasul and Onesi, Onenamiyi and Monroc, Claire Bizon and Grobol, Lo{\"\i}c and Muhammad, Hasan and Garg, Ashi and Etori, Naome A and others},
journal={arXiv preprint arXiv:2405.05376},
year={2024}
}
```
Provider: maas

Created: 2025-09-11



