jhu-clsp/kreyol-mt
Source: Hugging Face. Updated 2024-10-24; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/jhu-clsp/kreyol-mt
Resource description:
---
language:
- acf
- aoa
- bah
- bzj
- bzk
- cri
- crs
- dcr
- djk
- fab
- fng
- fpe
- gcf
- gcr
- gpe
- gul
- gyn
- hat
- icr
- jam
- kea
- kri
- ktu
- lou
- mfe
- mue
- pap
- pcm
- pov
- pre
- rcf
- sag
- srm
- srn
- svc
- tpi
- trf
- wes
- ara
- aze
- ceb
- deu
- eng
- fra
- nep
- por
- spa
- zho
license: other
task_categories:
- translation
pretty_name: Kreyòl-MT
configs:
- config_name: acf-eng
  data_files:
  - split: test
    path: acf-eng/test-*
  - split: train
    path: acf-eng/train-*
  - split: validation
    path: acf-eng/validation-*
- config_name: aoa-eng
  data_files:
  - split: test
    path: aoa-eng/test-*
  - split: train
    path: aoa-eng/train-*
  - split: validation
    path: aoa-eng/validation-*
- config_name: bah-eng
  data_files:
  - split: test
    path: bah-eng/test-*
  - split: train
    path: bah-eng/train-*
  - split: validation
    path: bah-eng/validation-*
- config_name: brc-eng
  data_files:
  - split: test
    path: brc-eng/test-*
  - split: train
    path: brc-eng/train-*
  - split: validation
    path: brc-eng/validation-*
- config_name: bzj-eng
  data_files:
  - split: test
    path: bzj-eng/test-*
  - split: train
    path: bzj-eng/train-*
  - split: validation
    path: bzj-eng/validation-*
- config_name: bzk-eng
  data_files:
  - split: test
    path: bzk-eng/test-*
  - split: train
    path: bzk-eng/train-*
  - split: validation
    path: bzk-eng/validation-*
- config_name: cri-eng
  data_files:
  - split: test
    path: cri-eng/test-*
  - split: train
    path: cri-eng/train-*
  - split: validation
    path: cri-eng/validation-*
- config_name: crs-eng
  data_files:
  - split: test
    path: crs-eng/test-*
  - split: train
    path: crs-eng/train-*
  - split: validation
    path: crs-eng/validation-*
- config_name: dcr-eng
  data_files:
  - split: test
    path: dcr-eng/test-*
  - split: train
    path: dcr-eng/train-*
  - split: validation
    path: dcr-eng/validation-*
- config_name: djk-ara
  data_files:
  - split: test
    path: djk-ara/test-*
  - split: train
    path: djk-ara/train-*
  - split: validation
    path: djk-ara/validation-*
- config_name: djk-ceb
  data_files:
  - split: test
    path: djk-ceb/test-*
  - split: train
    path: djk-ceb/train-*
  - split: validation
    path: djk-ceb/validation-*
- config_name: djk-deu
  data_files:
  - split: test
    path: djk-deu/test-*
  - split: train
    path: djk-deu/train-*
  - split: validation
    path: djk-deu/validation-*
- config_name: djk-eng
  data_files:
  - split: test
    path: djk-eng/test-*
  - split: train
    path: djk-eng/train-*
  - split: validation
    path: djk-eng/validation-*
- config_name: djk-fra
  data_files:
  - split: test
    path: djk-fra/test-*
  - split: train
    path: djk-fra/train-*
  - split: validation
    path: djk-fra/validation-*
- config_name: djk-nep
  data_files:
  - split: test
    path: djk-nep/test-*
  - split: train
    path: djk-nep/train-*
  - split: validation
    path: djk-nep/validation-*
- config_name: djk-zho
  data_files:
  - split: test
    path: djk-zho/test-*
  - split: train
    path: djk-zho/train-*
  - split: validation
    path: djk-zho/validation-*
- config_name: fab-eng
  data_files:
  - split: test
    path: fab-eng/test-*
  - split: train
    path: fab-eng/train-*
  - split: validation
    path: fab-eng/validation-*
- config_name: fng-eng
  data_files:
  - split: test
    path: fng-eng/test-*
  - split: train
    path: fng-eng/train-*
  - split: validation
    path: fng-eng/validation-*
- config_name: fpe-eng
  data_files:
  - split: test
    path: fpe-eng/test-*
  - split: train
    path: fpe-eng/train-*
  - split: validation
    path: fpe-eng/validation-*
- config_name: gcf-eng
  data_files:
  - split: test
    path: gcf-eng/test-*
  - split: train
    path: gcf-eng/train-*
  - split: validation
    path: gcf-eng/validation-*
- config_name: gcf-fra
  data_files:
  - split: test
    path: gcf-fra/test-*
  - split: train
    path: gcf-fra/train-*
  - split: validation
    path: gcf-fra/validation-*
- config_name: gcr-eng
  data_files:
  - split: test
    path: gcr-eng/test-*
  - split: train
    path: gcr-eng/train-*
  - split: validation
    path: gcr-eng/validation-*
- config_name: gcr-fra
  data_files:
  - split: test
    path: gcr-fra/test-*
  - split: train
    path: gcr-fra/train-*
  - split: validation
    path: gcr-fra/validation-*
- config_name: gpe-eng
  data_files:
  - split: test
    path: gpe-eng/test-*
  - split: train
    path: gpe-eng/train-*
  - split: validation
    path: gpe-eng/validation-*
- config_name: gul-eng
  data_files:
  - split: test
    path: gul-eng/test-*
  - split: train
    path: gul-eng/train-*
  - split: validation
    path: gul-eng/validation-*
- config_name: gyn-eng
  data_files:
  - split: test
    path: gyn-eng/test-*
  - split: train
    path: gyn-eng/train-*
  - split: validation
    path: gyn-eng/validation-*
- config_name: hat-ara
  data_files:
  - split: test
    path: hat-ara/test-*
  - split: train
    path: hat-ara/train-*
  - split: validation
    path: hat-ara/validation-*
- config_name: hat-aze
  data_files:
  - split: test
    path: hat-aze/test-*
  - split: train
    path: hat-aze/train-*
  - split: validation
    path: hat-aze/validation-*
- config_name: hat-deu
  data_files:
  - split: test
    path: hat-deu/test-*
  - split: train
    path: hat-deu/train-*
  - split: validation
    path: hat-deu/validation-*
- config_name: hat-eng
  data_files:
  - split: test
    path: hat-eng/test-*
  - split: train
    path: hat-eng/train-*
  - split: validation
    path: hat-eng/validation-*
- config_name: hat-fra
  data_files:
  - split: test
    path: hat-fra/test-*
  - split: train
    path: hat-fra/train-*
  - split: validation
    path: hat-fra/validation-*
- config_name: hat-nep
  data_files:
  - split: test
    path: hat-nep/test-*
  - split: train
    path: hat-nep/train-*
  - split: validation
    path: hat-nep/validation-*
- config_name: hat-zho
  data_files:
  - split: test
    path: hat-zho/test-*
  - split: train
    path: hat-zho/train-*
  - split: validation
    path: hat-zho/validation-*
- config_name: icr-eng
  data_files:
  - split: test
    path: icr-eng/test-*
  - split: train
    path: icr-eng/train-*
  - split: validation
    path: icr-eng/validation-*
- config_name: jam-eng
  data_files:
  - split: test
    path: jam-eng/test-*
  - split: train
    path: jam-eng/train-*
  - split: validation
    path: jam-eng/validation-*
- config_name: jam-fra
  data_files:
  - split: train
    path: jam-fra/train-*
- config_name: kea-eng
  data_files:
  - split: test
    path: kea-eng/test-*
  - split: train
    path: kea-eng/train-*
  - split: validation
    path: kea-eng/validation-*
- config_name: kea-fra
  data_files:
  - split: test
    path: kea-fra/test-*
  - split: train
    path: kea-fra/train-*
  - split: validation
    path: kea-fra/validation-*
- config_name: kea-hat
  data_files:
  - split: test
    path: kea-hat/test-*
  - split: train
    path: kea-hat/train-*
  - split: validation
    path: kea-hat/validation-*
- config_name: kea-spa
  data_files:
  - split: test
    path: kea-spa/test-*
  - split: train
    path: kea-spa/train-*
  - split: validation
    path: kea-spa/validation-*
- config_name: kri-eng
  data_files:
  - split: test
    path: kri-eng/test-*
  - split: train
    path: kri-eng/train-*
  - split: validation
    path: kri-eng/validation-*
- config_name: ktu-eng
  data_files:
  - split: test
    path: ktu-eng/test-*
  - split: train
    path: ktu-eng/train-*
  - split: validation
    path: ktu-eng/validation-*
- config_name: lou-eng
  data_files:
  - split: test
    path: lou-eng/test-*
  - split: train
    path: lou-eng/train-*
  - split: validation
    path: lou-eng/validation-*
- config_name: mart1259-eng
  data_files:
  - split: test
    path: mart1259-eng/test-*
  - split: train
    path: mart1259-eng/train-*
  - split: validation
    path: mart1259-eng/validation-*
- config_name: mart1259-fra
  data_files:
  - split: test
    path: mart1259-fra/test-*
  - split: train
    path: mart1259-fra/train-*
  - split: validation
    path: mart1259-fra/validation-*
- config_name: mfe-ara
  data_files:
  - split: test
    path: mfe-ara/test-*
  - split: train
    path: mfe-ara/train-*
  - split: validation
    path: mfe-ara/validation-*
- config_name: mfe-aze
  data_files:
  - split: test
    path: mfe-aze/test-*
  - split: train
    path: mfe-aze/train-*
  - split: validation
    path: mfe-aze/validation-*
- config_name: mfe-deu
  data_files:
  - split: test
    path: mfe-deu/test-*
  - split: train
    path: mfe-deu/train-*
  - split: validation
    path: mfe-deu/validation-*
- config_name: mfe-eng
  data_files:
  - split: test
    path: mfe-eng/test-*
  - split: train
    path: mfe-eng/train-*
  - split: validation
    path: mfe-eng/validation-*
- config_name: mfe-fra
  data_files:
  - split: test
    path: mfe-fra/test-*
  - split: train
    path: mfe-fra/train-*
  - split: validation
    path: mfe-fra/validation-*
- config_name: mue-eng
  data_files:
  - split: test
    path: mue-eng/test-*
  - split: train
    path: mue-eng/train-*
  - split: validation
    path: mue-eng/validation-*
- config_name: pap-ara
  data_files:
  - split: test
    path: pap-ara/test-*
  - split: train
    path: pap-ara/train-*
  - split: validation
    path: pap-ara/validation-*
- config_name: pap-aze
  data_files:
  - split: test
    path: pap-aze/test-*
  - split: train
    path: pap-aze/train-*
  - split: validation
    path: pap-aze/validation-*
- config_name: pap-deu
  data_files:
  - split: test
    path: pap-deu/test-*
  - split: train
    path: pap-deu/train-*
  - split: validation
    path: pap-deu/validation-*
- config_name: pap-eng
  data_files:
  - split: test
    path: pap-eng/test-*
  - split: train
    path: pap-eng/train-*
  - split: validation
    path: pap-eng/validation-*
- config_name: pap-fra
  data_files:
  - split: test
    path: pap-fra/test-*
  - split: train
    path: pap-fra/train-*
  - split: validation
    path: pap-fra/validation-*
- config_name: pap-nep
  data_files:
  - split: test
    path: pap-nep/test-*
  - split: train
    path: pap-nep/train-*
  - split: validation
    path: pap-nep/validation-*
- config_name: pap-por
  data_files:
  - split: test
    path: pap-por/test-*
  - split: train
    path: pap-por/train-*
  - split: validation
    path: pap-por/validation-*
- config_name: pap-spa
  data_files:
  - split: test
    path: pap-spa/test-*
  - split: train
    path: pap-spa/train-*
  - split: validation
    path: pap-spa/validation-*
- config_name: pap-zho
  data_files:
  - split: test
    path: pap-zho/test-*
  - split: train
    path: pap-zho/train-*
  - split: validation
    path: pap-zho/validation-*
- config_name: pcm-eng
  data_files:
  - split: test
    path: pcm-eng/test-*
  - split: train
    path: pcm-eng/train-*
  - split: validation
    path: pcm-eng/validation-*
- config_name: pov-eng
  data_files:
  - split: test
    path: pov-eng/test-*
  - split: train
    path: pov-eng/train-*
  - split: validation
    path: pov-eng/validation-*
- config_name: pre-eng
  data_files:
  - split: test
    path: pre-eng/test-*
  - split: train
    path: pre-eng/train-*
  - split: validation
    path: pre-eng/validation-*
- config_name: rcf-eng
  data_files:
  - split: test
    path: rcf-eng/test-*
  - split: train
    path: rcf-eng/train-*
  - split: validation
    path: rcf-eng/validation-*
- config_name: sag-eng
  data_files:
  - split: test
    path: sag-eng/test-*
  - split: train
    path: sag-eng/train-*
  - split: validation
    path: sag-eng/validation-*
- config_name: srm-eng
  data_files:
  - split: test
    path: srm-eng/test-*
  - split: train
    path: srm-eng/train-*
  - split: validation
    path: srm-eng/validation-*
- config_name: srn-eng
  data_files:
  - split: test
    path: srn-eng/test-*
  - split: train
    path: srn-eng/train-*
  - split: validation
    path: srn-eng/validation-*
- config_name: srn-fra
  data_files:
  - split: train
    path: srn-fra/train-*
- config_name: svc-eng
  data_files:
  - split: test
    path: svc-eng/test-*
  - split: train
    path: svc-eng/train-*
  - split: validation
    path: svc-eng/validation-*
- config_name: tpi-deu
  data_files:
  - split: test
    path: tpi-deu/test-*
  - split: train
    path: tpi-deu/train-*
  - split: validation
    path: tpi-deu/validation-*
- config_name: tpi-eng
  data_files:
  - split: test
    path: tpi-eng/test-*
  - split: train
    path: tpi-eng/train-*
  - split: validation
    path: tpi-eng/validation-*
- config_name: tpi-fra
  data_files:
  - split: train
    path: tpi-fra/train-*
- config_name: trf-eng
  data_files:
  - split: test
    path: trf-eng/test-*
  - split: train
    path: trf-eng/train-*
  - split: validation
    path: trf-eng/validation-*
- config_name: wes-eng
  data_files:
  - split: test
    path: wes-eng/test-*
  - split: train
    path: wes-eng/train-*
  - split: validation
    path: wes-eng/validation-*
---
# Kreyòl-MT


Welcome to our public data repository!
Please download data for any language pair via the command `load_dataset("jhu-clsp/kreyol-mt", "<language-pair-name>")`. For example:
```
from datasets import load_dataset
data = load_dataset("jhu-clsp/kreyol-mt", "acf-eng")
```
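The card metadata lists every language pair as a separate configuration, and three of them (`jam-fra`, `srn-fra`, `tpi-fra`) declare only a `train` split. The following is a minimal sketch of a wrapper that checks the requested split before downloading. The helper names (`split_pair`, `available_splits`, `load_pair`) are hypothetical and not part of the `datasets` library or of this repository; only `load_dataset` is the real call.

```python
# Configs whose entry in the card metadata lists only a train split.
TRAIN_ONLY_CONFIGS = {"jam-fra", "srn-fra", "tpi-fra"}


def split_pair(config_name: str) -> tuple[str, str]:
    """Split a config name such as 'acf-eng' into its two language codes."""
    first, sep, second = config_name.partition("-")
    if not sep or not first or not second:
        raise ValueError(f"not a language-pair name: {config_name!r}")
    return first, second


def available_splits(config_name: str) -> list[str]:
    """Return the splits declared for a config in the card metadata."""
    if config_name in TRAIN_ONLY_CONFIGS:
        return ["train"]
    return ["train", "validation", "test"]


def load_pair(config_name: str, split: str = "train"):
    """Load one split of one language pair (needs network access)."""
    if split not in available_splits(config_name):
        raise ValueError(f"{config_name} has no {split!r} split")
    # Imported lazily so the helpers above can be used without `datasets`.
    from datasets import load_dataset
    return load_dataset("jhu-clsp/kreyol-mt", config_name, split=split)
```

For instance, `load_pair("hat-eng", "validation")` would fetch the Haitian-English validation set, while `load_pair("tpi-fra", "test")` raises a `ValueError` before any download is attempted.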
## Dataset info
The full dataset we intend to release is not quite here yet, unfortunately. We are still waiting on the LDC release of a
portion of it, and the rest we want to release together.
What's hosted here now is the exact data set we used to train our models in published work,
"Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages"
(to be presented at [NAACL 2024](https://2024.naacl.org/)), with the sentences from the Church of Jesus Christ of
Latter-day Saints (CJCLDS) removed from train and dev sets. This is a temporary provision until these data's impending
release on LDC.
In the coming weeks and months we will add:
- The CJCLDS data from LDC, upon its release
- NLLB data that we excluded from our model training but decided to include in our public data release
- All releasable monolingual data
- Any additional data that we or others come across and incorporate: we intend this to be a living dataset!
Additional upcoming updates:
- Metadata indicating which aligned sentences came from which sources prior to our data splitting
Since we are still awaiting the public release of CJCLDS data, please contact Nate Robinson at
[n8rrobinson@gmail.com](mailto:n8rrobinson@gmail.com) for the full dataset if needed.
## Documentation
Documentation of all our data, including license and release information for data from individual sources, is available at our
GitHub repo [here](https://github.com/JHU-CLSP/Kreyol-MT/tree/main/data-documentation).
## Cleaning
All dev and test sets are cleaned already. For information on cleaning for train sets, see our GitHub repo [here](https://github.com/JHU-CLSP/Kreyol-MT/tree/main/scripts/cleaning).
For uncleaned or additional sets, please contact the [authors](mailto:n8rrobinson@gmail.com).
## Paper and citation information
Please see our paper: 📄 ["Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages"](https://arxiv.org/abs/2405.05376)
And cite our work:
```
@article{robinson2024krey,
title={Krey{\`o}l-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages},
author={Robinson, Nathaniel R and Dabre, Raj and Shurtz, Ammon and Dent, Rasul and Onesi, Onenamiyi and Monroc, Claire Bizon and Grobol, Lo{\"\i}c and Muhammad, Hasan and Garg, Ashi and Etori, Naome A and others},
journal={arXiv preprint arXiv:2405.05376},
year={2024}
}
```
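Provided by:
jhu-clsp
Summary of original information
Dataset overview: Kreyòl-MT
Dataset languages
- Multiple languages are supported, including but not limited to: acf, aoa, bah, bzj, bzk, cri, crs, dcr, djk, fab, fng, fpe, gcf, gcr, gpe, gul, gyn, hat, icr, jam, kea, kri, ktu, lou, mfe, mue, pap, pcm, pov, pre, rcf, sag, srm, srn, svc, tpi, trf, wes, ara, aze, ceb, deu, eng, fra, nep, por, spa, zho.
License
- The dataset license type is "other".
Task category
- The primary task category is translation.
Dataset configurations
The dataset contains multiple configurations, each corresponding to one language pair and its data files. Some example configurations follow: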
- config_name: acf-eng
  data_files:
  - split: test
    path: acf-eng/test-*
  - split: train
    path: acf-eng/train-*
  - split: validation
    path: acf-eng/validation-*
- config_name: aoa-eng
  data_files:
  - split: test
    path: aoa-eng/test-*
  - split: train
    path: aoa-eng/train-*
  - split: validation
    path: aoa-eng/validation-*
- config_name: djk-eng
  data_files:
  - split: test
    path: djk-eng/test-*
  - split: train
    path: djk-eng/train-*
  - split: validation
    path: djk-eng/validation-*
- config_name: hat-eng
  data_files:
  - split: test
    path: hat-eng/test-*
  - split: train
    path: hat-eng/train-*
  - split: validation
    path: hat-eng/validation-*
- config_name: pap-eng
  data_files:
  - split: test
    path: pap-eng/test-*
  - split: train
    path: pap-eng/train-*
  - split: validation
    path: pap-eng/validation-*
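Each `path` entry in the configurations above is a glob pattern that assigns files to a split. The sketch below illustrates that mapping with the standard library only. It is not how the `datasets` library resolves files internally, and the file names used in the usage note are assumptions made for illustration, not names taken from the repository.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the acf-eng configuration.
ACF_ENG_DATA_FILES = {
    "test": "acf-eng/test-*",
    "train": "acf-eng/train-*",
    "validation": "acf-eng/validation-*",
}


def split_for(path: str, data_files: dict) -> Optional[str]:
    """Return the split whose glob pattern matches `path`, or None."""
    for split, pattern in data_files.items():
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatchcase(path, pattern):
            return split
    return None
```

Under these assumptions, a hypothetical file `acf-eng/train-00000-of-00001.parquet` would be assigned to the `train` split, and a file from another configuration's directory would match none of the patterns.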
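Dataset update plan
- The following will be added in the future:
- The CJCLDS data, once LDC releases it.
- NLLB data, which was not used for model training but will be included in the public data release.
- All releasable monolingual data.
- Any other data that is discovered and incorporated, so that the dataset stays up to date.
Dataset documentation
- Detailed data documentation, including license and release information for data from individual sources, is available in the GitHub repository: Kreyol-MT data documentation (https://github.com/JHU-CLSP/Kreyol-MT/tree/main/data-documentation).
Dataset cleaning
- All dev and test sets are already cleaned. For information on cleaning the train sets, see the GitHub repository: cleaning scripts (https://github.com/JHU-CLSP/Kreyol-MT/tree/main/scripts/cleaning).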
Paper and citation information
- Citation format:
@article{robinson2024krey, title={Krey{\`o}l-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages}, author={Robinson, Nathaniel R and Dabre, Raj and Shurtz, Ammon and Dent, Rasul and Onesi, Onenamiyi and Monroc, Claire Bizon and Grobol, Lo{\"\i}c and Muhammad, Hasan and Garg, Ashi and Etori, Naome A and others}, journal={arXiv preprint arXiv:2405.05376}, year={2024} }



