jhu-clsp/kreyol-mt

Source: Hugging Face · Updated 2024-10-24 · Indexed 2024-06-12
Download link: https://hf-mirror.com/datasets/jhu-clsp/kreyol-mt
Description:
---
language:
- acf
- aoa
- bah
- bzj
- bzk
- cri
- crs
- dcr
- djk
- fab
- fng
- fpe
- gcf
- gcr
- gpe
- gul
- gyn
- hat
- icr
- jam
- kea
- kri
- ktu
- lou
- mfe
- mue
- pap
- pcm
- pov
- pre
- rcf
- sag
- srm
- srn
- svc
- tpi
- trf
- wes
- ara
- aze
- ceb
- deu
- eng
- fra
- nep
- por
- spa
- zho
license: other
task_categories:
- translation
pretty_name: Kreyòl-MT
configs:
- config_name: acf-eng
  data_files:
  - split: test
    path: acf-eng/test-*
  - split: train
    path: acf-eng/train-*
  - split: validation
    path: acf-eng/validation-*
- config_name: aoa-eng
  data_files:
  - split: test
    path: aoa-eng/test-*
  - split: train
    path: aoa-eng/train-*
  - split: validation
    path: aoa-eng/validation-*
- config_name: bah-eng
  data_files:
  - split: test
    path: bah-eng/test-*
  - split: train
    path: bah-eng/train-*
  - split: validation
    path: bah-eng/validation-*
- config_name: brc-eng
  data_files:
  - split: test
    path: brc-eng/test-*
  - split: train
    path: brc-eng/train-*
  - split: validation
    path: brc-eng/validation-*
- config_name: bzj-eng
  data_files:
  - split: test
    path: bzj-eng/test-*
  - split: train
    path: bzj-eng/train-*
  - split: validation
    path: bzj-eng/validation-*
- config_name: bzk-eng
  data_files:
  - split: test
    path: bzk-eng/test-*
  - split: train
    path: bzk-eng/train-*
  - split: validation
    path: bzk-eng/validation-*
- config_name: cri-eng
  data_files:
  - split: test
    path: cri-eng/test-*
  - split: train
    path: cri-eng/train-*
  - split: validation
    path: cri-eng/validation-*
- config_name: crs-eng
  data_files:
  - split: test
    path: crs-eng/test-*
  - split: train
    path: crs-eng/train-*
  - split: validation
    path: crs-eng/validation-*
- config_name: dcr-eng
  data_files:
  - split: test
    path: dcr-eng/test-*
  - split: train
    path: dcr-eng/train-*
  - split: validation
    path: dcr-eng/validation-*
- config_name: djk-ara
  data_files:
  - split: test
    path: djk-ara/test-*
  - split: train
    path: djk-ara/train-*
  - split: validation
    path: djk-ara/validation-*
- config_name: djk-ceb
  data_files:
  - split: test
    path: djk-ceb/test-*
  - split: train
    path: djk-ceb/train-*
  - split: validation
    path: djk-ceb/validation-*
- config_name: djk-deu
  data_files:
  - split: test
    path: djk-deu/test-*
  - split: train
    path: djk-deu/train-*
  - split: validation
    path: djk-deu/validation-*
- config_name: djk-eng
  data_files:
  - split: test
    path: djk-eng/test-*
  - split: train
    path: djk-eng/train-*
  - split: validation
    path: djk-eng/validation-*
- config_name: djk-fra
  data_files:
  - split: test
    path: djk-fra/test-*
  - split: train
    path: djk-fra/train-*
  - split: validation
    path: djk-fra/validation-*
- config_name: djk-nep
  data_files:
  - split: test
    path: djk-nep/test-*
  - split: train
    path: djk-nep/train-*
  - split: validation
    path: djk-nep/validation-*
- config_name: djk-zho
  data_files:
  - split: test
    path: djk-zho/test-*
  - split: train
    path: djk-zho/train-*
  - split: validation
    path: djk-zho/validation-*
- config_name: fab-eng
  data_files:
  - split: test
    path: fab-eng/test-*
  - split: train
    path: fab-eng/train-*
  - split: validation
    path: fab-eng/validation-*
- config_name: fng-eng
  data_files:
  - split: test
    path: fng-eng/test-*
  - split: train
    path: fng-eng/train-*
  - split: validation
    path: fng-eng/validation-*
- config_name: fpe-eng
  data_files:
  - split: test
    path: fpe-eng/test-*
  - split: train
    path: fpe-eng/train-*
  - split: validation
    path: fpe-eng/validation-*
- config_name: gcf-eng
  data_files:
  - split: test
    path: gcf-eng/test-*
  - split: train
    path: gcf-eng/train-*
  - split: validation
    path: gcf-eng/validation-*
- config_name: gcf-fra
  data_files:
  - split: test
    path: gcf-fra/test-*
  - split: train
    path: gcf-fra/train-*
  - split: validation
    path: gcf-fra/validation-*
- config_name: gcr-eng
  data_files:
  - split: test
    path: gcr-eng/test-*
  - split: train
    path: gcr-eng/train-*
  - split: validation
    path: gcr-eng/validation-*
- config_name: gcr-fra
  data_files:
  - split: test
    path: gcr-fra/test-*
  - split: train
    path: gcr-fra/train-*
  - split: validation
    path: gcr-fra/validation-*
- config_name: gpe-eng
  data_files:
  - split: test
    path: gpe-eng/test-*
  - split: train
    path: gpe-eng/train-*
  - split: validation
    path: gpe-eng/validation-*
- config_name: gul-eng
  data_files:
  - split: test
    path: gul-eng/test-*
  - split: train
    path: gul-eng/train-*
  - split: validation
    path: gul-eng/validation-*
- config_name: gyn-eng
  data_files:
  - split: test
    path: gyn-eng/test-*
  - split: train
    path: gyn-eng/train-*
  - split: validation
    path: gyn-eng/validation-*
- config_name: hat-ara
  data_files:
  - split: test
    path: hat-ara/test-*
  - split: train
    path: hat-ara/train-*
  - split: validation
    path: hat-ara/validation-*
- config_name: hat-aze
  data_files:
  - split: test
    path: hat-aze/test-*
  - split: train
    path: hat-aze/train-*
  - split: validation
    path: hat-aze/validation-*
- config_name: hat-deu
  data_files:
  - split: test
    path: hat-deu/test-*
  - split: train
    path: hat-deu/train-*
  - split: validation
    path: hat-deu/validation-*
- config_name: hat-eng
  data_files:
  - split: test
    path: hat-eng/test-*
  - split: train
    path: hat-eng/train-*
  - split: validation
    path: hat-eng/validation-*
- config_name: hat-fra
  data_files:
  - split: test
    path: hat-fra/test-*
  - split: train
    path: hat-fra/train-*
  - split: validation
    path: hat-fra/validation-*
- config_name: hat-nep
  data_files:
  - split: test
    path: hat-nep/test-*
  - split: train
    path: hat-nep/train-*
  - split: validation
    path: hat-nep/validation-*
- config_name: hat-zho
  data_files:
  - split: test
    path: hat-zho/test-*
  - split: train
    path: hat-zho/train-*
  - split: validation
    path: hat-zho/validation-*
- config_name: icr-eng
  data_files:
  - split: test
    path: icr-eng/test-*
  - split: train
    path: icr-eng/train-*
  - split: validation
    path: icr-eng/validation-*
- config_name: jam-eng
  data_files:
  - split: test
    path: jam-eng/test-*
  - split: train
    path: jam-eng/train-*
  - split: validation
    path: jam-eng/validation-*
- config_name: jam-fra
  data_files:
  - split: train
    path: jam-fra/train-*
- config_name: kea-eng
  data_files:
  - split: test
    path: kea-eng/test-*
  - split: train
    path: kea-eng/train-*
  - split: validation
    path: kea-eng/validation-*
- config_name: kea-fra
  data_files:
  - split: test
    path: kea-fra/test-*
  - split: train
    path: kea-fra/train-*
  - split: validation
    path: kea-fra/validation-*
- config_name: kea-hat
  data_files:
  - split: test
    path: kea-hat/test-*
  - split: train
    path: kea-hat/train-*
  - split: validation
    path: kea-hat/validation-*
- config_name: kea-spa
  data_files:
  - split: test
    path: kea-spa/test-*
  - split: train
    path: kea-spa/train-*
  - split: validation
    path: kea-spa/validation-*
- config_name: kri-eng
  data_files:
  - split: test
    path: kri-eng/test-*
  - split: train
    path: kri-eng/train-*
  - split: validation
    path: kri-eng/validation-*
- config_name: ktu-eng
  data_files:
  - split: test
    path: ktu-eng/test-*
  - split: train
    path: ktu-eng/train-*
  - split: validation
    path: ktu-eng/validation-*
- config_name: lou-eng
  data_files:
  - split: test
    path: lou-eng/test-*
  - split: train
    path: lou-eng/train-*
  - split: validation
    path: lou-eng/validation-*
- config_name: mart1259-eng
  data_files:
  - split: test
    path: mart1259-eng/test-*
  - split: train
    path: mart1259-eng/train-*
  - split: validation
    path: mart1259-eng/validation-*
- config_name: mart1259-fra
  data_files:
  - split: test
    path: mart1259-fra/test-*
  - split: train
    path: mart1259-fra/train-*
  - split: validation
    path: mart1259-fra/validation-*
- config_name: mfe-ara
  data_files:
  - split: test
    path: mfe-ara/test-*
  - split: train
    path: mfe-ara/train-*
  - split: validation
    path: mfe-ara/validation-*
- config_name: mfe-aze
  data_files:
  - split: test
    path: mfe-aze/test-*
  - split: train
    path: mfe-aze/train-*
  - split: validation
    path: mfe-aze/validation-*
- config_name: mfe-deu
  data_files:
  - split: test
    path: mfe-deu/test-*
  - split: train
    path: mfe-deu/train-*
  - split: validation
    path: mfe-deu/validation-*
- config_name: mfe-eng
  data_files:
  - split: test
    path: mfe-eng/test-*
  - split: train
    path: mfe-eng/train-*
  - split: validation
    path: mfe-eng/validation-*
- config_name: mfe-fra
  data_files:
  - split: test
    path: mfe-fra/test-*
  - split: train
    path: mfe-fra/train-*
  - split: validation
    path: mfe-fra/validation-*
- config_name: mue-eng
  data_files:
  - split: test
    path: mue-eng/test-*
  - split: train
    path: mue-eng/train-*
  - split: validation
    path: mue-eng/validation-*
- config_name: pap-ara
  data_files:
  - split: test
    path: pap-ara/test-*
  - split: train
    path: pap-ara/train-*
  - split: validation
    path: pap-ara/validation-*
- config_name: pap-aze
  data_files:
  - split: test
    path: pap-aze/test-*
  - split: train
    path: pap-aze/train-*
  - split: validation
    path: pap-aze/validation-*
- config_name: pap-deu
  data_files:
  - split: test
    path: pap-deu/test-*
  - split: train
    path: pap-deu/train-*
  - split: validation
    path: pap-deu/validation-*
- config_name: pap-eng
  data_files:
  - split: test
    path: pap-eng/test-*
  - split: train
    path: pap-eng/train-*
  - split: validation
    path: pap-eng/validation-*
- config_name: pap-fra
  data_files:
  - split: test
    path: pap-fra/test-*
  - split: train
    path: pap-fra/train-*
  - split: validation
    path: pap-fra/validation-*
- config_name: pap-nep
  data_files:
  - split: test
    path: pap-nep/test-*
  - split: train
    path: pap-nep/train-*
  - split: validation
    path: pap-nep/validation-*
- config_name: pap-por
  data_files:
  - split: test
    path: pap-por/test-*
  - split: train
    path: pap-por/train-*
  - split: validation
    path: pap-por/validation-*
- config_name: pap-spa
  data_files:
  - split: test
    path: pap-spa/test-*
  - split: train
    path: pap-spa/train-*
  - split: validation
    path: pap-spa/validation-*
- config_name: pap-zho
  data_files:
  - split: test
    path: pap-zho/test-*
  - split: train
    path: pap-zho/train-*
  - split: validation
    path: pap-zho/validation-*
- config_name: pcm-eng
  data_files:
  - split: test
    path: pcm-eng/test-*
  - split: train
    path: pcm-eng/train-*
  - split: validation
    path: pcm-eng/validation-*
- config_name: pov-eng
  data_files:
  - split: test
    path: pov-eng/test-*
  - split: train
    path: pov-eng/train-*
  - split: validation
    path: pov-eng/validation-*
- config_name: pre-eng
  data_files:
  - split: test
    path: pre-eng/test-*
  - split: train
    path: pre-eng/train-*
  - split: validation
    path: pre-eng/validation-*
- config_name: rcf-eng
  data_files:
  - split: test
    path: rcf-eng/test-*
  - split: train
    path: rcf-eng/train-*
  - split: validation
    path: rcf-eng/validation-*
- config_name: sag-eng
  data_files:
  - split: test
    path: sag-eng/test-*
  - split: train
    path: sag-eng/train-*
  - split: validation
    path: sag-eng/validation-*
- config_name: srm-eng
  data_files:
  - split: test
    path: srm-eng/test-*
  - split: train
    path: srm-eng/train-*
  - split: validation
    path: srm-eng/validation-*
- config_name: srn-eng
  data_files:
  - split: test
    path: srn-eng/test-*
  - split: train
    path: srn-eng/train-*
  - split: validation
    path: srn-eng/validation-*
- config_name: srn-fra
  data_files:
  - split: train
    path: srn-fra/train-*
- config_name: svc-eng
  data_files:
  - split: test
    path: svc-eng/test-*
  - split: train
    path: svc-eng/train-*
  - split: validation
    path: svc-eng/validation-*
- config_name: tpi-deu
  data_files:
  - split: test
    path: tpi-deu/test-*
  - split: train
    path: tpi-deu/train-*
  - split: validation
    path: tpi-deu/validation-*
- config_name: tpi-eng
  data_files:
  - split: test
    path: tpi-eng/test-*
  - split: train
    path: tpi-eng/train-*
  - split: validation
    path: tpi-eng/validation-*
- config_name: tpi-fra
  data_files:
  - split: train
    path: tpi-fra/train-*
- config_name: trf-eng
  data_files:
  - split: test
    path: trf-eng/test-*
  - split: train
    path: trf-eng/train-*
  - split: validation
    path: trf-eng/validation-*
- config_name: wes-eng
  data_files:
  - split: test
    path: wes-eng/test-*
  - split: train
    path: wes-eng/train-*
  - split: validation
    path: wes-eng/validation-*
---

# Kreyòl-MT

![world map](./world_map.png)

![latin america map](./la_map.png)

Welcome to our public data repository! Please download data for any language pair via the command `load_dataset("jhu-clsp/kreyol-mt", "<language-pair-name>")`.
For example:

```
from datasets import load_dataset

# Load every available split of the acf-eng language pair
data = load_dataset("jhu-clsp/kreyol-mt", "acf-eng")
```

## Dataset info

The full dataset we intend to release is not quite here yet, unfortunately. We are still waiting on the LDC release of a portion of it, and the rest we want to release together. What's hosted here now is the exact data set we used to train our models in published work, "Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages" (to be presented at [NAACL 2024](https://2024.naacl.org/)), with the sentences from the Church of Jesus Christ of Latter-day Saints (CJCLDS) removed from train and dev sets. This is a temporary provision until these data's impending release on LDC.

In the coming weeks and months we will add:

- The CJCLDS data from LDC, upon its release
- NLLB data that we excluded from our model training but decided to include in our public data release
- All releasable monolingual data
- Any additional data that we or others come across and incorporate: we intend this to be a living dataset!

Additional upcoming updates:

- Metadata indicating which aligned sentences came from which sources prior to our data splitting

Since we are still awaiting the public release of CJCLDS data, please contact Nate Robinson at [n8rrobinson@gmail.com](mailto:n8rrobinson@gmail.com) for the full dataset if needed.

## Documentation

Documentation of all our data, including license and release information for data from individual sources, is available at our GitHub repo [here](https://github.com/JHU-CLSP/Kreyol-MT/tree/main/data-documentation).

## Cleaning

All dev and test sets are cleaned already. For information on cleaning for train sets, see our GitHub repo [here](https://github.com/JHU-CLSP/Kreyol-MT/tree/main/scripts/cleaning).
For uncleaned or additional sets, please contact the [authors](mailto:n8rrobinson@gmail.com).

## Paper and citation information

Please see our paper:

📄 ["Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages"](https://arxiv.org/abs/2405.05376)

And cite our work:

```
@article{robinson2024krey,
  title={Krey{\`o}l-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages},
  author={Robinson, Nathaniel R and Dabre, Raj and Shurtz, Ammon and Dent, Rasul and Onesi, Onenamiyi and Monroc, Claire Bizon and Grobol, Lo{\"\i}c and Muhammad, Hasan and Garg, Ashi and Etori, Naome A and others},
  journal={arXiv preprint arXiv:2405.05376},
  year={2024}
}
```
Provider:
jhu-clsp
Summary of the original information

Dataset overview: Kreyòl-MT

Dataset languages

  • Multiple languages are supported, including but not limited to: acf, aoa, bah, bzj, bzk, cri, crs, dcr, djk, fab, fng, fpe, gcf, gcr, gpe, gul, gyn, hat, icr, jam, kea, kri, ktu, lou, mfe, mue, pap, pcm, pov, pre, rcf, sag, srm, srn, svc, tpi, trf, wes, ara, aze, ceb, deu, eng, fra, nep, por, spa, zho.
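
Each configuration name on the card joins two of these language codes with a hyphen (for example `hat-eng`). The sketch below splits a configuration name into its two codes and groups configurations by the first one. The names `LanguagePair`, `first`, `second`, `parse_config_name`, and `group_by_first` are hypothetical helpers introduced here for illustration, and the card does not state which side of a pair should be treated as the translation source. Note that not every identifier is a three-letter code: `mart1259` appears in the card's configuration list and looks like a Glottolog-style identifier, so the parser makes no assumption about code length.

```python
from typing import Dict, Iterable, List, NamedTuple


class LanguagePair(NamedTuple):
    """The two language codes joined in a config name (hypothetical helper type)."""
    first: str
    second: str


def parse_config_name(config_name: str) -> LanguagePair:
    """Split a config name such as 'hat-eng' into its two language codes."""
    first, sep, second = config_name.partition("-")
    # Require exactly one hyphen with a non-empty code on each side.
    if not sep or not first or not second or "-" in second:
        raise ValueError(f"not a '<code>-<code>' config name: {config_name!r}")
    return LanguagePair(first, second)


def group_by_first(config_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each first-position code to the codes it is paired with."""
    groups: Dict[str, List[str]] = {}
    for name in config_names:
        pair = parse_config_name(name)
        groups.setdefault(pair.first, []).append(pair.second)
    return groups


# A few config names taken from the card's metadata.
sample = ["hat-eng", "hat-fra", "kea-hat", "mart1259-eng"]
print(group_by_first(sample))
```

Grouping this way makes it easy to see, for instance, that `hat` occurs both as the first code (`hat-eng`) and as the second (`kea-hat`), so direction should not be inferred from position alone.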

License

  • The dataset license type is "other".

Task categories

  • The main task category is translation.

Dataset configurations

The dataset contains multiple configurations, each corresponding to one language pair and its data files. Some example configurations:

  • config_name: acf-eng
    • data_files:
      • split: test
        • path: acf-eng/test-*
      • split: train
        • path: acf-eng/train-*
      • split: validation
        • path: acf-eng/validation-*
  • config_name: aoa-eng
    • data_files:
      • split: test
        • path: aoa-eng/test-*
      • split: train
        • path: aoa-eng/train-*
      • split: validation
        • path: aoa-eng/validation-*
  • config_name: djk-eng
    • data_files:
      • split: test
        • path: djk-eng/test-*
      • split: train
        • path: djk-eng/train-*
      • split: validation
        • path: djk-eng/validation-*
  • config_name: hat-eng
    • data_files:
      • split: test
        • path: hat-eng/test-*
      • split: train
        • path: hat-eng/train-*
      • split: validation
        • path: hat-eng/validation-*
  • config_name: pap-eng
    • data_files:
      • split: test
        • path: pap-eng/test-*
      • split: train
        • path: pap-eng/train-*
      • split: validation
        • path: pap-eng/validation-*
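
Most configurations list `test`, `train`, and `validation` splits, but the card's metadata lists only a `train` split for `jam-fra`, `srn-fra`, and `tpi-fra`. The sketch below encodes that observation so that code iterating over splits does not request one that is absent. `TRAIN_ONLY_CONFIGS`, `expected_splits`, `data_file_patterns`, and `load_available_splits` are hypothetical names, and the hard-coded set reflects the metadata as captured here, which may change as the dataset is updated. Only `load_dataset` comes from the `datasets` library; it is imported lazily so the pure helpers run without the library or network access.

```python
from typing import Dict, List

# Configs whose metadata lists only a train split (assumption: read off the
# card's YAML as captured on this page; re-check against the live card).
TRAIN_ONLY_CONFIGS = frozenset({"jam-fra", "srn-fra", "tpi-fra"})


def expected_splits(config_name: str) -> List[str]:
    """Return the splits the card's metadata lists for a config."""
    if config_name in TRAIN_ONLY_CONFIGS:
        return ["train"]
    return ["test", "train", "validation"]


def data_file_patterns(config_name: str) -> Dict[str, str]:
    """Rebuild the '<config>/<split>-*' glob patterns shown in the metadata."""
    return {
        split: f"{config_name}/{split}-*"
        for split in expected_splits(config_name)
    }


def load_available_splits(config_name: str):
    """Load each listed split of one language pair (requires network access)."""
    from datasets import load_dataset  # lazy import: only needed when loading

    return {
        split: load_dataset("jhu-clsp/kreyol-mt", config_name, split=split)
        for split in expected_splits(config_name)
    }


print(data_file_patterns("jam-fra"))  # {'train': 'jam-fra/train-*'}
```

Calling `load_available_splits("acf-eng")` would then return a dictionary with three entries, while `load_available_splits("tpi-fra")` would return only the training data.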

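Dataset update plan

  • The following will be added in the future:
    • CJCLDS data, once it is released by LDC.
    • NLLB data, which was not used for model training but will be included in the public data release.
    • All releasable monolingual data.
    • Any other data that is found and incorporated, with the aim of keeping the dataset up to date.

Dataset documentation

  • Detailed data documentation, including license and release information for data from individual sources, is available in the GitHub repository: Kreyol-MT data documentation (https://github.com/JHU-CLSP/Kreyol-MT/tree/main/data-documentation)

Dataset cleaning

  • All development and test sets have already been cleaned. For information on cleaning of the training sets, see the GitHub repository: cleaning scripts (https://github.com/JHU-CLSP/Kreyol-MT/tree/main/scripts/cleaning)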

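Paper and citation information

  • Paper: "Kreyòl-MT: Building Machine Translation for Latin American, Caribbean, and Colonial African Creole Languages" (https://arxiv.org/abs/2405.05376)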
