
llama-lang-adapt/wura

Source: Hugging Face. Updated 2024-01-11. Indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/llama-lang-adapt/wura
Resource description:
---
configs:
- config_name: af
  data_files:
  - split: train
    path:
    - af/train.af.txt
  - split: eval
    path: af/eval.af.txt
- config_name: am
  data_files:
  - split: train
    path:
    - am/train.am.txt
  - split: eval
    path: am/eval.am.txt
- config_name: ar
  data_files:
  - split: train
    path:
    - ar/train.ar.txt
  - split: eval
    path: ar/eval.ar.txt
- config_name: en
  data_files:
  - split: train
    path:
    - en/train.en.txt
  - split: eval
    path: en/eval.en.txt
- config_name: fr
  data_files:
  - split: train
    path:
    - fr/train.fr.txt
  - split: eval
    path: fr/eval.fr.txt
- config_name: ha
  data_files:
  - split: train
    path:
    - ha/train.ha.txt
  - split: eval
    path: ha/eval.ha.txt
- config_name: ig
  data_files:
  - split: train
    path:
    - ig/train.ig.txt
  - split: eval
    path: ig/eval.ig.txt
- config_name: ki
  data_files:
  - split: train
    path:
    - ki/train.ki.txt
  - split: eval
    path: ki/eval.ki.txt
- config_name: mg
  data_files:
  - split: train
    path:
    - mg/train.mg.txt
  - split: eval
    path: mg/eval.mg.txt
- config_name: ny
  data_files:
  - split: train
    path:
    - ny/train.ny.txt
  - split: eval
    path: ny/eval.ny.txt
- config_name: or
  data_files:
  - split: train
    path:
    - or/train.or.txt
  - split: eval
    path: or/eval.or.txt
- config_name: po
  data_files:
  - split: train
    path:
    - po/train.po.txt
  - split: eval
    path: po/eval.po.txt
- config_name: sm
  data_files:
  - split: train
    path:
    - sm/train.sm.txt
  - split: eval
    path: sm/eval.sm.txt
- config_name: sn
  data_files:
  - split: train
    path:
    - sn/train.sn.txt
  - split: eval
    path: sn/eval.sn.txt
- config_name: st
  data_files:
  - split: train
    path:
    - st/train.st.txt
  - split: eval
    path: st/eval.st.txt
- config_name: sw
  data_files:
  - split: train
    path:
    - sw/train.sw.txt
  - split: eval
    path: sw/eval.sw.txt
- config_name: ti
  data_files:
  - split: train
    path:
    - ti/train.ti.txt
  - split: eval
    path: ti/eval.ti.txt
- config_name: xh
  data_files:
  - split: train
    path:
    - xh/train.xh.txt
  - split: eval
    path: xh/eval.xh.txt
- config_name: yo
  data_files:
  - split: train
    path:
    - yo/train.yo.txt
  - split: eval
    path: yo/eval.yo.txt
- config_name: zu
  data_files:
  - split: train
    path:
    - zu/train.zu.txt
  - split: eval
    path: zu/eval.zu.txt
---

A copy of the WURA dataset (V2 Passages).

```
langs = {
    "Afrikaans": "af",
    "Amharic": "am",
    "Egyptian Arabic": "ar",
    "English": "en",
    "French": "fr",
    "Hausa": "ha",
    "Igbo": "ig",
    "Kinyarwanda": "ki",
    "Malagasy": "mg",
    "Chichewa": "ny",
    "Afaan Oromoo": "or",
    "Portuguese": "po",
    "Somali": "sm",
    "Shona": "sn",
    "Sesotho": "st",
    "Swahili": "sw",
    "Tigrinya": "ti",
    "Xhosa": "xh",
    "Yoruba": "yo",
    "Zulu": "zu",
}
```
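The config names in the card above are the dataset's own two-letter codes, and several of them (`ki`, `or`, `po`, `sm`) differ from the ISO 639-1 codes for those languages, so it is safer to look them up by language name than to guess. A minimal sketch of loading one split with the Hugging Face `datasets` library follows. The helper name `load_wura` is hypothetical, and the sketch assumes `datasets` is installed and the Hub repository is reachable.

```python
from typing import Dict

# Language names and config codes exactly as listed in the dataset card.
LANGS: Dict[str, str] = {
    "Afrikaans": "af", "Amharic": "am", "Egyptian Arabic": "ar",
    "English": "en", "French": "fr", "Hausa": "ha", "Igbo": "ig",
    "Kinyarwanda": "ki", "Malagasy": "mg", "Chichewa": "ny",
    "Afaan Oromoo": "or", "Portuguese": "po", "Somali": "sm",
    "Shona": "sn", "Sesotho": "st", "Swahili": "sw", "Tigrinya": "ti",
    "Xhosa": "xh", "Yoruba": "yo", "Zulu": "zu",
}

# The card defines exactly these two splits for every config.
SPLITS = ("train", "eval")


def load_wura(language: str, split: str = "train"):
    """Load one split of one language config (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    code = LANGS[language]  # raises KeyError for an unlisted language
    # Imported lazily so the mapping above is usable without `datasets`.
    from datasets import load_dataset
    return load_dataset("llama-lang-adapt/wura", code, split=split)
```

For example, `load_wura("Yoruba", "eval")` would request the `yo` config's `eval` split. Note that the evaluation split is named `eval` here, not `validation` or `test`.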

Provided by:
llama-lang-adapt
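Summary of original information

Dataset overview

The dataset contains 20 language configurations, each split into a training set and an evaluation set. The file paths for each language are listed below.

Languages and file paths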

  • Afrikaans (af)
    • Training set: af/train.af.txt
    • Evaluation set: af/eval.af.txt
  • Amharic (am)
    • Training set: am/train.am.txt
    • Evaluation set: am/eval.am.txt
  • Egyptian Arabic (ar)
    • Training set: ar/train.ar.txt
    • Evaluation set: ar/eval.ar.txt
  • English (en)
    • Training set: en/train.en.txt
    • Evaluation set: en/eval.en.txt
  • French (fr)
    • Training set: fr/train.fr.txt
    • Evaluation set: fr/eval.fr.txt
  • Hausa (ha)
    • Training set: ha/train.ha.txt
    • Evaluation set: ha/eval.ha.txt
  • Igbo (ig)
    • Training set: ig/train.ig.txt
    • Evaluation set: ig/eval.ig.txt
  • Kinyarwanda (ki)
    • Training set: ki/train.ki.txt
    • Evaluation set: ki/eval.ki.txt
  • Malagasy (mg)
    • Training set: mg/train.mg.txt
    • Evaluation set: mg/eval.mg.txt
  • Chichewa (ny)
    • Training set: ny/train.ny.txt
    • Evaluation set: ny/eval.ny.txt
  • Afaan Oromoo (or)
    • Training set: or/train.or.txt
    • Evaluation set: or/eval.or.txt
  • Portuguese (po)
    • Training set: po/train.po.txt
    • Evaluation set: po/eval.po.txt
  • Somali (sm)
    • Training set: sm/train.sm.txt
    • Evaluation set: sm/eval.sm.txt
  • Shona (sn)
    • Training set: sn/train.sn.txt
    • Evaluation set: sn/eval.sn.txt
  • Sesotho (st)
    • Training set: st/train.st.txt
    • Evaluation set: st/eval.st.txt
  • Swahili (sw)
    • Training set: sw/train.sw.txt
    • Evaluation set: sw/eval.sw.txt
  • Tigrinya (ti)
    • Training set: ti/train.ti.txt
    • Evaluation set: ti/eval.ti.txt
  • Xhosa (xh)
    • Training set: xh/train.xh.txt
    • Evaluation set: xh/eval.xh.txt
  • Yoruba (yo)
    • Training set: yo/train.yo.txt
    • Evaluation set: yo/eval.yo.txt
  • Zulu (zu)
    • Training set: zu/train.zu.txt
    • Evaluation set: zu/eval.zu.txt
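
Every path in the list above follows the same pattern, `<code>/<split>.<code>.txt`, so the paths can be generated rather than typed out. The sketch below builds a path from a language code and optionally fetches the raw text file with `huggingface_hub.hf_hub_download`. The helper names `split_path` and `download_split` are hypothetical, and the download step assumes `huggingface_hub` is installed and the repository is reachable. The internal layout of the text files is not described in this listing, so the sketch stops at returning the local file path.

```python
# Config codes as they appear in the file list above.
CODES = (
    "af", "am", "ar", "en", "fr", "ha", "ig", "ki", "mg", "ny",
    "or", "po", "sm", "sn", "st", "sw", "ti", "xh", "yo", "zu",
)


def split_path(code: str, split: str) -> str:
    """Return the repository-relative path for one language and split."""
    if code not in CODES:
        raise KeyError(f"unknown language code: {code!r}")
    if split not in ("train", "eval"):
        raise ValueError(f"unknown split: {split!r}")
    return f"{code}/{split}.{code}.txt"


def download_split(code: str, split: str = "eval") -> str:
    """Download one raw text file and return its local cache path."""
    # Imported lazily so split_path() works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(
        repo_id="llama-lang-adapt/wura",
        filename=split_path(code, split),
        repo_type="dataset",
    )
```

Calling `split_path("sw", "train")` yields `sw/train.sw.txt`, matching the Swahili entry in the list.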