llama-lang-adapt/wura
Source: Hugging Face. Updated 2024-01-11; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/llama-lang-adapt/wura
Description:
---
configs:
- config_name: af
  data_files:
  - split: train
    path:
    - af/train.af.txt
  - split: eval
    path: af/eval.af.txt
- config_name: am
  data_files:
  - split: train
    path:
    - am/train.am.txt
  - split: eval
    path: am/eval.am.txt
- config_name: ar
  data_files:
  - split: train
    path:
    - ar/train.ar.txt
  - split: eval
    path: ar/eval.ar.txt
- config_name: en
  data_files:
  - split: train
    path:
    - en/train.en.txt
  - split: eval
    path: en/eval.en.txt
- config_name: fr
  data_files:
  - split: train
    path:
    - fr/train.fr.txt
  - split: eval
    path: fr/eval.fr.txt
- config_name: ha
  data_files:
  - split: train
    path:
    - ha/train.ha.txt
  - split: eval
    path: ha/eval.ha.txt
- config_name: ig
  data_files:
  - split: train
    path:
    - ig/train.ig.txt
  - split: eval
    path: ig/eval.ig.txt
- config_name: ki
  data_files:
  - split: train
    path:
    - ki/train.ki.txt
  - split: eval
    path: ki/eval.ki.txt
- config_name: mg
  data_files:
  - split: train
    path:
    - mg/train.mg.txt
  - split: eval
    path: mg/eval.mg.txt
- config_name: ny
  data_files:
  - split: train
    path:
    - ny/train.ny.txt
  - split: eval
    path: ny/eval.ny.txt
- config_name: or
  data_files:
  - split: train
    path:
    - or/train.or.txt
  - split: eval
    path: or/eval.or.txt
- config_name: po
  data_files:
  - split: train
    path:
    - po/train.po.txt
  - split: eval
    path: po/eval.po.txt
- config_name: sm
  data_files:
  - split: train
    path:
    - sm/train.sm.txt
  - split: eval
    path: sm/eval.sm.txt
- config_name: sn
  data_files:
  - split: train
    path:
    - sn/train.sn.txt
  - split: eval
    path: sn/eval.sn.txt
- config_name: st
  data_files:
  - split: train
    path:
    - st/train.st.txt
  - split: eval
    path: st/eval.st.txt
- config_name: sw
  data_files:
  - split: train
    path:
    - sw/train.sw.txt
  - split: eval
    path: sw/eval.sw.txt
- config_name: ti
  data_files:
  - split: train
    path:
    - ti/train.ti.txt
  - split: eval
    path: ti/eval.ti.txt
- config_name: xh
  data_files:
  - split: train
    path:
    - xh/train.xh.txt
  - split: eval
    path: xh/eval.xh.txt
- config_name: yo
  data_files:
  - split: train
    path:
    - yo/train.yo.txt
  - split: eval
    path: yo/eval.yo.txt
- config_name: zu
  data_files:
  - split: train
    path:
    - zu/train.zu.txt
  - split: eval
    path: zu/eval.zu.txt
---
A copy of the WURA dataset (V2 Passages).
```
langs = {
    "Afrikaans": "af",
    "Amharic": "am",
    "Egyptian Arabic": "ar",
    "English": "en",
    "French": "fr",
    "Hausa": "ha",
    "Igbo": "ig",
    "Kinyarwanda": "ki",
    "Malagasy": "mg",
    "Chichewa": "ny",
    "Afaan Oromoo": "or",
    "Portuguese": "po",
    "Somali": "sm",
    "Shona": "sn",
    "Sesotho": "st",
    "Swahili": "sw",
    "Tigrinya": "ti",
    "Xhosa": "xh",
    "Yoruba": "yo",
    "Zulu": "zu",
}
```
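As a minimal sketch of how one of these configurations might be loaded, assuming the Hugging Face `datasets` library is installed and the repository is reachable on the Hub under the ID `llama-lang-adapt/wura`. The helper names `config_for` and `load_wura` are hypothetical and exist only for this illustration. Note that the config names are the dataset's own codes and do not all match ISO 639-1: for example, Kinyarwanda is `ki` here rather than `rw`, and Portuguese is `po` rather than `pt`.

```python
# Config names as listed in the dataset card (the dataset's own codes,
# not necessarily ISO 639-1).
LANGS = {
    "Afrikaans": "af", "Amharic": "am", "Egyptian Arabic": "ar",
    "English": "en", "French": "fr", "Hausa": "ha", "Igbo": "ig",
    "Kinyarwanda": "ki", "Malagasy": "mg", "Chichewa": "ny",
    "Afaan Oromoo": "or", "Portuguese": "po", "Somali": "sm",
    "Shona": "sn", "Sesotho": "st", "Swahili": "sw", "Tigrinya": "ti",
    "Xhosa": "xh", "Yoruba": "yo", "Zulu": "zu",
}

REPO_ID = "llama-lang-adapt/wura"  # repository ID as given on the card


def config_for(language: str) -> str:
    """Return the config name for an English language name (hypothetical helper)."""
    try:
        return LANGS[language]
    except KeyError:
        raise ValueError(f"unknown language: {language!r}") from None


def load_wura(language: str, split: str = "train"):
    """Load one split of one language config (hypothetical helper).

    Assumes `datasets` is installed and the Hub is reachable.
    """
    if split not in ("train", "eval"):
        raise ValueError("split must be 'train' or 'eval'")
    # Imported lazily so the helpers above work without the library.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config_for(language), split=split)
```

Under these assumptions, `load_wura("Yoruba", "eval")` would request the `yo` config's `eval` split. For the larger languages, passing `streaming=True` to `load_dataset` may be preferable to downloading the full text files.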
Provider:
llama-lang-adapt
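Summary of original information
Dataset overview
The dataset covers 20 language versions, each split into a training set and an evaluation set. The files for each language version are listed below.
Language versions and file paths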
- Afrikaans (af)
  - Training set: af/train.af.txt
  - Evaluation set: af/eval.af.txt
- Amharic (am)
  - Training set: am/train.am.txt
  - Evaluation set: am/eval.am.txt
- Egyptian Arabic (ar)
  - Training set: ar/train.ar.txt
  - Evaluation set: ar/eval.ar.txt
- English (en)
  - Training set: en/train.en.txt
  - Evaluation set: en/eval.en.txt
- French (fr)
  - Training set: fr/train.fr.txt
  - Evaluation set: fr/eval.fr.txt
- Hausa (ha)
  - Training set: ha/train.ha.txt
  - Evaluation set: ha/eval.ha.txt
- Igbo (ig)
  - Training set: ig/train.ig.txt
  - Evaluation set: ig/eval.ig.txt
- Kinyarwanda (ki)
  - Training set: ki/train.ki.txt
  - Evaluation set: ki/eval.ki.txt
- Malagasy (mg)
  - Training set: mg/train.mg.txt
  - Evaluation set: mg/eval.mg.txt
- Chichewa (ny)
  - Training set: ny/train.ny.txt
  - Evaluation set: ny/eval.ny.txt
- Afaan Oromoo (or)
  - Training set: or/train.or.txt
  - Evaluation set: or/eval.or.txt
- Portuguese (po)
  - Training set: po/train.po.txt
  - Evaluation set: po/eval.po.txt
- Somali (sm)
  - Training set: sm/train.sm.txt
  - Evaluation set: sm/eval.sm.txt
- Shona (sn)
  - Training set: sn/train.sn.txt
  - Evaluation set: sn/eval.sn.txt
- Sesotho (st)
  - Training set: st/train.st.txt
  - Evaluation set: st/eval.st.txt
- Swahili (sw)
  - Training set: sw/train.sw.txt
  - Evaluation set: sw/eval.sw.txt
- Tigrinya (ti)
  - Training set: ti/train.ti.txt
  - Evaluation set: ti/eval.ti.txt
- Xhosa (xh)
  - Training set: xh/train.xh.txt
  - Evaluation set: xh/eval.xh.txt
- Yoruba (yo)
  - Training set: yo/train.yo.txt
  - Evaluation set: yo/eval.yo.txt
- Zulu (zu)
  - Training set: zu/train.zu.txt
  - Evaluation set: zu/eval.zu.txt
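
Every entry in the list above follows the same naming pattern, `<code>/train.<code>.txt` and `<code>/eval.<code>.txt`. The sketch below shows one way that pattern might be used to check a local copy of the repository for missing files. The function names `expected_files` and `missing_files` are hypothetical, and the local directory layout is assumed to mirror the repository.

```python
from pathlib import Path

# Config names copied from the list above.
CODES = [
    "af", "am", "ar", "en", "fr", "ha", "ig", "ki", "mg", "ny",
    "or", "po", "sm", "sn", "st", "sw", "ti", "xh", "yo", "zu",
]


def expected_files(code: str) -> dict[str, str]:
    """Return the relative train/eval paths for one config (hypothetical helper)."""
    return {
        "train": f"{code}/train.{code}.txt",
        "eval": f"{code}/eval.{code}.txt",
    }


def missing_files(root) -> list[str]:
    """List expected relative paths that are absent under `root`.

    Assumes `root` is a local directory laid out like the repository.
    """
    root = Path(root)
    missing = []
    for code in CODES:
        for rel in expected_files(code).values():
            if not (root / rel).is_file():
                missing.append(rel)
    return missing
```

For a complete local copy, `missing_files("path/to/wura")` would return an empty list; the path shown is a placeholder.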



