
GiJoeHansFranz/Liseli

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/GiJoeHansFranz/Liseli
Resource description:
---
license: cc-by-sa-4.0
language:
- bem
- nya
- toi
- loz
- lue
- lun
- kqn
- en
task_categories:
- translation
- text-generation
- automatic-speech-recognition
pretty_name: Liseli — Zambian Language Dataset (Parallel, Dictionary, Monolingual, Audio)
size_categories:
- 100K<n<1M
configs:
- config_name: parallel-bemba
  data_files:
  - split: train
    path: "parallel-corpus/bemba.parquet"
- config_name: parallel-nyanja
  data_files:
  - split: train
    path: "parallel-corpus/nyanja.parquet"
- config_name: parallel-tonga
  data_files:
  - split: train
    path: "parallel-corpus/tonga.parquet"
- config_name: parallel-lozi
  data_files:
  - split: train
    path: "parallel-corpus/lozi.parquet"
- config_name: parallel-luvale
  data_files:
  - split: train
    path: "parallel-corpus/luvale.parquet"
- config_name: parallel-lunda
  data_files:
  - split: train
    path: "parallel-corpus/lunda.parquet"
- config_name: parallel-kaonde
  data_files:
  - split: train
    path: "parallel-corpus/kaonde.parquet"
- config_name: parallel-full
  data_files:
  - split: train
    path: "parallel-corpus/*.parquet"
- config_name: dictionary
  data_files:
  - split: train
    path: "dictionary/entries.parquet"
- config_name: mono-bemba
  data_files:
  - split: train
    path: "monolingual/bemba.parquet"
- config_name: mono-nyanja
  data_files:
  - split: train
    path: "monolingual/nyanja.parquet"
- config_name: mono-tonga
  data_files:
  - split: train
    path: "monolingual/tonga.parquet"
- config_name: mono-lozi
  data_files:
  - split: train
    path: "monolingual/lozi.parquet"
- config_name: mono-luvale
  data_files:
  - split: train
    path: "monolingual/luvale.parquet"
- config_name: mono-lunda
  data_files:
  - split: train
    path: "monolingual/lunda.parquet"
- config_name: mono-kaonde
  data_files:
  - split: train
    path: "monolingual/kaonde.parquet"
- config_name: mono-full
  data_files:
  - split: train
    path: "monolingual/*.parquet"
tags:
- zambia
- bemba
- nyanja
- tonga
- lozi
- luvale
- lunda
- kaonde
- low-resource
- parallel-corpus
- dictionary
- monolingual
- asr
---

# Liseli

Open data for seven Zambian languages: **Bemba, Nyanja (Chichewa), Tonga, Lozi, Luvale, Lunda, Kaonde**, paired with English.

This dataset aggregates four asset families used by the [Liseli](https://github.com/YumiMilling/liseli) project. All of it is released under CC-BY-SA-4.0 to match the most restrictive upstream source license. Cite the original sources when using specific subsets.

## Asset families

| Asset | Size | Load with |
|---|---|---|
| **Parallel corpus** | 242,986 en ↔ xx pairs | `load_dataset("GiJoeHansFranz/Liseli", "parallel-bemba")` |
| **Dictionary** | 43,010 entries (en → 7 langs) | `load_dataset("GiJoeHansFranz/Liseli", "dictionary")` |
| **Monolingual corpus** | 298,770 sentences | `load_dataset("GiJoeHansFranz/Liseli", "mono-nyanja")` |
| **Audio** | 82 language-course mp3s (~227 MB) | Files in `language-courses/audio/` |

Use `parallel-full` or `mono-full` to concatenate all seven languages.

## 1. Parallel corpus (`parallel-*`)

The headline pair count is **242,986** across 7 languages. The composition is heavily skewed toward religious text, so treat it as a collection of sub-corpora.

| Source | Approx pairs | Notes |
|---|---|---|
| `bible` | ~193k | Verse-aligned parallel, all 7 languages. Dominant but domain-narrow. |
| `ai-dictionary` | ~32k | **Single-word vocabulary lookups**, not sentence pairs. Duplicates the dictionary asset. |
| `dmatekenya` | ~13.6k | Nyanja-only agricultural extension text. Good quality, narrow domain. |
| `storybook` | ~2.9k | Children's narrative from [Storybooks Zambia](https://storybookszambia.net/). Highest-quality daily-language content; all 7 languages. |
| `wikimedia` | ~1k | Wikipedia-sourced Nyanja pairs. |
| `tatoeba`, `moe`, `community` | <100 | Misc. |

**For daily-language use cases, the useful subset is `storybook + wikimedia + dmatekenya`**: a few thousand pairs per language, not the full total. Filter by the `source` column.
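
Before committing to a subset, it can help to see how the rows split by `source`. The following is a minimal sketch that works on plain dictionaries rather than the downloaded data. The helper names `source_counts` and `daily_subset`, the constant `DAILY_SOURCES`, and the sample rows are hypothetical illustrations, not part of the dataset or its tooling; only the `source` column name and the source labels come from the card.

```python
from collections import Counter

# Source labels the card recommends for daily-language use.
DAILY_SOURCES = frozenset({"storybook", "wikimedia", "dmatekenya"})


def source_counts(rows):
    """Count rows per value of the `source` column."""
    return Counter(row["source"] for row in rows)


def daily_subset(rows, sources=DAILY_SOURCES):
    """Keep only rows whose `source` is in the allowlist."""
    return [row for row in rows if row["source"] in sources]


# Hypothetical placeholder rows standing in for real dataset rows.
rows = [
    {"english": "placeholder 1", "source": "bible"},
    {"english": "placeholder 2", "source": "bible"},
    {"english": "placeholder 3", "source": "storybook"},
    {"english": "placeholder 4", "source": "ai-dictionary"},
    {"english": "placeholder 5", "source": "tatoeba"},
]

counts = source_counts(rows)
daily = daily_subset(rows)
print(counts.most_common())
print(len(daily))
```

With the real data, the same predicate could be passed to `Dataset.filter`. Note that this allowlist also drops `tatoeba`, `moe`, and `community`, whereas excluding only `bible` and `ai-dictionary` would keep them; which behavior is preferable depends on the use case.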

**Schema:** `english`, `translation`, `language`, `domain`, `source`, `concept_id`, `sentence_id`

### Parallel pairs per target language

| Language | Total pairs |
|---|---|
| nyanja | 55,363 |
| lozi | 35,711 |
| luvale | 35,547 |
| tonga | 35,329 |
| bemba | 35,140 |
| lunda | 34,913 |
| kaonde | 10,983 |

## 2. Dictionary (`dictionary`)

Verified + scraped English ↔ Zambian-language word entries from open-licensed dictionary sources (FENZA Chinyanja, Harris Tonga, Chitonga, Bemba scrapes).

**Schema:** `english`, `language`, `translation`, `status`

### Dictionary entries per target language

| Language | Entries |
|---|---|
| tonga | 13,299 |
| nyanja | 10,121 |
| bemba | 4,547 |
| lozi | 4,498 |
| luvale | 4,279 |
| lunda | 3,634 |
| kaonde | 2,632 |

## 3. Monolingual corpus (`mono-*`)

Per-language sentence corpus used for language modeling, ASR prompting, and vocabulary coverage. Aggregates Bible, MoE teaching modules, Storybooks Zambia, Zambezi Voice transcripts, JW.org Cinyanja, dmatekenya agricultural extension, Masakhane NER, USAID content, and other PDFs.

**Schema:** `language`, `text`, `source`, `source_file`, `tier`, `domain`, `quality`, `dialect`

### Monolingual sentences per language

| Language | Sentences |
|---|---|
| nyanja | 79,096 |
| lozi | 43,931 |
| bemba | 40,104 |
| tonga | 39,897 |
| luvale | 39,535 |
| lunda | 37,979 |
| kaonde | 18,228 |

## 4. Audio (`language-courses/audio/`)

82 YouTube-pulled language-tutorial mp3s (~227 MB) spanning 12 course series across Bemba, Nyanja, and Tonga. Includes Bembling lessons, Kaputu Bemba Teacher, Chichewa 101 (hkatsonga), Learn Tonga / Nyanja, Zedlexicon Tonga, and others. Transcripts for most episodes are in the Liseli git repo under `data/llm_extract_*.json`.

These files are stored as-is (not as a HF dataset config).

Pull them with:

```python
from huggingface_hub import snapshot_download

# Download only the course mp3s from the dataset repo into a local folder.
snapshot_download(
    repo_id="GiJoeHansFranz/Liseli",
    repo_type="dataset",
    allow_patterns=["language-courses/audio/*.mp3"],
    local_dir="./liseli-audio",
)
```

## Loading examples

```python
from datasets import load_dataset

# Parallel: just Bemba, all sources
bem = load_dataset("GiJoeHansFranz/Liseli", "parallel-bemba", split="train")

# Parallel: daily-language only (filter out Bible and machine-generated word lookups)
daily = bem.filter(lambda r: r["source"] not in {"bible", "ai-dictionary"})

# Dictionary: all entries
d = load_dataset("GiJoeHansFranz/Liseli", "dictionary", split="train")
bemba_entries = d.filter(lambda r: r["language"] == "bemba")

# Monolingual: all seven languages concatenated
mono = load_dataset("GiJoeHansFranz/Liseli", "mono-full", split="train")
```

## Licensing and attribution

Upstream source licenses:

- **Storybooks Zambia**: CC-BY
- **BibleNLP / ebible**: varies by translation (see upstream)
- **MoE teaching modules**: Government of Zambia, public
- **Wikimedia**: CC-BY-SA
- **Storybooks bilingual PDFs**: CC-BY
- **dmatekenya**: MIT (upstream author attribution)
- **Masakhane NER**: CC-BY
- **FENZA Chinyanja dictionary**: open academic
- **Harris Tonga dictionary**: public domain
- **Zambezi Voice / BembaSpeech**: CC-BY-4.0
- **ai-dictionary**: LLM-extracted, not human-verified; use with caution

The aggregated dataset is released under **CC-BY-SA-4.0** as the lowest common denominator.

## Known limitations

- Religious/formal register is heavily over-represented in parallel + monolingual.
- `ai-dictionary` entries are machine-generated and not human-verified.
- **Kaonde** coverage is the thinnest on every axis (parallel, dictionary, monolingual).
- No dialect labels. Bemba in particular has significant regional variation not captured here.
- No per-row quality scores; rows are treated as "verified" only in the source-trust sense.
- Parallel corpus contains `bible` sentences that overlap with `monolingual` bible corpus; deduplicate if you are using both.
- No native-speaker pronunciation recordings yet (planned via the forthcoming tutor app).

## Citation

If you use this dataset, please cite the Liseli project and the upstream sources you rely on. A formal citation entry will be added once a release is tagged.
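
The Known limitations list suggests deduplicating when the parallel `bible` rows and the monolingual corpus are used together. One way this might look is sketched below on plain dictionaries. It rests on several assumptions the card does not confirm: that overlapping sentences match exactly after Unicode, case, and whitespace normalization; that the parallel `translation` column corresponds to the monolingual `text` column; and that `language` values are spelled identically in both configs. The helper names and the sample rows are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def drop_parallel_overlap(mono_rows, parallel_rows):
    """Return monolingual rows whose text does not appear, for the same
    language, among the parallel rows with source == "bible"."""
    seen = {
        (row["language"], normalize(row["translation"]))
        for row in parallel_rows
        if row["source"] == "bible"
    }
    return [
        row
        for row in mono_rows
        if (row["language"], normalize(row["text"])) not in seen
    ]


# Hypothetical placeholder rows; real rows carry more columns.
parallel_rows = [
    {"language": "bemba", "translation": "Placeholder  verse A", "source": "bible"},
    {"language": "bemba", "translation": "placeholder story line", "source": "storybook"},
]
mono_rows = [
    {"language": "bemba", "text": "placeholder verse a"},
    {"language": "bemba", "text": "placeholder story line"},
    {"language": "lozi", "text": "placeholder verse a"},
]

kept = drop_parallel_overlap(mono_rows, parallel_rows)
print(len(kept))
```

Exact matching after normalization will miss verses that were segmented or punctuated differently in the two corpora, so the result is better read as a lower bound on the overlap than as a complete deduplication.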

Provider:
GiJoeHansFranz