niqqyniqqy/CiviVox-Swahili-text-corpus-v2.0

Name: niqqyniqqy/CiviVox-Swahili-text-corpus-v2.0
Creator: niqqyniqqy
Published: 2026-04-06 15:31:42
License: No description provided

Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/niqqyniqqy/CiviVox-Swahili-text-corpus-v2.0

Resource description:

---
license: apache-2.0
task_categories:
- text-generation
tags:
- legal
size_categories:
- 10M<n<100M
---

# Swahili Text Dataset

## Overview

This dataset contains a comprehensive collection of Swahili text data, derived from the [AfriBERTa Corpus](https://huggingface.co/datasets/castorini/afriberta-corpus). It provides a rich resource for natural language processing tasks focused on the Swahili language.

## Dataset Details

- **Source**: [AfriBERTa Corpus](https://huggingface.co/datasets/castorini/afriberta-corpus) (Swahili subset)
- **Language**: Swahili
- **Size**: 1.54M
- **Format**: Hugging Face Dataset

## Content

The dataset consists of two main columns:

1. `id`: A unique identifier for each text entry
2. `text`: The Swahili text content

## Usage

You can load this dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("Adeptschneider/CiviVox-Swahili-text-corpus-v2.0")
```

## Data Fields

- `id`: string
- `text`: string

## Data Splits

This dataset combines training and test splits from the original AfriBERTa Corpus. The data has been shuffled with a fixed seed (42) to ensure reproducibility.

## Dataset Creation

This dataset was created by:

1. Loading the Swahili subset of the AfriBERTa Corpus
2. Concatenating the training and test splits
3. Shuffling the combined dataset
4. Extracting the 'id' and 'text' fields

## Intended Uses

This dataset can be used for various natural language processing tasks involving the Swahili language, such as:

- Language modeling
- Text classification
- Named entity recognition
- Machine translation (as a source or target language)
- Sentiment analysis
- And more...

## Limitations

- The dataset is limited to the content available in the original AfriBERTa Corpus.
- It may not represent all dialects or variations of the Swahili language.
- The quality and accuracy of the text content depend on the original data source.
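The creation steps listed under Dataset Creation (concatenate the splits, shuffle with a fixed seed, keep only `id` and `text`) can be illustrated with a minimal sketch. This is not the author's actual script: it uses plain Python lists and the standard-library `random` module as stand-ins for the Hugging Face objects, and the function name `build_corpus` and the toy records are hypothetical. With the `datasets` library, the corresponding calls would presumably be `concatenate_datasets`, `Dataset.shuffle(seed=42)`, and `Dataset.select_columns`, though a shuffle done that way will not produce the same ordering as the one below.

```python
import random
from typing import Dict, List

Record = Dict[str, str]


def build_corpus(train: List[Record], test: List[Record], seed: int = 42) -> List[Record]:
    """Concatenate two splits, shuffle deterministically, keep only id and text."""
    combined = list(train) + list(test)  # concatenate the training and test splits
    rng = random.Random(seed)            # local RNG so the fixed seed does not touch global state
    rng.shuffle(combined)                # shuffle the combined records in place
    # Drop every field except 'id' and 'text'.
    return [{"id": r["id"], "text": r["text"]} for r in combined]


# Toy records, made up for illustration (not taken from the corpus).
toy_train = [{"id": f"train-{i}", "text": f"sentensi {i}", "extra": "x"} for i in range(3)]
toy_test = [{"id": f"test-{i}", "text": f"sentensi {i}", "extra": "y"} for i in range(2)]

corpus = build_corpus(toy_train, toy_test)
print(len(corpus), sorted(corpus[0].keys()))
```

Because the seed is fixed, calling `build_corpus` again on the same inputs returns the records in the same order, which is the reproducibility property the card describes.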
## Citation

If you use this dataset, please cite the original AfriBERTa Corpus:

```
@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}
```

## Licensing Information

This dataset is derived from the AfriBERTa Corpus. For usage terms and conditions, please refer to the [original dataset's license](https://huggingface.co/datasets/castorini/afriberta-corpus).

## Contact

If you have questions or comments about this specific version of the dataset, please open an issue in this repository or contact ronleon76@gmail.com.

---

The dataset was created and curated by AdeptSchneider. Last updated: 09/10/2024

Provider:

niqqyniqqy