
niqqyniqqy/CiviVox-Swahili-text-corpus-v2.0

Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/niqqyniqqy/CiviVox-Swahili-text-corpus-v2.0
Description:
---
license: apache-2.0
task_categories:
- text-generation
tags:
- legal
size_categories:
- 10M<n<100M
---

# Swahili Text Dataset

## Overview

This dataset contains a comprehensive collection of Swahili text data, derived from the [AfriBERTa Corpus](https://huggingface.co/datasets/castorini/afriberta-corpus). It provides a rich resource for natural language processing tasks focused on the Swahili language.

## Dataset Details

- **Source**: [AfriBERTa Corpus](https://huggingface.co/datasets/castorini/afriberta-corpus) (Swahili subset)
- **Language**: Swahili
- **Size**: 1.54M
- **Format**: Hugging Face Dataset

## Content

The dataset consists of two main columns:

1. `id`: A unique identifier for each text entry
2. `text`: The Swahili text content

## Usage

You can load this dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("Adeptschneider/CiviVox-Swahili-text-corpus-v2.0")
```

## Data Fields

- `id`: string
- `text`: string

## Data Splits

This dataset combines training and test splits from the original AfriBERTa Corpus. The data has been shuffled with a fixed seed (42) to ensure reproducibility.

## Dataset Creation

This dataset was created by:

1. Loading the Swahili subset of the AfriBERTa Corpus
2. Concatenating the training and test splits
3. Shuffling the combined dataset
4. Extracting the 'id' and 'text' fields

## Intended Uses

This dataset can be used for various natural language processing tasks involving the Swahili language, such as:

- Language modeling
- Text classification
- Named entity recognition
- Machine translation (as a source or target language)
- Sentiment analysis
- And more...

## Limitations

- The dataset is limited to the content available in the original AfriBERTa Corpus.
- It may not represent all dialects or variations of the Swahili language.
- The quality and accuracy of the text content depend on the original data source.
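The four steps listed under Dataset Creation (load, concatenate, shuffle with seed 42, keep `id` and `text`) can be illustrated with a small sketch. This is not the author's actual build script, which the card does not include. It uses plain Python lists of dictionaries as stand-ins for the two splits, and the function name `build_corpus`, the toy records, and the extra `source` column are all hypothetical.

```python
import random


def build_corpus(train_rows, test_rows, seed=42):
    """Combine two splits, shuffle reproducibly, and keep only id and text."""
    # Step 2: concatenate the training and test splits.
    combined = list(train_rows) + list(test_rows)
    # Step 3: shuffle with a fixed seed so the order is reproducible.
    random.Random(seed).shuffle(combined)
    # Step 4: extract only the 'id' and 'text' fields.
    return [{"id": row["id"], "text": row["text"]} for row in combined]


# Toy placeholder records (hypothetical; not taken from the real corpus).
train = [
    {"id": f"train-{i}", "text": f"placeholder text {i}", "source": "train"}
    for i in range(5)
]
test = [
    {"id": f"test-{i}", "text": f"placeholder text {i}", "source": "test"}
    for i in range(2)
]

corpus = build_corpus(train, test)
print(len(corpus), sorted(corpus[0].keys()))
```

With the Hugging Face `datasets` library, the corresponding operations would likely be `concatenate_datasets`, `Dataset.shuffle(seed=42)`, and `Dataset.select_columns(["id", "text"])`. The row order they produce will differ from this sketch because the shuffling implementations are different.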

## Citation

If you use this dataset, please cite the original AfriBERTa Corpus:

```
@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi and Zhu, Yuxin and Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}
```

## Licensing Information

This dataset is derived from the AfriBERTa Corpus. For usage terms and conditions, please refer to the [original dataset's license](https://huggingface.co/datasets/castorini/afriberta-corpus).

## Contact

If you have questions or comments about this specific version of the dataset, please open an issue in this repository or contact ronleon76@gmail.com.

---

The dataset was created and curated by AdeptSchneider.

Last updated: 09/10/2024

Provider:
niqqyniqqy