niqqyniqqy/CiviVox-Swahili-text-corpus-v2.0
Source: Hugging Face. Updated 2026-04-06. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/niqqyniqqy/CiviVox-Swahili-text-corpus-v2.0
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
tags:
- legal
size_categories:
- 10M<n<100M
---
# Swahili Text Dataset
## Overview
This dataset is a collection of Swahili text derived from the Swahili subset of the [AfriBERTa Corpus](https://huggingface.co/datasets/castorini/afriberta-corpus). It is intended as a resource for natural language processing tasks on the Swahili language.
## Dataset Details
- **Source**: [AfriBERTa Corpus](https://huggingface.co/datasets/castorini/afriberta-corpus) (Swahili subset)
- **Language**: Swahili
- **Size**: 1.54M
- **Format**: Hugging Face Dataset
## Content
The dataset has two columns:
1. `id`: A unique identifier for each text entry
2. `text`: The Swahili text content
## Usage
You can load this dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (or reuse the local cache)
dataset = load_dataset("Adeptschneider/CiviVox-Swahili-text-corpus-v2.0")
```
## Data Fields
- `id`: string
- `text`: string
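As a minimal sketch of what this schema implies, the check below verifies that a record carries both fields as strings. The `validate_record` helper and the sample records are hypothetical; they are not part of the dataset or of the `datasets` library.

```python
from typing import Any, Mapping

# Field names and types as listed in this card: both are strings.
EXPECTED_FIELDS = {"id": str, "text": str}


def validate_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has exactly the expected fields as strings.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    if set(record) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(record[name], kind) for name, kind in EXPECTED_FIELDS.items())


# Made-up sample records (not taken from the dataset).
good = {"id": "0", "text": "Habari ya asubuhi."}
bad = {"id": 0, "text": "Habari ya asubuhi."}  # id is an int, not a string
print(validate_record(good), validate_record(bad))
```

A check like this could be run over a few rows after loading, before committing to a long preprocessing job.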
## Data Splits
This dataset combines training and test splits from the original AfriBERTa Corpus. The data has been shuffled with a fixed seed (42) to ensure reproducibility.
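Because the original splits are merged, anyone who needs a held-out set has to carve one out again. One reproducible way to do that, sketched here in plain Python, is to assign each record by hashing its `id`. The `assign_split` helper, the 10% test fraction, and the sample ids are assumptions for illustration; this card does not prescribe a split.

```python
import hashlib


def assign_split(record_id: str, test_fraction: float = 0.1) -> str:
    """Deterministically map an id to 'train' or 'test'.

    Hypothetical helper: hashes the id so the assignment does not depend
    on row order or on a random number generator's state.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"


# Made-up ids, not taken from the dataset.
ids = [str(i) for i in range(1000)]
splits = [assign_split(i) for i in ids]
print(splits.count("train"), splits.count("test"))
```

Hashing the id rather than reshuffling keeps each record in the same split even if rows are later added or reordered.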
## Dataset Creation
This dataset was created by:
1. Loading the Swahili subset of the AfriBERTa Corpus
2. Concatenating the training and test splits
3. Shuffling the combined dataset with a fixed seed (42)
4. Extracting the 'id' and 'text' fields
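The four steps above can be sketched in plain Python, assuming each split is a list of dictionaries. This illustrates the logic only and is not the author's actual script: `build_corpus`, the sample rows, and the extra `source` field are hypothetical, and `random.Random(42)` will not reproduce the row order produced by the `datasets` library's own shuffle.

```python
import random
from typing import Dict, List

Row = Dict[str, str]


def build_corpus(train: List[Row], test: List[Row], seed: int = 42) -> List[Row]:
    """Concatenate two splits, shuffle with a fixed seed, keep 'id' and 'text'."""
    combined = train + test                # step 2: concatenate (new list, inputs untouched)
    random.Random(seed).shuffle(combined)  # step 3: seeded, repeatable shuffle
    # Step 4: keep only the two fields described in this card.
    return [{"id": row["id"], "text": row["text"]} for row in combined]


# Step 1 stand-in: made-up rows in place of the real Swahili subset.
train_rows = [{"id": f"train-{i}", "text": f"sentensi {i}", "source": "x"} for i in range(5)]
test_rows = [{"id": f"test-{i}", "text": f"sentensi {i}", "source": "x"} for i in range(2)]

corpus = build_corpus(train_rows, test_rows)
print(len(corpus), sorted(corpus[0]))  # prints: 7 ['id', 'text']
```

Calling `build_corpus` twice with the same inputs and seed yields the same order, which is the reproducibility property the fixed seed is meant to provide.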
## Intended Uses
This dataset can be used for various natural language processing tasks involving the Swahili language, such as:
- Language modeling
- Text classification
- Named entity recognition
- Machine translation (as a source or target language)
- Sentiment analysis
- And more...
## Limitations
- The dataset is limited to the content available in the original AfriBERTa Corpus.
- It may not represent all dialects or variations of the Swahili language.
- The quality and accuracy of the text content depend on the original data source.
## Citation
If you use this dataset, please cite the original AfriBERTa Corpus:
```
@inproceedings{ogueji-etal-2021-small,
title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
author = "Ogueji, Kelechi and
Zhu, Yuxin and
Lin, Jimmy",
booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.mrl-1.11",
pages = "116--126",
}
```
## Licensing Information
This dataset is derived from the AfriBERTa Corpus. For usage terms and conditions, please refer to the [original dataset's license](https://huggingface.co/datasets/castorini/afriberta-corpus).
## Contact
If you have questions or comments about this specific version of the dataset, please open an issue in this repository or contact ronleon76@gmail.com.
---
The dataset was created and curated by AdeptSchneider.
Last updated: 09/10/2024
---
Provider: niqqyniqqy



