Adeptschneider/CiviVox-Swahili-text-corpus-v2.0

Name: Adeptschneider/CiviVox-Swahili-text-corpus-v2.0
Creator: Adeptschneider
Published: 2024-10-09 04:11:33
License: No description available

Source: Hugging Face. Updated: 2024-10-09. Indexed: 2024-12-14.

Download link:

https://hf-mirror.com/datasets/Adeptschneider/CiviVox-Swahili-text-corpus-v2.0

Description:

This dataset is a collection of Swahili text derived from the AfriBERTa Corpus, intended as a resource for natural language processing work on Swahili. It has two columns: id (a unique identifier for each text entry) and text (the Swahili text content). The listed size is 1.54M, and the data is distributed in Hugging Face Dataset format. Suitable tasks include language modeling, text classification, named entity recognition, machine translation, and sentiment analysis. Limitations: coverage is restricted to the content of the original AfriBERTa Corpus, the data may not represent all dialects or varieties of Swahili, and the quality and accuracy of the text depend on the original data sources.
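
As a rough illustration of the two-column layout described above, the following Python sketch loads the corpus with the Hugging Face `datasets` library and checks that each row carries `id` and `text`. The split name `"train"` and the helper names `load_corpus` and `check_schema` are assumptions made for this example, not something the listing states; consult the dataset card for the actual splits.

```python
from typing import Iterable, Mapping

DATASET_ID = "Adeptschneider/CiviVox-Swahili-text-corpus-v2.0"


def load_corpus(split: str = "train"):
    """Load the corpus from the Hugging Face Hub (requires network access).

    The default split name "train" is an assumption, not taken from the listing.
    """
    # Imported lazily so the rest of this sketch works without the library.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(DATASET_ID, split=split)


def check_schema(rows: Iterable[Mapping]) -> int:
    """Confirm every row has the documented columns; return the row count."""
    count = 0
    for row in rows:
        missing = {"id", "text"} - set(row)
        if missing:
            raise ValueError(f"row {count} is missing columns: {sorted(missing)}")
        count += 1
    return count


# Example usage (hypothetical; downloads the data on first call):
#   corpus = load_corpus()
#   print(check_schema(corpus), "rows")
#   print(corpus[0]["text"][:200])
```

Running the schema check before any downstream task is a cheap way to catch a renamed or missing column early.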

Provider:

Adeptschneider
