castorini/afriberta-corpus

Name: castorini/afriberta-corpus
Creator: castorini
Published: 2022-10-19 21:33:04
License: no description available

Hugging Face | Updated 2022-10-19 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/castorini/afriberta-corpus

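Summary:

The AfriBERTas Corpus is the corpus used to train the AfriBERTa model. Most of the data comes from the BBC news website; for some languages, additional data comes from Common Crawl. The dataset covers several African languages, including Afaan Oromoo, Amharic, and Hausa. Each data point has an id field and a text field, and each language has a train split and a test split whose sizes vary by language. Because most of the text comes from a news website, models trained on it may be biased toward the news domain. In addition, the Common Crawl portion may contain personal and sensitive information.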

Provider:

castorini

Summary of original information

Dataset Card for AfriBERTas Corpus

Dataset Description

Dataset Summary

This corpus was used to train AfriBERTa. It primarily consists of data from the BBC news website, with additional data from Common Crawl for some languages.

Supported Tasks and Leaderboards

The primary use of this corpus is for pre-training language models.

Languages

afaanoromoo
amharic
gahuza
hausa
igbo
pidgin
somali
swahili
tigrinya
yoruba

Loading Dataset

To load the train split of the Somali corpus:

from datasets import load_dataset

dataset = load_dataset("castorini/afriberta-corpus", "somali", split="train")

To load the test split of the Pidgin corpus:

dataset = load_dataset("castorini/afriberta-corpus", "pidgin", split="test")
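
The two calls above differ only in the language config and the split, so they can be wrapped in a small helper that rejects names not listed on this card. This is a minimal sketch: `load_corpus`, `LANGUAGES`, `SPLITS`, and the `loader` parameter are hypothetical names introduced here for illustration, and depending on the installed version of the `datasets` library the underlying call may need additional arguments.

```python
# Config names copied from the Languages list on this card.
LANGUAGES = (
    "afaanoromoo", "amharic", "gahuza", "hausa", "igbo",
    "pidgin", "somali", "swahili", "tigrinya", "yoruba",
)
SPLITS = ("train", "test")


def load_corpus(language, split="train", loader=None):
    """Load one language/split pair, rejecting names not on the card.

    `loader` is a hypothetical hook (useful for testing); when it is None,
    the real `datasets.load_dataset` function is used.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unknown language config: {language!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    if loader is None:
        # Imported lazily so the validation above works without the library.
        from datasets import load_dataset
        loader = load_dataset
    return loader("castorini/afriberta-corpus", language, split=split)
```

For example, `load_corpus("somali")` would be equivalent to the first call above, and `load_corpus("pidgin", "test")` to the second.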

Dataset Structure

Data Instances

Each data point is a line of text. An example from the Igbo dataset:

{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}

Data Fields

id: id of the example
text: content as a string
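
As a rough illustration of the two fields, the Igbo example above can be parsed and checked with the standard library. The sketch assumes, based only on that example, that both `id` and `text` are strings; `parse_record` is a hypothetical helper, not part of the dataset or of any library.

```python
import json

# The Igbo example quoted above, written as one JSON line.
line = '{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}'


def parse_record(raw):
    """Parse one JSON line and check it has string `id` and `text` fields."""
    record = json.loads(raw)
    for field in ("id", "text"):
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return record


record = parse_record(line)
print(record["id"], record["text"])
```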

Data Splits

Each language has a train and test split, with varying sizes.
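
The card does not list the split sizes, so one way to see how they vary is to load each split and count its examples. The following is a sketch under that assumption: `split_sizes` is a hypothetical helper, and it takes the loading function as a parameter so it can be tried without downloading anything.

```python
def split_sizes(loader, languages, splits=("train", "test")):
    """Return {language: {split: number of examples}} using `loader`.

    `loader` is expected to behave like `datasets.load_dataset`: it takes
    the dataset path, a config name, and a `split` keyword, and returns an
    object that supports `len()`.
    """
    sizes = {}
    for language in languages:
        sizes[language] = {
            split: len(loader("castorini/afriberta-corpus", language, split=split))
            for split in splits
        }
    return sizes
```

With the real library this could be called as `split_sizes(load_dataset, ["somali", "pidgin"])` after `from datasets import load_dataset`; note that doing so downloads every requested split.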

Considerations for Using the Data

Discussion of Biases

The dataset is biased towards the news domain because most of it comes from the BBC news website. In addition, caution is advised when training text generation models on it, as the Common Crawl data may contain personal and sensitive information.
