castorini/afriberta-corpus
Dataset Card for AfriBERTa's Corpus
Dataset Description
Dataset Summary
This corpus was used to train AfriBERTa. It primarily consists of data from the BBC news website, with additional data from Common Crawl for some languages.
Supported Tasks and Leaderboards
The primary use of this corpus is for pre-training language models.
Languages
- afaanoromoo
- amharic
- gahuza
- hausa
- igbo
- pidgin
- somali
- swahili
- tigrinya
- yoruba
Loading Dataset
- To load the train split of the Somali corpus:
  from datasets import load_dataset
  dataset = load_dataset("castorini/afriberta-corpus", "somali", split="train")
- To load the test split of the Pidgin corpus:
  dataset = load_dataset("castorini/afriberta-corpus", "pidgin", split="test")
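The same call pattern extends to any language listed above. The following is a minimal sketch, not part of the original card: the helper name `load_corpus` and its `loader` parameter are hypothetical, introduced only so the config name can be checked before any download starts and so the function can be exercised without network access.

```python
# Config names copied from the "Languages" list in this card.
LANGUAGES = [
    "afaanoromoo", "amharic", "gahuza", "hausa", "igbo",
    "pidgin", "somali", "swahili", "tigrinya", "yoruba",
]

REPO_ID = "castorini/afriberta-corpus"


def load_corpus(language, split, loader=None):
    """Load one split of one language config.

    `loader` is a hypothetical injection point; when omitted, the real
    `datasets.load_dataset` function is imported and used.
    """
    if language not in LANGUAGES:
        raise ValueError(f"unknown language config: {language!r}")
    if loader is None:
        # Imported lazily so the check above works without `datasets` installed.
        from datasets import load_dataset as loader
    return loader(REPO_ID, language, split=split)


# Example (requires the `datasets` package and network access):
# hausa_train = load_corpus("hausa", "train")
```

Validating the name up front gives a short local error message instead of a failed remote lookup.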
Dataset Structure
Data Instances
Each data point is a line of text. An example from the Igbo dataset:
{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}
Data Fields
- id: id of the example
- text: content as a string
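As an illustration of the two fields, here is a small sketch that parses the Igbo example shown above and checks that both fields are present and are strings. It assumes a record can be represented as a JSON object exactly as printed in this card; the helper name `parse_record` is hypothetical, and the on-disk storage format of the corpus is not specified here.

```python
import json

# The example record from the "Data Instances" section, as a JSON string.
EXAMPLE_LINE = '{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}'

REQUIRED_FIELDS = ("id", "text")


def parse_record(line):
    """Parse one JSON record and verify the documented fields."""
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], str):
            raise ValueError(f"field {field!r} is not a string")
    return record


record = parse_record(EXAMPLE_LINE)
print(record["id"])    # the example id, stored as a string
print(record["text"])  # the raw Igbo sentence
```

Note that in the example the id is the string "6", not an integer, so comparisons against numeric ids would need an explicit conversion.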
Data Splits
Each language has a train and test split, with varying sizes.
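Because the split sizes vary by language and are not listed here, it can help to inspect them after loading. The sketch below is an assumption-labeled helper (`split_sizes` is a hypothetical name) that works on any mapping from split name to a sized collection, such as the `DatasetDict` that `load_dataset` returns when no `split` argument is given.

```python
def split_sizes(splits):
    """Return the number of examples in each split.

    `splits` is any mapping of split name -> object supporting len().
    """
    return {name: len(data) for name, data in splits.items()}


# Example with the real corpus (requires `datasets` and network access):
# from datasets import load_dataset
# somali = load_dataset("castorini/afriberta-corpus", "somali")
# print(split_sizes(somali))

# Offline illustration with placeholder data (not real corpus sizes):
print(split_sizes({"train": ["a", "b", "c"], "test": ["d"]}))
```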
Considerations for Using the Data
Discussion of Biases
The dataset is biased towards the news domain because most of its text comes from the BBC news website. Additionally, caution is advised when training text generation models on it, as the Common Crawl data may contain personal and sensitive information.



