
castorini/afriberta-corpus

Source: Hugging Face. Updated 2022-10-19. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/castorini/afriberta-corpus
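Resource summary:
AfriBERTa's Corpus is the corpus used to train the AfriBERTa model. The data comes mainly from the BBC news website, with Common Crawl data added for some languages. The dataset covers several African languages, including Afaan Oromoo, Amharic, and Hausa. Each data point has an id field and a text field. Every language has a train split and a test split, and split sizes vary by language. Because most of the data comes from a news website, models trained on it may be biased towards the news domain. The Common Crawl portion may also contain personal and sensitive information.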
Provider:
castorini
Summary of original information

Dataset Card for AfriBERTa's Corpus

Dataset Description

Dataset Summary

This corpus was used to train AfriBERTa. It primarily consists of data from the BBC news website, with additional data from Common Crawl for some languages.

Supported Tasks and Leaderboards

The primary use of this corpus is for pre-training language models.

Languages

  • afaanoromoo
  • amharic
  • gahuza
  • hausa
  • igbo
  • pidgin
  • somali
  • swahili
  • tigrinya
  • yoruba

Loading Dataset

  • To load the train split of the Somali corpus:

    from datasets import load_dataset
    dataset = load_dataset("castorini/afriberta-corpus", "somali", split="train")

  • To load the test split of the Pidgin corpus:

    from datasets import load_dataset
    dataset = load_dataset("castorini/afriberta-corpus", "pidgin", split="test")
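
To load every language rather than a single one, the two calls above can be generalized. The following is a minimal sketch, assuming that each name in the Languages list is also a valid configuration name (the card only demonstrates this for "somali" and "pidgin"). The names LANGUAGES, SPLITS, iter_configs, and load_all are illustrative and are not part of the dataset or of the datasets library.

```python
from itertools import product

# Assumption: every name in the Languages list doubles as a config name.
LANGUAGES = [
    "afaanoromoo", "amharic", "gahuza", "hausa", "igbo",
    "pidgin", "somali", "swahili", "tigrinya", "yoruba",
]
SPLITS = ["train", "test"]


def iter_configs(languages=LANGUAGES, splits=SPLITS):
    """Return every (language, split) pair to request, as a list."""
    return list(product(languages, splits))


def load_all(languages=LANGUAGES, splits=SPLITS):
    """Load each requested split into a dict keyed by (language, split)."""
    # Imported lazily so iter_configs works without the library installed.
    from datasets import load_dataset

    return {
        (lang, split): load_dataset(
            "castorini/afriberta-corpus", lang, split=split
        )
        for lang, split in iter_configs(languages, splits)
    }
```

Calling load_all(["somali"], ["train"]) would fetch only the Somali train split. Calling it with no arguments downloads every split of every language, which may take a while.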

Dataset Structure

Data Instances

Each data point is a line of text. An example from the Igbo dataset:

{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}

Data Fields

  • id: id of the example
  • text: content as a string
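
The two fields can be checked with a few lines of code. This is a minimal sketch, assuming an example is available as a JSON string shaped like the Igbo instance above; parse_record is a hypothetical helper, not something the dataset provides.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one JSON-encoded example and check its two fields."""
    record = json.loads(line)
    missing = {"id", "text"} - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(record["text"], str):
        raise TypeError("text must be a string")
    return record


example = '{"id": "6", "text": "Ngwá ọrụ na-echebe ma na-ebuli gị na kọmputa."}'
record = parse_record(example)
print(record["id"], record["text"])
```

Note that id is a string in the example instance, not an integer, so convert it explicitly before any numeric comparison.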

Data Splits

Each language has a train and test split, with varying sizes.

Considerations for Using the Data

Discussion of Biases

Because most of the data comes from the BBC news website, the dataset is biased towards the news domain. Additionally, caution is advised when training text generation models, as the Common Crawl data may contain personal and sensitive information.
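
One way to act on the caution about personal information is to scrub obvious identifiers from the text field before training. The sketch below is illustrative only: it masks email addresses with a deliberately simple regular expression, and both EMAIL_RE and redact_emails are hypothetical names. It does not catch names, phone numbers, or other sensitive content, and the dataset card does not prescribe this or any other filtering step.

```python
import re

# Simple, permissive email pattern; an assumption, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text: str, placeholder: str = "[EMAIL]") -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


print(redact_emails("Write to a.b@example.com today."))
```

With the datasets library, a function like this could be applied to each example through the dataset's map method.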
