naijaweb
Source: ModelScope community. Updated 2025-11-12; indexed 2024-11-02.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/naijaweb
Resource description:
# Naijaweb Dataset 🇳🇬
**Naijaweb** is a dataset of more than **270,000 documents**, totaling approximately **230 million GPT-2 tokens**. The data was scraped from web pages popular among Nigerians, providing a rich resource for modeling Nigerian linguistic and cultural contexts.
## Dataset Summary
| Features | Data Types |
|----------------|-------------|
| text | string |
| link | string |
| token_count | int64 |
| section | string |
| int_score | int64 |
| language | string |
| language_probability | float64 |
## Data Collection
The dataset was collected from **Nairaland.com** by extracting **about 30 million unique posts** from 19 different sections of the site. From these posts, **1,289,195 outbound links** were extracted. The content of the linked web pages was then extracted using **Trafilatura**, a popular library for web scraping and content extraction.
The full data collection code is available [in this repo](https://github.com/saheedniyi02/Naijaweb); kindly give it a star ⭐
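As an illustration of the link-extraction step described above, here is a minimal sketch that uses only the Python standard library. It is not the author's collection code (see the linked repo for that); the `outbound_links` helper, the `home_domain` parameter, and the sample HTML with its `example.com` URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outbound_links(post_html, home_domain="nairaland.com"):
    """Return unique absolute http(s) links that point away from home_domain."""
    collector = LinkCollector()
    collector.feed(post_html)
    seen, result = set(), []
    for href in collector.hrefs:
        parsed = urlparse(href)
        host = (parsed.hostname or "").lower()
        if parsed.scheme not in ("http", "https") or not host:
            continue  # skip relative links, mailto:, javascript:, and similar
        if host == home_domain or host.endswith("." + home_domain):
            continue  # internal link, not outbound
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


# Hypothetical post body: one outbound link (repeated), one internal, one relative.
sample_post = (
    '<p>Read <a href="https://news.example.com/story">this</a>, '
    '<a href="https://www.nairaland.com/politics">the Politics board</a>, '
    '<a href="/profile">my profile</a>, and '
    '<a href="https://news.example.com/story">this again</a>.</p>'
)
print(outbound_links(sample_post))
```

Each collected URL could then be downloaded and reduced to its main text, for example with Trafilatura's `fetch_url` followed by `extract`.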
## Data Cleaning
The cleaning process was conducted using **[Datatrove](https://github.com/huggingface/datatrove)**, the same library employed in cleaning the **[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)** dataset, which is known for its high quality. The data cleaning process involved multiple stages of deduplication, filtering, and normalization to ensure the dataset's quality matches that of other high-performing datasets.
### Data Cleaning Procedure:
- **URL Filtering**
- **Repetition and Quality Filtering**
- **Personally Identifiable Information (PII) Removal**
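To make the three stages concrete, the sketch below shows simplified plain-Python stand-ins for each one. It does not use Datatrove and is not the pipeline that produced Naijaweb; the blocklist, the duplicate-line heuristic, the `max_dup_fraction` threshold, and the `<EMAIL>` placeholder are all assumptions chosen for illustration.

```python
import re
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"spam.example.com"}  # hypothetical blocklist
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def url_allowed(url):
    """URL filtering: reject documents whose host is on the blocklist."""
    host = (urlparse(url).hostname or "").lower()
    return host not in BLOCKED_DOMAINS


def duplicate_line_fraction(text):
    """Repetition signal: share of non-empty lines that repeat an earlier line."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def mask_emails(text, placeholder="<EMAIL>"):
    """PII removal (e-mail addresses only): replace each match with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


def clean(doc, max_dup_fraction=0.3):
    """Return a cleaned copy of doc, or None if one of the filters rejects it."""
    if not url_allowed(doc["link"]):
        return None
    if duplicate_line_fraction(doc["text"]) > max_dup_fraction:
        return None
    return {**doc, "text": mask_emails(doc["text"])}


doc = {"link": "https://news.example.com/a", "text": "Contact me at ade@example.com\nThanks"}
print(clean(doc))
```

A production pipeline would typically apply many more heuristics per stage; the point here is only the order of operations: filter by URL, filter by repetition and quality, then rewrite the text to mask PII.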
## Example Entry
Each data point contains the following fields:
- `text`: the main body of the post or web page
- `link`: the original URL of the source content
- `token_count`: the number of GPT-2 tokens in the `text` field
- `section`: the Nairaland section where the post was found
- `int_score`: an integer score for the 'educational quality' of the text, based on [FineWeb's educational quality classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
- `language`: detected language of the text (e.g., `en`, `yo`, `ha`, `ig`)
- `language_probability`: the confidence score of the language detection algorithm
An example looks as follows:
```
{
'text': 'Governor Samuel Ortom of Benue State\nBy Peter Duru\nGovernor Samuel Ortom of Benue state has commended President Muhammadu Buhari for his directive to security agents to shoot anyone illegally bearing AK47 rifle in the country.\nThe Governor who gave the commendation Thursday in Makurdi said the President’s order would reduce the level of criminality, banditry and militia herders’ attacks on Benue communities as well as in other parts of the country.\nAccording to him, “the order would also make the communities safer for displaced farmers to return to their ancestral homes.\n“I wish to commend Mr. President for his recent order against those bearing AK47 rifles. This I am sure will reduce the high rate of criminality, banditary and militia herdsmen attacks on our farming communities,” the Governor said.\nHe noted that President Buhari had done the right thing by listening to the calls he and other concerned Nigerians made on the need for the Federal Government to act faster and decisively to save the country from degenerating to a state of anarchy.\n“I don’t only criticise, I also commend where necessary. And I want to say shame on those sycophants who were bashing me for writing to Mr. President because he has finally heeded my advice,” he added.\nGovernor Ortom said Nigeria belonged to all its citizens and only justice and equity anchored on the rule of law could guarantee the unity and stability of the country.\nComments expressed here do not reflect the opinions of Vanguard newspapers or any employee thereof.',
'link': 'https://www.vanguardngr.com/2021/03/ortom-commends-buhari-on-shoot-at-sight-order-on-ak47-bearing-criminals/amp/',
'token_count': 332,
'section': 'Politics',
'int_score': 1,
'language': 'en',
'language_probability': 0.9999465942382812
}
```
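A record like the one above can be checked against the field list with a few lines of Python. This is a minimal sketch: the `schema_problems` helper is hypothetical, the type mapping mirrors the feature table (`int64` as `int`, `float64` as `float`), and the range check on `language_probability` assumes the value is a probability in [0, 1].

```python
EXPECTED_TYPES = {
    "text": str,
    "link": str,
    "token_count": int,
    "section": str,
    "int_score": int,
    "language": str,
    "language_probability": float,
}


def schema_problems(record):
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    if not problems and not 0.0 <= record["language_probability"] <= 1.0:
        problems.append("language_probability outside [0, 1]")
    return problems


# The example entry from above, with the text shortened for readability.
record = {
    "text": "Governor Samuel Ortom of Benue State\nBy Peter Duru",
    "link": "https://www.vanguardngr.com/2021/03/ortom-commends-buhari-on-shoot-at-sight-order-on-ak47-bearing-criminals/amp/",
    "token_count": 332,
    "section": "Politics",
    "int_score": 1,
    "language": "en",
    "language_probability": 0.9999465942382812,
}
print(schema_problems(record))
```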
## Data Splits
- **Training Split:** 270,137 examples (620MB in size)
## How to Load the Dataset
To load the dataset using Hugging Face's `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("saheedniyi/naijaweb")
```
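Once loaded, the per-row metadata can be used to select a subset. The sketch below is one possible approach, not a recommended recipe: `keep_row` and `load_filtered` are hypothetical helpers, the default thresholds are arbitrary, and the split name `"train"` is an assumption based on the single training split listed in this card.

```python
def keep_row(row, languages=("en",), min_int_score=2, min_language_probability=0.9):
    """Return True if the row passes the language and quality thresholds."""
    return (
        row["language"] in languages
        and row["int_score"] >= min_int_score
        and row["language_probability"] >= min_language_probability
    )


def load_filtered(**thresholds):
    """Download the train split and keep only rows accepted by keep_row."""
    # Imported lazily so keep_row can be used without the `datasets` package.
    from datasets import load_dataset

    train = load_dataset("saheedniyi/naijaweb", split="train")
    return train.filter(lambda row: keep_row(row, **thresholds))


# The predicate can be tried on a plain dict before downloading anything.
example_row = {"language": "en", "int_score": 1, "language_probability": 0.9999465942382812}
print(keep_row(example_row))                   # rejected by the default min_int_score
print(keep_row(example_row, min_int_score=1))  # accepted with a lower threshold
```

Calling `load_filtered(languages=("yo", "ha", "ig"), min_int_score=0)` would, under the same assumptions, keep only rows detected as Yoruba, Hausa, or Igbo.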
## Social Impact
Naijaweb was created to make Nigerian web data more accessible, providing researchers and developers with a dataset rich in Nigerian contexts across various domains such as **Politics**, **Education**, **Business**, and **Health**.
## Bias and Ethical Considerations
Since the data is collected from publicly available web pages, inherent biases present in the sources may be reflected in the dataset. These biases can manifest in areas such as **language**, **ideology**, or **topic representation**. Users should be mindful of these potential biases when developing models, especially for sensitive areas like **legal** or **medical** information.
## Sections of the Dataset
The dataset comprises content from 19 different sections of **Nairaland.com**, covering topics such as **Politics**, **Education**, **Business**, and **Health**.
## Citation
If you use the Naijaweb dataset in your research, please cite it as follows:
```bibtex
@dataset{naijaweb_2024,
author = {Saheed Azeez},
title = {Naijaweb: A Web Scraped Nigerian Context Dataset},
year = {2024},
publisher = {Hugging Face Datasets},
version = {1.0.0},
url = {https://huggingface.co/datasets/saheedniyi/naijaweb},
}
```
Provider: maas
Created: 2024-10-29



