mdonigian/iab-news-classification
收藏Hugging Face2026-04-02 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/mdonigian/iab-news-classification
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- news
- iab
- content-taxonomy
- bert-training-data
pretty_name: IAB News Article Classification
size_categories:
- 100K<n<1M
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
# IAB News Article Classification Dataset
A labeled dataset of **106,280 English-language news articles** classified into **35 IAB Content Taxonomy 3.1 Tier 1 categories**. Intended as training data for fine-tuning a BERT-based text classifier.
## Dataset Description
Each row contains the full article text, its source URL and domain, and a single IAB Tier 1 category label assigned by GPT-5-nano via the OpenAI Batch API.
### Columns
| Column | Type | Description |
|---|---|---|
| `url` | string | Source URL of the article |
| `domain` | string | Publisher domain (e.g. `nytimes.com`) |
| `maintext` | string | Extracted article body text |
| `iab_category` | string | IAB Content Taxonomy 3.1 Tier 1 label |
### Category Distribution
| Category | Count |
|---|---:|
| Politics | 16,867 |
| Sports | 11,401 |
| Business and Finance | 8,157 |
| Entertainment | 7,092 |
| Medical Health | 5,686 |
| Crime | 5,447 |
| Technology & Computing | 4,927 |
| Food & Drink | 4,740 |
| Science | 4,576 |
| Travel | 3,474 |
| Home & Garden | 3,417 |
| Pop Culture | 3,259 |
| Law | 2,212 |
| War and Conflicts | 2,157 |
| Disasters | 2,102 |
| Healthy Living | 2,065 |
| Style & Fashion | 1,613 |
| Automotive | 1,596 |
| Education | 1,551 |
| Personal Finance | 1,501 |
| Shopping | 1,379 |
| Pets | 1,304 |
| Fine Art | 1,290 |
| Family and Relationships | 1,213 |
| Real Estate | 1,035 |
| Video Gaming | 1,018 |
| Events | 836 |
| Religion & Spirituality | 753 |
| Hobbies & Interests | 711 |
| Books and Literature | 653 |
| Personal Celebrations & Life Events | 620 |
| Attractions | 538 |
| Careers | 515 |
| Communication | 376 |
| Holidays | 199 |
### Data Sources
Articles were collected from two sources:
1. **Common Crawl CC-NEWS** — WARC archives filtered to a curated list of major English-language news domains
2. **Spider.cloud** — Targeted crawls of additional domains to improve coverage of underrepresented IAB categories
Text was extracted from raw HTML using `readability-lxml` and BeautifulSoup, then filtered for minimum length and quality.
### Labeling Process
Labels were generated using **GPT-5-nano** through the OpenAI Batch API with structured outputs (JSON schema constrained to valid Tier 1 categories). Each article's first ~512 words were sent with the full list of 35 valid categories. The model was instructed to select exactly one Tier 1 category per article.
## Intended Use
This dataset is designed for **fine-tuning a BERT (or similar transformer) model** to classify news articles into IAB Content Taxonomy Tier 1 categories — replacing the LLM labeler with a fast, cheap, local classifier suitable for production-scale inference.
### Suggested Train/Test Split
The dataset is provided as a single split. A typical approach:
```python
from datasets import load_dataset
ds = load_dataset("YOUR_USERNAME/YOUR_REPO_NAME")
ds = ds["train"].train_test_split(test_size=0.1, seed=42, stratify_by_column="iab_category")
```
## Limitations
- Labels are LLM-generated, not human-verified. Expect some noise, particularly for ambiguous articles that could fit multiple categories.
- Category distribution is imbalanced — reflects the natural distribution of news topics plus targeted crawling for underrepresented categories.
- Articles are predominantly from US/UK English-language publishers.
- The `maintext` field quality depends on the HTML extraction pipeline; some articles may have formatting artifacts.
## Citation
If you use this dataset, please reference the IAB Content Taxonomy:
> Interactive Advertising Bureau. *IAB Content Taxonomy 3.1.*
> https://github.com/InteractiveAdvertisingBureau/Taxonomies
提供机构:
mdonigian



