mdonigian/iab-news-classification

Hugging Face | Updated 2026-04-02 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/mdonigian/iab-news-classification
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- news
- iab
- content-taxonomy
- bert-training-data
pretty_name: IAB News Article Classification
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# IAB News Article Classification Dataset

A labeled dataset of **106,280 English-language news articles** classified into **35 IAB Content Taxonomy 3.1 Tier 1 categories**. Intended as training data for fine-tuning a BERT-based text classifier.

## Dataset Description

Each row contains the full article text, its source URL and domain, and a single IAB Tier 1 category label assigned by GPT-5-nano via the OpenAI Batch API.

### Columns

| Column | Type | Description |
|---|---|---|
| `url` | string | Source URL of the article |
| `domain` | string | Publisher domain (e.g. `nytimes.com`) |
| `maintext` | string | Extracted article body text |
| `iab_category` | string | IAB Content Taxonomy 3.1 Tier 1 label |

### Category Distribution

| Category | Count |
|---|---:|
| Politics | 16,867 |
| Sports | 11,401 |
| Business and Finance | 8,157 |
| Entertainment | 7,092 |
| Medical Health | 5,686 |
| Crime | 5,447 |
| Technology & Computing | 4,927 |
| Food & Drink | 4,740 |
| Science | 4,576 |
| Travel | 3,474 |
| Home & Garden | 3,417 |
| Pop Culture | 3,259 |
| Law | 2,212 |
| War and Conflicts | 2,157 |
| Disasters | 2,102 |
| Healthy Living | 2,065 |
| Style & Fashion | 1,613 |
| Automotive | 1,596 |
| Education | 1,551 |
| Personal Finance | 1,501 |
| Shopping | 1,379 |
| Pets | 1,304 |
| Fine Art | 1,290 |
| Family and Relationships | 1,213 |
| Real Estate | 1,035 |
| Video Gaming | 1,018 |
| Events | 836 |
| Religion & Spirituality | 753 |
| Hobbies & Interests | 711 |
| Books and Literature | 653 |
| Personal Celebrations & Life Events | 620 |
| Attractions | 538 |
| Careers | 515 |
| Communication | 376 |
| Holidays | 199 |

### Data Sources

Articles were collected from two sources:

1. **Common Crawl CC-NEWS**: WARC archives filtered to a curated list of major English-language news domains
2. **Spider.cloud**: Targeted crawls of additional domains to improve coverage of underrepresented IAB categories

Text was extracted from raw HTML using `readability-lxml` and BeautifulSoup, then filtered for minimum length and quality.

### Labeling Process

Labels were generated using **GPT-5-nano** through the OpenAI Batch API with structured outputs (JSON schema constrained to valid Tier 1 categories). Each article's first ~512 words were sent with the full list of 35 valid categories. The model was instructed to select exactly one Tier 1 category per article.

## Intended Use

This dataset is designed for **fine-tuning a BERT (or similar transformer) model** to classify news articles into IAB Content Taxonomy Tier 1 categories, replacing the LLM labeler with a fast, cheap, local classifier suitable for production-scale inference.

### Suggested Train/Test Split

The dataset is provided as a single split. A typical approach:

```python
from datasets import load_dataset

ds = load_dataset("mdonigian/iab-news-classification", split="train")
# Stratified splitting requires a ClassLabel column, so encode the string labels first.
ds = ds.class_encode_column("iab_category")
ds = ds.train_test_split(test_size=0.1, seed=42, stratify_by_column="iab_category")
```

## Limitations

- Labels are LLM-generated, not human-verified. Expect some noise, particularly for ambiguous articles that could fit multiple categories.
- Category distribution is imbalanced: it reflects the natural distribution of news topics plus targeted crawling for underrepresented categories.
- Articles are predominantly from US/UK English-language publishers.
- The quality of the `maintext` field depends on the HTML extraction pipeline; some articles may have formatting artifacts.

## Citation

If you use this dataset, please reference the IAB Content Taxonomy:

> Interactive Advertising Bureau. *IAB Content Taxonomy 3.1.*
> https://github.com/InteractiveAdvertisingBureau/Taxonomies
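
The Limitations section notes that the category distribution is imbalanced (Politics has 16,867 articles, Holidays only 199). One common mitigation when fine-tuning a classifier is to weight the loss by inverse class frequency. The following is a minimal sketch of that idea, not something the dataset card prescribes. The helper name `balanced_class_weights` and the toy labels are illustrative assumptions, and the formula `total / (num_classes * count)` is the widely used "balanced" heuristic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: total / (num_classes * count)} for an iterable of labels.

    Rare classes receive weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    num_classes = len(counts)
    return {
        label: total / (num_classes * count)
        for label, count in counts.items()
    }


# Toy labels standing in for the `iab_category` column (hypothetical data).
toy_labels = ["Politics"] * 6 + ["Sports"] * 3 + ["Holidays"] * 1
weights = balanced_class_weights(toy_labels)
```

In practice the labels would come from the `iab_category` column of the training split, and the resulting weights could be passed to a weighted cross-entropy loss. Whether weighting helps for this dataset is an empirical question, so it is worth comparing against an unweighted baseline.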
Provided by:
mdonigian