ratishsp/fineweb-edu-misinfo
Source: Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ratishsp/fineweb-edu-misinfo
Resource overview:
---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- misinformation
- content-safety
- fineweb
- pretraining-data
- data-quality
size_categories:
- 100K<n<1M
---
# FineWeb-Edu Misinformation Audit
A dataset of 200K documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) annotated for misinformation content. FineWeb-Edu selects "web pages of educational value" with a classifier built on Snowflake-arctic-embed embeddings. That classifier optimizes for surface-level markers of educational writing (structure, citations, academic tone) and cannot assess whether content is factually accurate or ideologically motivated.
## Dataset composition
- **100K documents from known problematic domains** across 7 categories: pseudoscience, climate denial, conspiracy, antivax/medical misinformation, propaganda, hate/extremism, and Holocaust denial
- **100K randomly sampled documents** from FineWeb-Edu to establish background rates
All documents were annotated by Llama 4 Maverick (meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) on Together AI for misinformation content.
## Fields
| Field | Description |
|---|---|
| `url` | Original URL from FineWeb-Edu |
| `domain` | Domain name extracted from URL (e.g. naturalnews.com) |
| `domain_category` | Source domain category (null for random sample) |
| `edu_score` | FineWeb-Edu classifier score |
| `text` | Full document text |
| `llama_label` | Annotation: benign, health_misinfo, pseudoscience, climate_denial, conspiracy_propaganda, hate_extremism |
| `llama_confidence` | Annotator confidence (high/medium/low) |
| `llama_reason` | Free-text reasoning |
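
The fields above lend themselves to simple filtering by label and annotator confidence. The following is a minimal sketch that assumes rows are plain dictionaries keyed by the field names in the table. The helper names `select_misinfo` and `top_domains` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter

# Label names are taken from the `llama_label` field description.
MISINFO_LABELS = {
    "health_misinfo",
    "pseudoscience",
    "climate_denial",
    "conspiracy_propaganda",
    "hate_extremism",
}

# Ordering of the `llama_confidence` values (high/medium/low).
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def select_misinfo(rows, min_confidence="high"):
    """Return rows with a non-benign label at or above `min_confidence`."""
    threshold = CONFIDENCE_RANK[min_confidence]
    return [
        row
        for row in rows
        if row["llama_label"] in MISINFO_LABELS
        and CONFIDENCE_RANK.get(row["llama_confidence"], -1) >= threshold
    ]


def top_domains(rows, k=3):
    """Count the most frequent `domain` values among `rows`."""
    return Counter(row["domain"] for row in rows).most_common(k)


# Hypothetical rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {"domain": "a.example", "llama_label": "pseudoscience", "llama_confidence": "high"},
    {"domain": "a.example", "llama_label": "health_misinfo", "llama_confidence": "medium"},
    {"domain": "b.example", "llama_label": "benign", "llama_confidence": "high"},
    {"domain": "c.example", "llama_label": "conspiracy_propaganda", "llama_confidence": "low"},
]

high_only = select_misinfo(sample_rows, min_confidence="high")
medium_up = select_misinfo(sample_rows, min_confidence="medium")
print(len(high_only), len(medium_up), top_domains(medium_up))
```

With the Hugging Face `datasets` library, rows streamed from `load_dataset("ratishsp/fineweb-edu-misinfo", split="train", streaming=True)` could be passed to the same helpers. This assumes a `train` split exists, since the card does not name its splits.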
## Key findings
- **4.1% of randomly sampled FineWeb-Edu documents** contain misinformation (health misinformation, pseudoscience, conspiracy theories, etc.)
- **39% of documents from known problematic domains** contain misinformation content, nearly 10x the background rate in the random sample
- The full FineWeb-Edu dataset contains 1.53 billion documents. Extrapolating the 4.1% rate (0.041 × 1.53 billion), approximately **63 million documents** may contain misinformation. From known problematic domains alone, an estimated **3 million documents** contain misinformation content.
## Label distribution
### Flagged domains (100K)
| Label | Count |
|---|---|
| benign | 60,804 |
| pseudoscience | 16,179 |
| health_misinfo | 10,142 |
| climate_denial | 5,393 |
| conspiracy_propaganda | 4,878 |
| hate_extremism | 2,261 |
### Random sample (100K)
| Label | Count |
|---|---|
| benign | 95,865 |
| health_misinfo | 2,087 |
| pseudoscience | 1,123 |
| conspiracy_propaganda | 475 |
| hate_extremism | 236 |
| climate_denial | 199 |
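
The headline rates under "Key findings" can be reproduced from the two label tables. The sketch below copies the counts from those tables and the 1.53 billion document total from the findings; the helper name `misinfo_rate` is hypothetical. Note that the labeled counts sum to 99,657 (flagged) and 99,985 (random) rather than exactly 100,000. The card does not say why, so the sketch uses the labeled totals as denominators.

```python
# Counts copied from the "Flagged domains" and "Random sample" tables.
FLAGGED = {
    "benign": 60_804,
    "pseudoscience": 16_179,
    "health_misinfo": 10_142,
    "climate_denial": 5_393,
    "conspiracy_propaganda": 4_878,
    "hate_extremism": 2_261,
}
RANDOM = {
    "benign": 95_865,
    "health_misinfo": 2_087,
    "pseudoscience": 1_123,
    "conspiracy_propaganda": 475,
    "hate_extremism": 236,
    "climate_denial": 199,
}

# Corpus size as stated in "Key findings".
TOTAL_DOCS = 1.53e9


def misinfo_rate(counts):
    """Share of labeled documents whose label is anything other than benign."""
    total = sum(counts.values())
    return (total - counts["benign"]) / total


flagged_rate = misinfo_rate(FLAGGED)
random_rate = misinfo_rate(RANDOM)
ratio = flagged_rate / random_rate
extrapolated = random_rate * TOTAL_DOCS

print(f"flagged rate: {flagged_rate:.1%}")
print(f"random rate:  {random_rate:.1%}")
print(f"ratio:        {ratio:.1f}x")
print(f"extrapolated: {extrapolated / 1e6:.0f} million documents")
```

The extrapolation treats the random sample as representative of the full corpus and takes the annotator labels at face value, so it is best read as an order-of-magnitude estimate.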
## Inter-annotator agreement
600 documents (300 flagged, 300 random) were independently annotated by both Llama 4 Maverick and Claude Sonnet 4.6 using the same prompt. Agreement was measured using Cohen's kappa.
| Subset | n | Binary agreement | Binary kappa | Multiclass agreement | Multiclass kappa |
|---|---|---|---|---|---|
| Flagged | 300 | 92.0% | 0.831 | 90.3% | 0.832 |
| Random | 300 | 98.7% | 0.850 | 98.0% | 0.779 |
| Overall | 600 | 95.3% | 0.862 | 94.2% | 0.842 |
Binary confusion matrix (Llama vs Claude, overall):
| | Claude: benign | Claude: misinfo |
|---|---|---|
| **Llama: benign** | 457 | 5 |
| **Llama: misinfo** | 23 | 115 |
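
As a check, the overall binary agreement and kappa can be recomputed from the confusion matrix above. This is a minimal sketch of the standard Cohen's kappa formula, (observed agreement − chance agreement) / (1 − chance agreement). The function name `agreement_and_kappa` is hypothetical and is not taken from the author's evaluation code.

```python
def agreement_and_kappa(matrix):
    """Compute raw agreement and Cohen's kappa from a square confusion matrix.

    Rows are one annotator's labels, columns are the other's.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of the two annotators' marginal label rates.
    row_rates = [sum(row) / total for row in matrix]
    col_rates = [sum(row[j] for row in matrix) / total for j in range(size)]
    expected = sum(r * c for r, c in zip(row_rates, col_rates))
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Rows: Llama (benign, misinfo). Columns: Claude (benign, misinfo).
BINARY_OVERALL = [
    [457, 5],
    [23, 115],
]

observed, kappa = agreement_and_kappa(BINARY_OVERALL)
print(f"agreement = {observed:.1%}, kappa = {kappa:.3f}")
# agreement = 95.3%, kappa = 0.862
```

These values match the "Overall" row of the agreement table. The matrix also shows that most disagreements (23 of 28) are documents Llama flagged and Claude labeled benign.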
## Citation
If you use this dataset, please cite:
```bibtex
@misc{puduppully2026fineweb-edu-misinfo,
author = {Puduppully, Ratish},
title = {FineWeb-Edu Misinformation Audit},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ratishsp/fineweb-edu-misinfo}
}
```
Provider:
ratishsp



