
ratishsp/fineweb-edu-misinfo

Source: Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ratishsp/fineweb-edu-misinfo
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- en
tags:
- misinformation
- content-safety
- fineweb
- pretraining-data
- data-quality
size_categories:
- 100K<n<1M
---

# FineWeb-Edu Misinformation Audit

A dataset of 200K documents from [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) annotated for misinformation content. FineWeb-Edu uses a classifier built on the Snowflake-arctic-embed embedding model to select "web pages of educational value", but the classifier optimizes for surface-level markers of educational writing (structure, citations, academic tone) and cannot assess whether content is factually accurate or ideologically motivated.

## Dataset composition

- **100K documents from known problematic domains** across 7 categories: pseudoscience, climate denial, conspiracy, antivax/medical misinformation, propaganda, hate/extremism, and Holocaust denial
- **100K randomly sampled documents** from FineWeb-Edu to establish background rates

All documents were annotated for misinformation content by Llama 4 Maverick (meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8) on Together AI.

## Fields

| Field | Description |
|---|---|
| `url` | Original URL from FineWeb-Edu |
| `domain` | Domain name extracted from URL (e.g. naturalnews.com) |
| `domain_category` | Source domain category (null for random sample) |
| `edu_score` | FineWeb-Edu classifier score |
| `text` | Full document text |
| `llama_label` | Annotation: benign, health_misinfo, pseudoscience, climate_denial, conspiracy_propaganda, hate_extremism |
| `llama_confidence` | Annotator confidence (high/medium/low) |
| `llama_reason` | Free-text reasoning |

## Key findings

- **4.1% of randomly sampled FineWeb-Edu documents** contain misinformation (health misinformation, pseudoscience, conspiracy theories, etc.)
- **39% of documents from known problematic domains** contain misinformation content, nearly 10x the rate in the random sample
- The full FineWeb-Edu dataset contains 1.53 billion documents. Extrapolating the 4.1% rate, approximately **63 million documents** may contain misinformation. From known problematic domains alone, an estimated **3 million documents** contain confirmed misinformation content.

## Label distribution

### Flagged domains (100K)

| Label | Count |
|---|---|
| benign | 60,804 |
| pseudoscience | 16,179 |
| health_misinfo | 10,142 |
| climate_denial | 5,393 |
| conspiracy_propaganda | 4,878 |
| hate_extremism | 2,261 |

### Random sample (100K)

| Label | Count |
|---|---|
| benign | 95,865 |
| health_misinfo | 2,087 |
| pseudoscience | 1,123 |
| conspiracy_propaganda | 475 |
| hate_extremism | 236 |
| climate_denial | 199 |

## Inter-annotator agreement

600 documents (300 flagged, 300 random) were independently annotated by both Llama 4 Maverick and Claude Sonnet 4.6 using the same prompt. Agreement was measured using Cohen's kappa.

| Subset | n | Binary agreement | Binary kappa | Multiclass agreement | Multiclass kappa |
|---|---|---|---|---|---|
| Flagged | 300 | 92.0% | 0.831 | 90.3% | 0.832 |
| Random | 300 | 98.7% | 0.850 | 98.0% | 0.779 |
| Overall | 600 | 95.3% | 0.862 | 94.2% | 0.842 |

Binary confusion matrix (Llama vs Claude, overall):

| | Claude: benign | Claude: misinfo |
|---|---|---|
| **Llama: benign** | 457 | 5 |
| **Llama: misinfo** | 23 | 115 |

## Citation

If you use this dataset, please cite:

```bibtex
@misc{puduppully2026fineweb-edu-misinfo,
  author = {Puduppully, Ratish},
  title = {FineWeb-Edu Misinformation Audit},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/ratishsp/fineweb-edu-misinfo}
}
```
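
The headline figures under Key findings and Inter-annotator agreement can be re-derived from the published tables. The following is a minimal sketch, not the author's own evaluation code. It assumes the rates are computed over the table totals (the counts sum to slightly under 100K per subset), and the helper names `misinfo_rate` and `cohen_kappa` are hypothetical, introduced here only for illustration.

```python
# Label counts copied from the "Label distribution" tables.
flagged = {
    "benign": 60804, "pseudoscience": 16179, "health_misinfo": 10142,
    "climate_denial": 5393, "conspiracy_propaganda": 4878, "hate_extremism": 2261,
}
random_sample = {
    "benign": 95865, "health_misinfo": 2087, "pseudoscience": 1123,
    "conspiracy_propaganda": 475, "hate_extremism": 236, "climate_denial": 199,
}

# Binary confusion matrix (rows: Llama, columns: Claude; order: benign, misinfo).
confusion = [[457, 5], [23, 115]]


def misinfo_rate(counts):
    """Fraction of documents with any non-benign label (assumed denominator: table total)."""
    total = sum(counts.values())
    return (total - counts["benign"]) / total


def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix of two annotators."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


random_rate = misinfo_rate(random_sample)
flagged_rate = misinfo_rate(flagged)
# Extrapolation as described in the card: 4.1% of 1.53 billion documents.
extrapolated_docs = 0.041 * 1.53e9
agreement = (confusion[0][0] + confusion[1][1]) / sum(map(sum, confusion))
kappa = cohen_kappa(confusion)

print(f"random-sample misinformation rate: {random_rate:.1%}")
print(f"flagged-domain misinformation rate: {flagged_rate:.1%}")
print(f"ratio flagged/random: {flagged_rate / random_rate:.1f}x")
print(f"extrapolated documents: {extrapolated_docs / 1e6:.0f} million")
print(f"binary agreement: {agreement:.1%}, Cohen's kappa: {kappa:.3f}")
```

Under these assumptions the recomputed values agree with the reported 4.1%, 39%, 63 million, 95.3%, and 0.862 after rounding.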
Provider:
ratishsp