Bharat2004/fake-news-dataset
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/Bharat2004/fake-news-dataset
Resource summary:
---
language:
- en
license: apache-2.0
pretty_name: "Fake News Dataset"
dataset_name: "fake-news-dataset"
tags:
- fake-news
- misinformation
- fact-checking
- news-classification
- binary-classification
size: "~9.25M samples"
size_categories:
- "1M<n<10M"
task_categories:
- text-classification
task_ids:
- fact-checking
multilinguality:
- monolingual
annotations_creators:
- found
source_datasets:
- ISOT
- Kaggle
- various public fake-news corpora
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label_binary
    dtype: int64
  - name: label_6class
    dtype: int64
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 9249050
  dataset_size: 4420000000
homepage: "https://huggingface.co/datasets/Arko007/fake-news-dataset"
license_review_notes: "This aggregated dataset is distributed from this repository under the Apache-2.0 license. Upstream components (ISOT, Kaggle datasets, news publishers, Wikipedia-derived content, etc.) may carry different licenses (e.g., CC BY-SA) or redistribution restrictions. The maintainer is responsible for verifying license compatibility and preserving required attribution metadata for each upstream component; users must also confirm rights before redistribution."
---
# Fake News Dataset
A large-scale, multi-source English-language dataset for binary fake-news detection. The release contains ~9.25M curated and deduplicated samples intended primarily for training robust text-classification models on long-form news articles and headlines.
Key facts
- Size: ~9.25 million samples (train split)
- File provided: train.csv (~4.42 GB)
- Labels: binary `label_binary` — 0 = FAKE, 1 = REAL
- Language: English
- Primary task: Binary text classification / fact-checking
- Curator: Arko007
- Dataset landing page: https://huggingface.co/datasets/Arko007/fake-news-dataset
Quick start
1. Download `train.csv` from the dataset repo or the HF dataset page.
2. Stream it with the Hugging Face `datasets` library, or load the CSV locally (converting it to Parquet first if you want faster repeated reads).
3. When training, use stratified sampling or class weights to offset the heavy class imbalance (FAKE ≈ 3%).
Example (load CSV with datasets):
```python
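from datasets import load_dataset

# Load the local CSV as a single "train" split.
ds = load_dataset("csv", data_files="train.csv", split="train")
print(ds.column_names)  # column names found in the CSV header
print(ds[0])            # first record as a dict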
```
## Dataset Description
This dataset aggregates multiple public fake-news corpora, curated publisher content, and Kaggle collections into a single, standardized CSV suitable for large-scale model training. It was assembled to support domain-adaptive pretraining and large-scale fine-tuning (for example, the reference checkpoint Arko007/fake-news-roberta-5M).
Motivation
- Provide a large, diverse training corpus for detecting misinformation in news articles.
- Support multi-stage transfer learning (news-adaptive pretraining → task fine-tuning → specialized models such as political fact-checkers).
Composition
- Combined and deduplicated sources totaling ~9.25M rows.
- Representative class distribution (from the 5M subset used for analysis):
  - FAKE: ~2.8%
  - REAL: ~97.2%
Note on provenance
- To avoid metadata inconsistencies, this public release contains the two canonical fields (`text`, `label_binary`) in the primary `train.csv`.
- Per-record provenance (original source identifiers) is not included in the primary CSV for this release. Aggregated source statistics and provenance mapping files (if available) are provided in the repository under `provenance/` or can be requested from the maintainer. This decision avoids exposing third-party metadata that may carry additional redistribution constraints.
## Dataset Structure
Files in this release
- train.csv — main CSV file with the fields below (primary artifact; ~4.42 GB)
Primary columns
- text (string): article body, headline, or statement text
- label_binary (int): 0 = FAKE, 1 = REAL
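The two-column schema above can be checked while reading the file. The following is a minimal sketch using only the Python standard library; the function name `validate_rows` and the in-memory sample rows are illustrative assumptions, not part of the release.

```python
import csv
import io

EXPECTED_COLUMNS = ["text", "label_binary"]  # column names as documented above
VALID_LABELS = {0, 1}                        # 0 = FAKE, 1 = REAL

def validate_rows(handle):
    """Yield (text, label) pairs, raising ValueError on schema violations."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for record_no, row in enumerate(reader, start=1):
        label = int(row["label_binary"])
        if label not in VALID_LABELS:
            raise ValueError(f"record {record_no}: unexpected label {label}")
        yield row["text"], label

# Tiny in-memory sample standing in for train.csv (made-up rows).
sample = io.StringIO(
    'text,label_binary\n"Some headline, with a comma",1\nAnother statement,0\n'
)
rows = list(validate_rows(sample))
```

For the real file, pass `open("train.csv", newline="", encoding="utf-8")` instead of the sample. Long articles may exceed the `csv` module's default field size limit, in which case `csv.field_size_limit` would need to be raised first.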
Optional supplemental artifacts (in repo)
- provenance statistics and aggregation summaries (CSV/TSV)
- preprocessing and merging scripts
- checksums for distributed artifacts
Splits
- train: single split included with ~9.25M rows.
- No official validation/test splits are included by default. See "Recommended Splits & Training Advice" below.
## Dataset Creation
Curation steps
1. Collected multiple public datasets (ISOT, Kaggle fake-news datasets, COVID-19 misinformation corpora, political fake-news collections, scientific claim datasets).
2. Standardized fields to a unified schema (`text`, `label_binary`).
3. Performed duplicate and near-duplicate detection and removal across sources.
4. Cleaned text (unicode normalization, whitespace normalization, basic HTML artifact removal). URLs and certain boilerplate tokens were removed/normalized during preprocessing.
5. Mapped source-specific labels to binary labels: FAKE (0) and REAL (1). Multi-class truthfulness labels from some sources were conservatively collapsed into these two classes.
6. Assembled final `train.csv` and recorded aggregated source statistics.
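Steps 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the maintainer's pipeline: the NFKC normalization form, the regular expressions, and the function names are all hypothetical choices, and only exact duplicates (after cleaning and case-folding) are removed. Near-duplicate detection would need an additional technique such as shingling or MinHash.

```python
import hashlib
import html
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")   # assumed URL pattern
TAG_RE = re.compile(r"<[^>]+>")        # assumed "basic HTML artifact" pattern
SPACE_RE = re.compile(r"\s+")

def clean_text(text):
    """Unicode-normalize, drop HTML tags and URLs, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # normalization form is an assumption
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

def dedup_key(text):
    """Hash of the case-folded cleaned text; matches exact duplicates only."""
    return hashlib.sha1(clean_text(text).casefold().encode("utf-8")).hexdigest()

def deduplicate(records):
    """Keep the first (text, label) pair seen for each dedup key."""
    seen = set()
    kept = []
    for text, label in records:
        key = dedup_key(text)
        if key not in seen:
            seen.add(key)
            kept.append((clean_text(text), label))
    return kept

# Made-up records: the second is a duplicate of the first after cleaning.
records = [
    ("Breaking:  <b>Markets</b> rally  https://example.com/a", 1),
    ("breaking: markets rally", 1),
    ("Aliens endorse candidate", 0),
]
cleaned = deduplicate(records)
```

Case-folding is applied only to the dedup key, so the kept text retains its original casing, consistent with the release notes on casing.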
Preprocessing details
- Lowercasing and optional punctuation normalization were applied for analysis only; the released text retains its original casing by default (users may re-normalize).
- The general pipeline applies only minimal PII redaction; users deploying models in production should audit and remove sensitive content as required by their policies.
License & provenance
- This repository declares Apache-2.0. Upstream source licenses vary; maintainers must ensure that redistribution under Apache-2.0 is compatible with upstream terms and preserve required notices when applicable.
## Uses
Recommended
- Training transformer-based classifiers (RoBERTa, DeBERTa, etc.) for article-level fake-news detection.
- Domain-adaptive pretraining for news and misinformation tasks.
- Baselines and research into imbalance mitigation, calibration, and domain transfer.
Worked examples / reference
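- RoBERTa-base fine-tuned on a 5M subset (Arko007/fake-news-roberta-5M) achieved a validation accuracy of ~99.28% on in-domain news tasks. For context, at the ~97.2% REAL share reported for that subset, always predicting REAL would already score about 97.2%.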
- Transfer to LIAR (political statements) via further fine-tuning achieved ~71% accuracy for short-statement classification.
Not recommended
- Legal judgments of truth or presenting automated outputs as authoritative.
- Out-of-domain short-text tasks (consider a LIAR- or claims-focused dataset for that use).
## Recommended Splits & Training Advice
Suggested default split (if you need validation/test)
- Stratified 80/10/10 split by `label_binary` (preserve class proportion).
  - Train: 80%
  - Validation: 10%
  - Test: 10%
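A stratified 80/10/10 split can be sketched with the standard library alone. The function name `stratified_split`, the seed, and the toy labels are illustrative assumptions; in practice a library routine (for example a stratified splitter from scikit-learn) may be more convenient.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = int(len(idxs) * fractions[0])
        n_val = int(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy labels at roughly the documented imbalance: 30 FAKE (0) and 970 REAL (1).
labels = [0] * 30 + [1] * 970
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting each class separately keeps the rare FAKE class represented in every partition, which a plain random split does not guarantee on small samples.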
Because FAKE is rare (~3%), use one or more of these strategies:
- Class weighting in the loss (e.g., inverse-frequency weighting)
- Oversampling FAKE class or synthetic augmentation for FAKE examples
- Hard negative mining or curriculum learning
- Evaluate with metrics robust to imbalance (F1, precision/recall per class, AUPRC)
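As one concrete reading of inverse-frequency weighting, the sketch below weights each class by N / (K × count), where N is the number of samples and K the number of classes. This particular formula is an assumption chosen for illustration (it matches the common "balanced" heuristic); the toy label counts mirror the ~2.8% FAKE share reported above.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by N / (K * count), so rarer classes get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}

# Toy labels: 28 FAKE (0) and 972 REAL (1) out of 1000.
labels = [0] * 28 + [1] * 972
weights = inverse_frequency_weights(labels)
```

With these weights each class contributes equally to the total loss, so the FAKE weight here is roughly 35 times the REAL weight. Weights this large can destabilize training, so capping or smoothing them is a reasonable variation to try.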
For temporal generalization tests
- Consider chronological splits (train on older news, validate/test on more recent months/years) to simulate real-world deployment shifts.
## Bias, Risks, and Limitations
Known biases
- Source bias: ISOT and other major components can create publisher/style cues that models may exploit rather than learning content-level signals.
- Class imbalance: Extremely skewed toward REAL; naive accuracy is a poor metric.
- Domain & geographic bias: Over-representation of US/Western sources and political topics in some source subsets.
- Temporal bias: Data distribution may reflect particular time periods and not future misinformation strategies.
Risks
- High in-domain accuracy does not guarantee safe performance in the wild; adversarial and out-of-domain fake news can bypass models trained on these sources.
- Models may unduly rely on publisher/domain signals, leading to false positives/negatives when applied to new sources.
- Redistribution and legal risk for copyrighted publisher content — verify rights.
Limitations
- English-only dataset; not suitable for multilingual detection tasks.
- Binary labels remove nuance: there is no fine-grained truthfulness scale.
- Per-record provenance is not included in the primary `train.csv` release (see notes above).
Mitigation recommendations
- Evaluate on diverse and out-of-domain benchmarks (e.g., LIAR, PolitiFact corpora).
- Use human-in-the-loop verification for high-stakes applications.
- Report per-class precision/recall and calibration; prefer F1 and AUPRC over raw accuracy.
- Maintain provenance and be prepared to remove items if upstream rights owners request it.
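To illustrate why raw accuracy misleads at this class balance, the sketch below scores a degenerate "model" that always predicts REAL. The helper name `per_class_metrics` and the toy evaluation set are assumptions made for the example.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy evaluation set: 3 FAKE (0) and 97 REAL (1); the "model" always predicts REAL.
y_true = [0] * 3 + [1] * 97
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fake_metrics = per_class_metrics(y_true, y_pred, positive=0)
real_metrics = per_class_metrics(y_true, y_pred, positive=1)
```

Accuracy comes out at 97% even though precision, recall, and F1 for the FAKE class are all zero, which is why per-class reporting is recommended above.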
## Distribution & Storage
- Primary artifact: train.csv (~4.42 GB)
- Prefer streaming/Parquet or chunked CSV loading for training at scale.
- Repository contains checksums (when available) and recommended ingestion scripts.
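Chunked CSV loading can be sketched as a generator that yields fixed-size batches of rows, so the ~4.42 GB file never has to sit in memory at once. The function names and batch size below are illustrative assumptions, and the in-memory sample stands in for `train.csv`.

```python
import csv
import io
from collections import Counter

def iter_batches(handle, batch_size):
    """Yield lists of row dicts, at most `batch_size` rows each."""
    reader = csv.DictReader(handle)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def count_labels(handle, batch_size=10_000):
    """Tally label_binary values batch by batch."""
    counts = Counter()
    for batch in iter_batches(handle, batch_size):
        counts.update(int(row["label_binary"]) for row in batch)
    return counts

# Made-up sample: four REAL (1) rows and one FAKE (0) row.
sample = io.StringIO("text,label_binary\na,1\nb,1\nc,0\nd,1\ne,1\n")
counts = count_labels(sample, batch_size=2)
```

The same generator can feed a tokenizer or a Parquet writer one batch at a time; for the real file, open it with `newline=""` and `encoding="utf-8"`.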
## Contact & Maintainer
- Maintainer: Anamitra-Sarkar (GitHub: https://github.com/Anamitra-Sarkar)
- Hugging Face dataset page: https://huggingface.co/datasets/Arko007/fake-news-dataset
- For licensing or provenance concerns, open an issue on the dataset repo or contact the maintainer via the GitHub profile.
## Citation
If you use this dataset, please cite both the curated dataset and the original sources.
Curated dataset citation
```
@dataset{arko007_fakenews,
title = {Fake News Dataset: A Large-Scale Multi-Source Collection},
author = {Arko007},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Arko007/fake-news-dataset}
}
```
Representative source citations
```
@article{ahmed2017isot,
title={ISOT Fake News Dataset},
author={Ahmed, Hadeer and Traore, Issa and Saad, Sherif},
year={2017}
}
@inproceedings{wang2017liar,
title={"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection},
author={Wang, William Yang},
booktitle={ACL},
year={2017}
}
```
## Changelog
- v1.0 (2025-10-18): corrected metadata and filenames; primary release includes `train.csv` (~9.25M rows)
  - Notes in this version:
    - Replaced `num_rows` with `num_examples` to be compatible with the datasets library split schema.
    - Updated top-level license to Apache-2.0 and adjusted license review notes.
    - Ensured filename references match the actual artifact `train.csv`.
    - Added Quick Start and recommended stratified splits.
Provided by: Bharat2004