Bharat2004/fake-news-dataset

Name: Bharat2004/fake-news-dataset
Creator: Bharat2004
Published: 2026-04-17 13:22:48
License: No description available

Hugging Face | Updated 2026-04-17 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/Bharat2004/fake-news-dataset

Description:

---
language:
- en
license: apache-2.0
pretty_name: "Fake News Dataset"
dataset_name: "fake-news-dataset"
tags:
- fake-news
- misinformation
- fact-checking
- news-classification
- binary-classification
size: "~9.25M samples"
size_categories:
- "1M<n<10M"
task_categories:
- text-classification
task_ids:
- fact-checking
multilinguality:
- monolingual
annotations_creators:
- found
source_datasets:
- ISOT
- Kaggle
- various public fake-news corpora
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label_binary
    dtype: int64
  - name: label_6class
    dtype: int64
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 9249050
  dataset_size: 4420000000
homepage: "https://huggingface.co/datasets/Arko007/fake-news-dataset"
license_review_notes: "This aggregated dataset is distributed from this repository under the Apache-2.0 license. Upstream components (ISOT, Kaggle datasets, news publishers, Wikipedia-derived content, etc.) may carry different licenses (e.g., CC BY-SA) or redistribution restrictions. The maintainer is responsible for verifying license compatibility and preserving required attribution metadata for each upstream component; users must also confirm rights before redistribution."
---

# Fake News Dataset

A large-scale, multi-source English-language dataset for binary fake-news detection. The release contains ~9.25M curated and deduplicated samples intended primarily for training robust text-classification models on long-form news articles and headlines.

Key facts

- Size: ~9.25 million samples (train split)
- File provided: train.csv (~4.42 GB)
- Labels: binary `label_binary` (0 = FAKE, 1 = REAL)
- Language: English
- Primary task: Binary text classification / fact-checking
- Curator: Arko007
- Dataset landing page: https://huggingface.co/datasets/Arko007/fake-news-dataset

Quick start

1. Download train.csv from the dataset repo or HF dataset page.
2. Stream with the Hugging Face `datasets` library or load as Parquet/CSV locally.
3. When training, use stratified sampling or class weights due to heavy class imbalance (FAKE ≈ 3%).

Example (load CSV with datasets):

```python
from datasets import load_dataset

ds = load_dataset("csv", data_files="train.csv", split="train")
print(ds.column_names)
print(ds[0])
```

## Dataset Description

This dataset aggregates multiple public fake-news corpora, curated publisher content, and Kaggle collections into a single, standardized CSV suitable for large-scale model training. It was assembled to support domain-adaptive pretraining and large-scale fine-tuning (for example, the reference checkpoint Arko007/fake-news-roberta-5M).

Motivation

- Provide a large, diverse training corpus for detecting misinformation in news articles.
- Support multi-stage transfer learning (news-adaptive pretraining → task fine-tuning → specialized models such as political fact-checkers).

Composition

- Combined and deduplicated sources totaling ~9.25M rows.
- Representative class distribution (from the 5M subset used for analysis):
  - FAKE: ~2.8%
  - REAL: ~97.2%

Note on provenance

- To avoid metadata inconsistencies, this public release contains the two canonical fields (`text`, `label_binary`) in the primary `train.csv`.
- Per-record provenance (original source identifiers) is not included in the primary CSV for this release. Aggregated source statistics and provenance mapping files (if available) are provided in the repository under `provenance/` or can be requested from the maintainer. This decision avoids exposing third-party metadata that may carry additional redistribution constraints.
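The class shares listed under Composition can be re-derived without loading the full ~4.42 GB file into memory. The following is a minimal sketch using only Python's standard library, assuming `train.csv` has a header row with `text` and `label_binary` columns as described above. `label_distribution` is a hypothetical helper name, and the in-memory sample is a made-up stand-in for the real file.

```python
import csv
import io
from collections import Counter


def label_distribution(rows):
    """Count label_binary values from an iterable of CSV lines, one row at a time."""
    counts = Counter()
    for record in csv.DictReader(rows):
        counts[int(record["label_binary"])] += 1
    total = sum(counts.values())
    # Map each label to its share of all rows (empty input yields an empty dict).
    return {label: n / total for label, n in counts.items()} if total else {}


# Tiny in-memory stand-in for train.csv (assumed header: text,label_binary).
sample = io.StringIO(
    "text,label_binary\n"
    '"Officials confirm the budget vote, scheduled for Monday",1\n'
    "Miracle cure hidden by doctors,0\n"
    "Local council approves new park,1\n"
    "Markets close higher after earnings,1\n"
)
shares = label_distribution(sample)
print(shares)
```

For the real file, pass `open("train.csv", newline="", encoding="utf-8")` in place of the sample. Very long articles may exceed the `csv` module's default field size limit, in which case `csv.field_size_limit` can be raised first.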
## Dataset Structure

Files in this release

- train.csv: main CSV file with the fields below (primary artifact; ~4.42 GB)

Primary columns

- text (string): article body, headline, or statement text
- label_binary (int): 0 = FAKE, 1 = REAL

Optional supplemental artifacts (in repo)

- provenance statistics and aggregation summaries (CSV/TSV)
- preprocessing and merging scripts
- checksums for distributed artifacts

Splits

- train: single split included with ~9.25M rows.
- No official validation/test splits are included by default. See "Recommended Splits" below.

## Dataset Creation

Curation steps

1. Collected multiple public datasets (ISOT, Kaggle fake-news datasets, COVID-19 misinformation corpora, political fake-news collections, scientific claim datasets).
2. Standardized fields to a unified schema (`text`, `label_binary`).
3. Performed duplicate and near-duplicate detection and removal across sources.
4. Cleaned text (unicode normalization, whitespace normalization, basic HTML artifact removal). URLs and certain boilerplate tokens were removed/normalized during preprocessing.
5. Mapped source-specific labels to binary labels: FAKE (0) and REAL (1). Multi-class truthfulness labels from some sources were conservatively mapped to binary assignments during mapping.
6. Assembled final `train.csv` and recorded aggregated source statistics.

Preprocessing details

- Lowercasing and optional punctuation normalization used for analysis; raw release text retains original casing by default (users may re-normalize).
- Minimal PII redaction in the general pipeline; users deploying models in production should audit and remove sensitive content as required by their policies.

License & provenance

- This repository declares Apache-2.0. Upstream source licenses vary; maintainers must ensure that redistribution under Apache-2.0 is compatible with upstream terms and preserve required notices when applicable.
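Curation steps 3 and 4 (duplicate removal and text cleaning) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the maintainer's actual pipeline: the order of operations, the regular expressions, and the helper names `clean_text` and `deduplicate` are all hypothetical, and only exact duplicates after normalization are removed (near-duplicate detection is not shown).

```python
import hashlib
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
WS_RE = re.compile(r"\s+")


def clean_text(text):
    """Apply unicode, HTML, URL, and whitespace normalization (assumed order)."""
    text = unicodedata.normalize("NFKC", text)
    text = html.unescape(text)    # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)  # drop leftover HTML tags
    text = URL_RE.sub(" ", text)  # remove URLs
    return WS_RE.sub(" ", text).strip()  # collapse runs of whitespace


def deduplicate(records):
    """Keep the first (text, label) pair for each distinct cleaned text."""
    seen = set()
    unique = []
    for text, label in records:
        cleaned = clean_text(text)
        # Case-insensitive key so trivially different copies collapse together.
        key = hashlib.sha1(cleaned.casefold().encode("utf-8")).hexdigest()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append((cleaned, label))
    return unique


# Made-up records: the first two differ only in markup and spacing.
raw = [
    ("<p>Breaking:  markets &amp; banks rally</p>", 1),
    ("Breaking: markets & banks rally", 1),
    ("Read more at https://example.com/story today", 0),
]
cleaned = deduplicate(raw)
print(cleaned)
```

Hashing the normalized text keeps the `seen` set small relative to storing full articles, which matters at the ~9.25M-row scale described above.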
## Uses

Recommended

- Training transformer-based classifiers (RoBERTa, DeBERTa, etc.) for article-level fake-news detection.
- Domain-adaptive pretraining for news and misinformation tasks.
- Baselines and research into imbalance mitigation, calibration, and domain transfer.

Worked examples / reference

- RoBERTa-base fine-tuned on a 5M subset (Arko007/fake-news-roberta-5M) achieved a validation accuracy of ~99.28% on in-domain news tasks.
- Transfer to LIAR (political statements) via further fine-tuning achieved ~71% accuracy for short-statement classification.

Not recommended

- Legal judgments of truth or presenting automated outputs as authoritative.
- Out-of-domain short-text tasks (consider a LIAR- or claims-focused dataset for that use).

## Recommended Splits & Training Advice

Suggested default split (if you need validation/test)

- Stratified 80/10/10 split by `label_binary` (preserve class proportion).
  - Train: 80%
  - Validation: 10%
  - Test: 10%

Because FAKE is rare (~3%), use one or more of these strategies:

- Class weighting in the loss (e.g., inverse-frequency weighting)
- Oversampling FAKE class or synthetic augmentation for FAKE examples
- Hard negative mining or curriculum learning
- Evaluate with metrics robust to imbalance (F1, precision/recall per class, AUPRC)

For temporal generalization tests

- Consider chronological splits (train on older news, validate/test on more recent months/years) to simulate real-world deployment shifts.

## Bias, Risks, and Limitations

Known biases

- Source bias: ISOT and other major components can create publisher/style cues that models may exploit rather than learning content-level signals.
- Class imbalance: Extremely skewed toward REAL; naive accuracy is a poor metric.
- Domain & geographic bias: Over-representation of US/Western sources and political topics in some source subsets.
- Temporal bias: Data distribution may reflect particular time periods and not future misinformation strategies.
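Because class imbalance is the most directly actionable of these biases, the stratified 80/10/10 split and inverse-frequency class weighting suggested under "Recommended Splits & Training Advice" are sketched here. This is one possible reading of that advice, using only the standard library; `stratified_split` and `inverse_frequency_weights` are hypothetical helper names, the seed is arbitrary, and the toy label list merely imitates the ~3% FAKE rate.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return train/validation/test index lists that preserve label proportions."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


# Toy labels imitating the ~3% FAKE rate: 30 FAKE (0) and 970 REAL (1).
labels = [0] * 30 + [1] * 970
train_idx, val_idx, test_idx = stratified_split(labels)
weights = inverse_frequency_weights([labels[i] for i in train_idx])
print(len(train_idx), len(val_idx), len(test_idx), weights)
```

The weighting formula is the same heuristic scikit-learn uses for `class_weight="balanced"`. Weights are computed on the training indices only so that validation and test labels do not influence training.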
Risks

- High in-domain accuracy does not guarantee safe performance in the wild; adversarial and out-of-domain fake news can bypass models trained on these sources.
- Models may unduly rely on publisher/domain signals, leading to false positives/negatives when applied to new sources.
- Redistribution and legal risk for copyrighted publisher content: verify rights.

Limitations

- English-only dataset; not suitable for multilingual detection tasks.
- Binary labels remove nuance: there is no fine-grained truthfulness scale.
- Per-record provenance is not included in the primary `train.csv` release (see notes above).

Mitigation recommendations

- Evaluate on diverse and out-of-domain benchmarks (e.g., LIAR, PolitiFact corpora).
- Use human-in-the-loop verification for high-stakes applications.
- Report per-class precision/recall and calibration; prefer F1 and AUPRC over raw accuracy.
- Maintain provenance and be prepared to remove items if upstream rights owners request it.

## Distribution & Storage

- Primary artifact: train.csv (~4.42 GB)
- Prefer streaming/Parquet or chunked CSV loading for training at scale.
- Repository contains checksums (when available) and recommended ingestion scripts.

## Contact & Maintainer

- Maintainer: Anamitra-Sarkar (GitHub: https://github.com/Anamitra-Sarkar)
- Hugging Face dataset page: https://huggingface.co/datasets/Arko007/fake-news-dataset
- For licensing or provenance concerns, open an issue on the dataset repo or contact the maintainer via the GitHub profile.

## Citation

If you use this dataset, please cite both the curated dataset and the original sources.
Curated dataset citation

```
@dataset{arko007_fakenews,
  title = {Fake News Dataset: A Large-Scale Multi-Source Collection},
  author = {Arko007},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Arko007/fake-news-dataset}
}
```

Representative source citations

```
@article{ahmed2017isot,
  title={ISOT Fake News Dataset},
  author={Ahmed, Hadeer and Traore, Issa and Saad, Sherif},
  year={2017}
}

@inproceedings{wang2017liar,
  title={"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection},
  author={Wang, William Yang},
  booktitle={ACL},
  year={2017}
}
```

## Changelog

- v1.0 (2025-10-18): corrected metadata and filenames; primary release includes `train.csv` (~9.25M rows)
- Notes in this version:
  - Replaced `num_rows` with `num_examples` to be compatible with the datasets library split schema.
  - Updated top-level license to Apache-2.0 and adjusted license review notes.
  - Ensured filename references match the actual artifact `train.csv`.
  - Added Quick Start and recommended stratified splits.

Provider:

Bharat2004
