Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection

Figshare2025-07-11 更新2026-04-28 收录

下载链接：

https://figshare.com/articles/dataset/Aggregated_Fake_News_Corpus_for_X-FRAME_Preprocessed_Multi-Domain_Dataset_for_Explainable_Misinformation_Detection/29539820

下载链接

链接失效反馈

官方服务：

资源简介：

This dataset is associated with the research article titled:"Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection"This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with emphasis on explainability, cross-modal generalization, and robust performance.🗂️ Dataset ContentsThis repository contains the following resources:Aggregated Raw Corpus (aggregated_raw.csv)286,260 samples across 8 datasets.Binary labels (1 = Fake, 0 = Real)Includes metadata: source dataset, topic (if available), speaker/source, etc.Preprocessed Text Corpus (aggregated_cleaned.csv)Includes standardized and cleaned cleaned_text column.Text normalization applied using SpaCy (lowercasing, lemmatization, punctuation/URL/user removal).Fully Encoded Feature Matrix (xframe_features_encoded.csv)104 structured features derived from communication theory and media psychology.Includes source encoding, speaker credibility, social engagement, sentiment, subjectivity, sensationalism, and readability scores.All numerical features scaled to [0, 1]; categorical features one-hot encoded.Data Splitstrain.csv, val.csv, test.csv: Stratified splits of the cleaned and encoded data.Feature Metadata (feature_description.pdf)Documentation of all 104 features with descriptions, data sources, and rationales.🔧 Preprocessing OverviewTo ensure robust and generalizable modeling, the following standardized pipeline was applied:Text Preprocessing: Cleaned using SpaCy, lowercased, lemmatized, and stripped of stopwords, URLs, and usernames.Label Mapping:Datasets with multiclass labels (e.g., LIAR, FNC-1) were mapped to a unified binary schema using theory-informed rules.1 = Fake includes false, pants-on-fire, disagree, etc.; 0 = Real includes true, agree, mostly-true.Deduplication: Removed near-duplicate entries across datasets using fuzzy string matching and content hashing.Feature Engineering:Source credibility features (e.g., speaker credibility from LIAR).Social context (e.g., tweet volume, user mentions).Framing indicators (e.g., sentiment, subjectivity, sensationalism, readability).Feature Encoding: One-hot encoding for categorical attributes, Min-Max scaling for numerical features.📚 Original Data SourcesThis aggregated corpus was derived from the following datasets. Please cite them individually alongside this collection:LIAR – Wang (2017): https://doi.org/10.18653/v1/P17-2067FakeNewsNet (PolitiFact, BuzzFeed, GossipCop) – Shu et al.: https://doi.org/10.1145/3363574ISOT – Ahmed et al.: https://doi.org/10.48550/arXiv.1708.07104WELFake – Verma et al.: https://doi.org/10.1109/TCSS.2021.3068519FNC-1 – https://www.fakenewschallenge.org/FakeNewsAMT – Pérez-Rosas et al.: https://doi.org/10.18653/v1/C18-1287Celebrity Rumors – Horne & Adalı: https://doi.org/10.1609/icwsm.v11i1.15015PHEME – Zubiaga et al.: https://doi.org/10.6084/m9.figshare.4010619.v1📖 How to Cite This DatasetNwaiwu, S.; Jongsawat, N.; Tungkasthan, A. Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection. Appl. Sci. 2025, 15, 9498. https://doi.org/10.3390/app15179498

创建时间：

2025-07-11

5,000+

优质数据集

54 个

任务类型

进入经典数据集