Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection

Name: Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection
Creator: figshare
Published: 2025-09-24 21:02:34
License: No description available

DataCite Commons (updated 2025-09-24, indexed 2025-09-08)

Download link:

https://figshare.com/articles/dataset/Aggregated_Fake_News_Corpus_for_X-FRAME_Preprocessed_Multi-Domain_Dataset_for_Explainable_Misinformation_Detection/29539820

Resource description:

This dataset is associated with the research article titled: "Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection"

This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with emphasis on explainability, cross-modal generalization, and robust performance.

🗂️ Dataset Contents

This repository contains the following resources:

- Aggregated Raw Corpus (<code>aggregated_raw.csv</code>)
  - 286,260 samples across 8 datasets.
  - Binary labels (<code>1 = Fake</code>, <code>0 = Real</code>).
  - Includes metadata: source dataset, topic (if available), speaker/source, etc.
- Preprocessed Text Corpus (<code>aggregated_cleaned.csv</code>)
  - Includes standardized and cleaned <code>cleaned_text</code> column.
  - Text normalization applied using SpaCy (lowercasing, lemmatization, punctuation/URL/user removal).
- Fully Encoded Feature Matrix (<code>xframe_features_encoded.csv</code>)
  - 104 structured features derived from communication theory and media psychology.
  - Includes source encoding, speaker credibility, social engagement, sentiment, subjectivity, sensationalism, and readability scores.
  - All numerical features scaled to [0, 1]; categorical features one-hot encoded.
- Data Splits
  - <code>train.csv</code>, <code>val.csv</code>, <code>test.csv</code>: Stratified splits of the cleaned and encoded data.
- Feature Metadata (<code>feature_description.pdf</code>)
  - Documentation of all 104 features with descriptions, data sources, and rationales.

🔧 Preprocessing Overview

To ensure robust and generalizable modeling, the following standardized pipeline was applied:

- Text Preprocessing: Cleaned using SpaCy, lowercased, lemmatized, and stripped of stopwords, URLs, and usernames.
- Label Mapping: Datasets with multiclass labels (e.g., LIAR, FNC-1) were mapped to a unified binary schema using theory-informed rules. <code>1 = Fake</code> includes false, pants-on-fire, disagree, etc.; <code>0 = Real</code> includes true, agree, mostly-true.
- Deduplication: Removed near-duplicate entries across datasets using fuzzy string matching and content hashing.
- Feature Engineering:
  - Source credibility features (e.g., speaker credibility from LIAR).
  - Social context (e.g., tweet volume, user mentions).
  - Framing indicators (e.g., sentiment, subjectivity, sensationalism, readability).
- Feature Encoding: One-hot encoding for categorical attributes, Min-Max scaling for numerical features.

📚 Original Data Sources

This aggregated corpus was derived from the following datasets. Please cite them individually alongside this collection:

- LIAR – Wang (2017): https://doi.org/10.18653/v1/P17-2067
- FakeNewsNet (PolitiFact, BuzzFeed, GossipCop) – Shu et al.: https://doi.org/10.1145/3363574
- ISOT – Ahmed et al.: https://doi.org/10.48550/arXiv.1708.07104
- WELFake – Verma et al.: https://doi.org/10.1109/TCSS.2021.3068519
- FNC-1 – https://www.fakenewschallenge.org/
- FakeNewsAMT – Pérez-Rosas et al.: https://doi.org/10.18653/v1/C18-1287
- Celebrity Rumors – Horne & Adalı: https://doi.org/10.1609/icwsm.v11i1.15015
- PHEME – Zubiaga et al.: https://doi.org/10.6084/m9.figshare.4010619.v1

📖 How to Cite This Dataset

Nwaiwu, S.; Jongsawat, N.; Tungkasthan, A. Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection. Appl. Sci. 2025, 15, 9498. https://doi.org/10.3390/app15179498
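
Illustrative sketch (not part of the original record): the label mapping, content-hash deduplication, and feature encoding steps named in the Preprocessing Overview could look roughly like the plain-Python code below. This is not the authors' pipeline. All function names are hypothetical, only the label strings explicitly listed in the description are mapped (labels covered by "etc." are left unmapped rather than guessed), and the hash step catches only duplicates that match after lowercasing and whitespace collapsing, so it does not reproduce the fuzzy string matching the description also mentions.

```python
import hashlib
import re

# Label strings named in the dataset description. Any other label is left
# unmapped (None) because the description does not say how it was handled.
FAKE_LABELS = {"false", "pants-on-fire", "disagree"}
REAL_LABELS = {"true", "agree", "mostly-true"}


def to_binary_label(raw_label):
    """Map a source label to 1 (fake) or 0 (real); return None if unlisted."""
    key = raw_label.strip().lower()
    if key in FAKE_LABELS:
        return 1
    if key in REAL_LABELS:
        return 0
    return None


def content_hash(text):
    """Hash text after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_hash_duplicates(texts):
    """Keep the first occurrence of each distinct content hash."""
    seen = set()
    unique = []
    for text in texts:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def min_max_scale(values):
    """Scale numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(categories):
    """One-hot encode category strings, with columns in sorted order."""
    columns = sorted(set(categories))
    rows = [[1 if c == col else 0 for col in columns] for c in categories]
    return columns, rows
```

In practice the same steps would more likely be applied column-wise to the CSV files with a dataframe library; the standard-library version above is only meant to make the logic of each step explicit.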

Provider:

figshare

Created:

2025-07-11
