Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection

Source: DataCite Commons. Updated: 2025-09-24. Indexed: 2025-09-08.
Download link:
https://figshare.com/articles/dataset/Aggregated_Fake_News_Corpus_for_X-FRAME_Preprocessed_Multi-Domain_Dataset_for_Explainable_Misinformation_Detection/29539820
Description:
This dataset is associated with the research article titled "Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection".

This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with emphasis on explainability, cross-modal generalization, and robust performance.

🗂️ Dataset Contents

This repository contains the following resources:

Aggregated Raw Corpus (aggregated_raw.csv)
- 286,260 samples across 8 datasets.
- Binary labels (1 = Fake, 0 = Real).
- Includes metadata: source dataset, topic (if available), speaker/source, etc.

Preprocessed Text Corpus (aggregated_cleaned.csv)
- Includes a standardized and cleaned cleaned_text column.
- Text normalization applied using spaCy (lowercasing, lemmatization, punctuation/URL/user removal).

Fully Encoded Feature Matrix (xframe_features_encoded.csv)
- 104 structured features derived from communication theory and media psychology.
- Includes source encoding, speaker credibility, social engagement, sentiment, subjectivity, sensationalism, and readability scores.
- All numerical features scaled to [0, 1]; categorical features one-hot encoded.

Data Splits
- train.csv, val.csv, test.csv: stratified splits of the cleaned and encoded data.

Feature Metadata (feature_description.pdf)
- Documentation of all 104 features with descriptions, data sources, and rationales.

🔧 Preprocessing Overview

To ensure robust and generalizable modeling, the following standardized pipeline was applied:

- Text Preprocessing: Cleaned using spaCy, lowercased, lemmatized, and stripped of stopwords, URLs, and usernames.
- Label Mapping: Datasets with multiclass labels (e.g., LIAR, FNC-1) were mapped to a unified binary schema using theory-informed rules. 1 = Fake includes false, pants-on-fire, disagree, etc.; 0 = Real includes true, agree, mostly-true.
- Deduplication: Removed near-duplicate entries across datasets using fuzzy string matching and content hashing.
- Feature Engineering:
  - Source credibility features (e.g., speaker credibility from LIAR).
  - Social context (e.g., tweet volume, user mentions).
  - Framing indicators (e.g., sentiment, subjectivity, sensationalism, readability).
- Feature Encoding: One-hot encoding for categorical attributes, Min-Max scaling for numerical features.

📚 Original Data Sources

This aggregated corpus was derived from the following datasets. Please cite them individually alongside this collection:

- LIAR – Wang (2017): https://doi.org/10.18653/v1/P17-2067
- FakeNewsNet (PolitiFact, BuzzFeed, GossipCop) – Shu et al.: https://doi.org/10.1145/3363574
- ISOT – Ahmed et al.: https://doi.org/10.48550/arXiv.1708.07104
- WELFake – Verma et al.: https://doi.org/10.1109/TCSS.2021.3068519
- FNC-1 – https://www.fakenewschallenge.org/
- FakeNewsAMT – Pérez-Rosas et al.: https://doi.org/10.18653/v1/C18-1287
- Celebrity Rumors – Horne & Adalı: https://doi.org/10.1609/icwsm.v11i1.15015
- PHEME – Zubiaga et al.: https://doi.org/10.6084/m9.figshare.4010619.v1

📖 How to Cite This Dataset

Nwaiwu, S.; Jongsawat, N.; Tungkasthan, A. Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection. Appl. Sci. 2025, 15, 9498. https://doi.org/10.3390/app15179498
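
Illustrative sketch: label mapping

The label mapping step described above (multiclass labels collapsed to 1 = Fake, 0 = Real) could look roughly like the following. This is a minimal sketch, not the authors' code. It covers only the label names the description lists explicitly; the description's "etc." and the full theory-informed rules are not specified, so any other label is deliberately left unmapped. The names `FAKE_LABELS`, `REAL_LABELS`, and `to_binary` are hypothetical.

```python
from typing import Optional

# Only the label names listed in the dataset description are mapped here.
# The authors' complete rule set is not given, so this is an assumption-based
# sketch: any label outside these two sets returns None.
FAKE_LABELS = {"false", "pants-on-fire", "disagree"}
REAL_LABELS = {"true", "agree", "mostly-true"}


def to_binary(label: str) -> Optional[int]:
    """Map a source label to 1 (Fake) or 0 (Real); None if it is not listed."""
    key = label.strip().lower()
    if key in FAKE_LABELS:
        return 1
    if key in REAL_LABELS:
        return 0
    return None


if __name__ == "__main__":
    for raw in ["pants-on-fire", "Mostly-True", "half-true"]:
        print(raw, "->", to_binary(raw))
```

Returning `None` for unlisted labels keeps the sketch from guessing how intermediate classes were handled in the published corpus.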
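
Illustrative sketch: feature encoding

The feature encoding step (Min-Max scaling of numerical features to [0, 1] and one-hot encoding of categorical attributes) can be illustrated with plain Python. This is a sketch under stated assumptions: the helper names and the example values are hypothetical and are not taken from xframe_features_encoded.csv, and the description does not say whether scaling statistics were computed on the full corpus or on the training split only.

```python
from typing import Dict, List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values: Sequence[str]) -> Dict[str, List[int]]:
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    readability = [10.0, 20.0, 30.0]
    source = ["LIAR", "ISOT", "LIAR"]
    print(min_max_scale(readability))
    print(one_hot(source))
```

When applying this kind of scaling to new data, a common precaution is to reuse the minimum and maximum from the training split so that validation and test rows do not influence the scaling.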
Provider:
figshare
Created:
2025-07-11