
Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection

Source: Figshare. Updated: 2025-09-24. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/dataset/Aggregated_Fake_News_Corpus_for_X-FRAME_Preprocessed_Multi-Domain_Dataset_for_Explainable_Misinformation_Detection/29539820/1
Description:
This dataset is associated with the research article "Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection".

This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with emphasis on explainability, cross-modal generalization, and robust performance.

📂 Included Datasets

The following original datasets are incorporated:
- LIAR – Political fact-checked claims (PolitiFact)
- FakeNewsNet – News + social media from PolitiFact, BuzzFeed, and GossipCop
- ISOT Fake News Dataset – Real/fake news articles
- WELFake – Automatically generated and real news
- FNC-1 (Fake News Challenge) – Stance-labeled news headline-body pairs
- Celebrity Rumors – Real/fake celebrity news
- FakeNewsAMT – Crowdsourced fake/real news headlines
- PHEME – Rumor cascades from Twitter

🔧 Preprocessing Steps

Each dataset was preprocessed using a unified pipeline:
- Text Cleaning: Token normalization, lowercasing, punctuation removal, URL and username stripping.
- Label Harmonization: All datasets were mapped to a binary schema: 1 = Fake, 0 = Real. Multi-class datasets (e.g., LIAR, FNC-1) were carefully re-labeled using rules aligned with original paper guidelines.
- Deduplication: Duplicate content across datasets was removed using fuzzy string matching and content hashing.
- Metadata Retention: Source dataset names, topic labels, and speaker/source fields were retained where available.
- Corpus Merging: All datasets were combined into a single .csv file with the following columns:
  - text: News content or claim
  - label: 1 (Fake) or 0 (Real)
  - source_dataset: Original dataset name
  - topic (if available)
  - speaker or source (if applicable)
  - Additional metadata (optional)

The final corpus contains 286,260 unique samples, balanced across news sources, domains, and platforms (formal vs. social media).

📘 Citation Guidelines

Please cite this aggregated corpus as:

Nwaiwu, S., Jongsawat, N., & Tungkasthan, A. (2025). Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection. Figshare. https://doi.org/[YOUR DOI]

In addition, users are encouraged to cite the original datasets individually. References:
- Wang (2017) – LIAR: https://doi.org/10.18653/v1/P17-2067
- Shu et al. – FakeNewsNet: https://doi.org/10.1145/3363574
- Ahmed et al. – ISOT: https://doi.org/10.48550/arXiv.1708.07104
- WELFake Dataset: https://doi.org/10.1109/TCSS.2021.3068519
- FNC-1: https://www.fakenewschallenge.org/
- Pérez-Rosas et al. – FakeNewsAMT: https://doi.org/10.18653/v1/C18-1287
- Zubiaga et al. – PHEME: https://doi.org/10.6084/m9.figshare.4010619.v1
- Horne & Adalı – Celebrity Rumors: https://doi.org/10.1609/icwsm.v11i1.15015
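
As a rough illustration of the preprocessing steps described above (text cleaning, binary label harmonization, and hash-based deduplication), the following is a minimal sketch and not the authors' actual pipeline. The function names, the regular expressions, and especially the LIAR label mapping are assumptions made for illustration; the description only says that multi-class labels were re-labeled "using rules aligned with original paper guidelines" and does not give the rules. The sketch covers exact duplicates after normalization only; the fuzzy string matching mentioned in the description is not shown.

```python
import hashlib
import re
import string

# Assumed patterns for URL and username stripping (not taken from the source).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# HYPOTHETICAL mapping of LIAR's six-way labels to the binary schema
# (1 = Fake, 0 = Real). The authors' real rules are not documented here.
LIAR_TO_BINARY = {
    "pants-fire": 1, "false": 1, "barely-true": 1,
    "half-true": 0, "mostly-true": 0, "true": 0,
}


def clean_text(text):
    """Strip URLs and @usernames, lowercase, remove punctuation, collapse spaces."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(text.split())


def merge_rows(rows):
    """Clean each row and drop rows whose cleaned text was already seen.

    Each input row is a dict with 'text', 'label', and 'source_dataset' keys,
    mirroring the column names listed in the description.
    """
    seen = set()
    merged = []
    for row in rows:
        cleaned = clean_text(row["text"])
        if not cleaned:
            continue  # nothing left after cleaning
        key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        merged.append({
            "text": cleaned,
            "label": row["label"],
            "source_dataset": row["source_dataset"],
        })
    return merged


sample = [
    {"text": "Breaking: X happened! https://example.com",
     "label": LIAR_TO_BINARY["false"], "source_dataset": "LIAR"},
    {"text": "breaking x happened", "label": 1, "source_dataset": "ISOT"},
]
corpus = merge_rows(sample)
```

In this sketch the first occurrence of a duplicate wins, so the retained source_dataset value depends on the order in which the datasets are merged; a real pipeline would need an explicit rule for that, and for duplicates whose labels disagree.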
Provided by:
Nwaiwu, Steve
Created:
2025-07-11