A Multilingual & Multimodal Text and Image Corpus Dataset for Political Misinformation
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/mzwyhj52tf
Description:
Our dataset is a richly annotated multimodal collection designed to support robust fake-news detection research. It consists of two complementary but separate components: an image directory and a text spreadsheet. The image directory is organized by folder, with each top-level folder named after a topic; within each topic folder, images are placed in real and fake subdirectories according to expert labeling. This layout makes it straightforward to load and process images for cross-modal testing or supervised learning. Text data, in contrast, are kept in a single Excel sheet in which each record is one piece of news. Four separate columns hold the title, the source, the full news report, and the real/fake indicator. Together, the two modalities cover a broad range of temporal and topical domains, including social-media posts, mainstream-media news reports, and election-related posts. This allows models to be trained on both linguistic aspects (sensational or objective tone, grammaticality, metadata quality) and visual aspects (original vs. photo-manipulated images). By combining a simple folder hierarchy for images with a richly annotated spreadsheet for text, the dataset is well specified, reproducible, and easy to feed into downstream machine-learning pipelines.
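The image layout described above (topic folder, then real and fake subfolders) can be indexed with a short script. The following is a minimal sketch, not the authors' loader: the function name `index_images`, the exact folder names "real" and "fake", the list of image extensions, and the 0/1 label encoding are all assumptions, since the description does not specify them.

```python
from pathlib import Path

# Assumed, not confirmed by the dataset description: the label folders are
# literally named "real" and "fake" (matched case-insensitively here), and
# images use one of these common extensions.
LABELS = {"real": 0, "fake": 1}
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".gif"}


def index_images(root):
    """Return (path, topic, label) tuples for <root>/<topic>/<real|fake>/<image>."""
    records = []
    for topic_dir in sorted(Path(root).iterdir()):
        if not topic_dir.is_dir():
            continue  # ignore stray files next to the topic folders
        for label_dir in sorted(topic_dir.iterdir()):
            label = LABELS.get(label_dir.name.lower())
            if label is None or not label_dir.is_dir():
                continue  # skip anything that is not a real/fake folder
            for image_path in sorted(label_dir.rglob("*")):
                if image_path.is_file() and image_path.suffix.lower() in IMAGE_SUFFIXES:
                    records.append((image_path, topic_dir.name, label))
    return records
```

Calling `index_images("path/to/images")` on the unpacked image directory would yield one tuple per image, which can then be split by topic or label. The text spreadsheet could be read separately, for example with `pandas.read_excel`; its file name and exact column headers are not given in the description, so they would need to be checked against the downloaded file.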
Created:
2025-11-17



