PolyglotFakeFacts: A multilingual dataset of fake and real news across politics, security, and social domains_v2.0

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://data.mendeley.com/datasets/8yfrm6z9dx

Resource description:

PolyglotFakeFacts v2.0 is an updated and expanded version of the original dataset (V1, DOI: 10.17632/gff8bmr4ff.1), published in February 2026. This version incorporates newly obtained data, providing a more comprehensive multilingual resource for research on detecting fake and real news. The dataset now comprises 10,206 articles in total (4,912 fake news entries and 5,294 real news entries), collected from online sources across 18 languages: Arabic, Armenian, Azerbaijani, Bulgarian, Czech, English, Finnish, French, Georgian, German, Hungarian, Italian, Lithuanian, Romanian, Russian, Slovak, Spanish, and Swedish. Each article is available both in its original language and in an English translation. To ensure balance, real news was collected from sources covering each of the 18 languages represented in the fake news subset.

PolyglotFakeFacts is designed to support research on the detection of fake and real news across diverse domains such as politics, geopolitics, security, social issues, and military affairs. Fake news articles were sourced from outlets flagged by EUvsDisinfo, the flagship project of the East StratCom Task Force within the European External Action Service (EEAS), while real news was curated from official and editorially credible sources in each represented language.

The research hypothesis underpinning this dataset is that linguistic and contextual markers of misinformation can be systematically identified across multiple languages, enabling the development of more robust and generalizable fake news detection models. Among the key findings is that fake news articles often display recurring linguistic and structural patterns regardless of language, while real news tends to follow more standardized journalistic conventions. This suggests that multilingual approaches to fake news detection could leverage both cross-linguistic similarities and domain-specific features.
Each entry in the dataset is structured around ten fields: gathering date, news date, URL, source name, language, keywords, original headline, original text, English-translated text, and label (fake/non-fake). All samples were pre-processed to ensure consistent formatting and to remove duplicates. Researchers can use this dataset to train and evaluate machine learning and deep learning models for fake news classification, to perform cross-lingual and multilingual comparative studies, and to investigate the linguistic, semantic, and thematic characteristics of misinformation. By providing a curated, multilingual, and domain-diverse resource, PolyglotFakeFacts enables the community to develop more transparent, explainable, and resilient AI models for combating online misinformation.
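
As an illustration of the ten-field record structure described above, the following is a minimal Python sketch of how an entry might be represented and checked before model training. The field names (`gathering_date`, `english_text`, and so on), the helper functions, and the placeholder rows are assumptions made for this example; the description does not state the exact column headers or file format of the released data, so these would need to be adapted to the actual files.

```python
from collections import Counter
from dataclasses import dataclass, fields


# Hypothetical field names: the description lists ten fields but not the
# exact column headers used in the released files.
@dataclass(frozen=True)
class Article:
    gathering_date: str
    news_date: str
    url: str
    source_name: str
    language: str
    keywords: str
    original_headline: str
    original_text: str
    english_text: str
    label: str  # "fake" or "non-fake", as in the description


VALID_LABELS = {"fake", "non-fake"}


def parse_row(row: dict) -> Article:
    """Build an Article from a dict, rejecting missing fields or unknown labels."""
    names = [f.name for f in fields(Article)]
    missing = [n for n in names if n not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    values = {n: str(row[n]).strip() for n in names}
    values["label"] = values["label"].lower()
    if values["label"] not in VALID_LABELS:
        raise ValueError(f"unknown label: {values['label']!r}")
    return Article(**values)


def count_by_language(articles) -> Counter:
    """Count articles per (language, label) pair to inspect class balance."""
    return Counter((a.language, a.label) for a in articles)


def make_row(language: str, label: str) -> dict:
    """Create a placeholder row; the values are invented, not dataset content."""
    return {
        "gathering_date": "2026-01-05",
        "news_date": "2026-01-02",
        "url": "https://example.org/article",
        "source_name": "Example Source",
        "language": language,
        "keywords": "placeholder",
        "original_headline": "Placeholder headline",
        "original_text": "Placeholder text",
        "english_text": "Placeholder text",
        "label": label,
    }


rows = [
    make_row("English", "fake"),
    make_row("English", "non-fake"),
    make_row("Czech", "Fake "),  # stray case and whitespace are normalized
]
articles = [parse_row(r) for r in rows]
counts = count_by_language(articles)
print(counts)

# Totals reported in the description: the two classes sum to the full dataset.
FAKE_TOTAL, REAL_TOTAL, DATASET_TOTAL = 4912, 5294, 10206
assert FAKE_TOTAL + REAL_TOTAL == DATASET_TOTAL
print(f"fake share: {FAKE_TOTAL / DATASET_TOTAL:.1%}")
```

Counting by (language, label) pair rather than by label alone is a deliberate choice here: because real news was collected to cover each of the 18 languages in the fake news subset, a per-language view is one simple way to confirm that balance holds in whatever subset is loaded.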

Created:

2026-02-17