COMBINED DATASET
Source: DataCite Commons. Updated: 2026-01-06. Indexed: 2026-05-03.
Download link:
https://ieee-dataport.org/documents/combined-dataset
Description:
The dataset used in this study comprises 82,946 labeled news statements collected from multiple publicly available fact-checking and news verification sources focusing on Indian and Indic-language content. Each instance is annotated for binary classification, with labels Real (0) and Fake (1). The dataset exhibits a natural class imbalance, consisting of 53,713 Fake samples (64.76%) and 29,233 Real samples (35.24%).

The corpus spans a diverse set of languages, including English, Hindi, Tamil, Gujarati, Malayalam, Punjabi, Bengali, Telugu, Marathi, Nepali, and other low-resource languages. It also contains romanized and code-mixed text, reflecting realistic social media usage patterns in multilingual Indian settings. Language identifiers were retained to support language-wise evaluation.

Data from different sources were merged into a unified format, retaining only semantically meaningful fields: news text, label, and language. The dataset's scale, linguistic diversity, and presence of code-mixing make it suitable for evaluating multilingual transformer models for Indic fake news detection.
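The class imbalance reported in the description can be checked, and one common way of compensating for it can be sketched, from the stated counts alone. The snippet below is a minimal illustration, not part of the dataset or its authors' method: the function names are hypothetical, and the inverse-frequency weighting scheme is an assumption about how a user might handle the imbalance.

```python
# Class counts as stated in the description (label 0 = Real, label 1 = Fake).
counts = {0: 29_233, 1: 53_713}


def class_shares(counts):
    """Return each class's fraction of the total sample count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count).

    This weighting is an assumed choice for illustration; the source
    does not say how (or whether) the imbalance was handled.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


total = sum(counts.values())
shares = class_shares(counts)
weights = balanced_class_weights(counts)

for label in sorted(counts):
    print(f"label={label} share={shares[label]:.2%} weight={weights[label]:.3f}")
```

With these weights the minority class (Real) receives the larger weight, so each class contributes equally to a weighted loss; whether that is appropriate depends on the evaluation goal.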
Provider:
IEEE DataPort
Created:
2026-01-06



