MultiCheckWorthy (MultiCW) dataset

Name: MultiCheckWorthy (MultiCW) dataset
Creator: Zenodo
Published: 2025-09-04 08:32:46
License: No description available

Zenodo. Updated: 2025-09-04. Indexed: 2026-05-26.

Download link:

https://zenodo.org/doi/10.5281/zenodo.17019686

Description:

The MultiCheckWorthy (MultiCW) dataset is a balanced multilingual benchmark for check-worthy claim detection, covering 16 languages, 6 topical domains, and 2 writing styles. The dataset consists of 123,722 samples, evenly distributed between noisy and structured texts, with balanced representation of the check-worthy and non-check-worthy classes across all languages. Each claim is accompanied by its English translation, detected topic, writing style, language code, check-worthiness label, and the list of detected named entities. The dataset was composed from existing datasets and balanced by translating samples from those datasets and by adding samples collected from Wikipedia. The dataset is partitioned into training, validation, and test sets. In addition, we construct a separate out-of-distribution (OOD) set consisting of 4 other languages (it, mk, nl and my) to evaluate model generalization beyond the in-distribution data. Below is the number of samples included in each set:

Set                         Samples
Train                       86,691
Validation                  18,491
Test                        18,540
Out-of-distribution (OOD)   29,647
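
As a quick sanity check on the split sizes reported above, the following minimal Python sketch sums the train, validation, and test counts and computes each split's share of the in-distribution total. The counts are taken from the description; the names SPLITS, OOD_SAMPLES, REPORTED_TOTAL, and split_shares are illustrative only and are not part of the dataset or any official tooling.

```python
# Split sizes as reported in the dataset description.
SPLITS = {"train": 86_691, "validation": 18_491, "test": 18_540}
OOD_SAMPLES = 29_647       # separate out-of-distribution set (it, mk, nl, my)
REPORTED_TOTAL = 123_722   # total stated in the description


def split_shares(splits):
    """Return each split's fraction of the summed split sizes."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


total = sum(SPLITS.values())
shares = split_shares(SPLITS)

if __name__ == "__main__":
    # The three in-distribution splits should add up to the stated total.
    print("matches reported total:", total == REPORTED_TOTAL)
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

The three splits add up to the stated 123,722 samples, which works out to roughly a 70/15/15 partition. The 29,647 OOD samples are therefore counted on top of that total, not as part of it.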

Provided by:

Zenodo

Created:

2025-09-03
