MultiCheckWorthy (MultiCW) dataset
Source: Zenodo. Updated: 2025-09-04. Indexed: 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.17019686
Description:
The MultiCheckWorthy (MultiCW) dataset is a balanced multilingual benchmark dataset for check-worthy claim detection, covering 16 languages, 6 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy and structured texts, with balanced representation of the check-worthy and non-check-worthy classes across all languages. Each claim is accompanied by its English translation, detected topic, writing style, language code, check-worthiness label, and a list of detected named entities. The dataset was assembled from existing datasets and balanced by translating samples from those datasets and by adding samples collected from Wikipedia.
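To make the per-claim fields listed above concrete, here is a minimal sketch of one record as a Python dataclass. The class name, every attribute name, the 0/1 label encoding, and the example values are assumptions made for illustration only; the actual column names and encodings should be taken from the files on Zenodo.

```python
from dataclasses import dataclass, field
from typing import List

# The two writing styles named in the description.
WRITING_STYLES = {"noisy", "structured"}


@dataclass
class MultiCWSample:
    # All attribute names below are hypothetical, not the dataset's real column names.
    text: str       # claim in its original language
    text_en: str    # English translation of the claim
    topic: str      # detected topical domain
    style: str      # writing style: "noisy" or "structured"
    lang: str       # language code
    label: int      # assumed encoding: 1 = check-worthy, 0 = non-check-worthy
    named_entities: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject records that fall outside the values described in the text.
        if self.style not in WRITING_STYLES:
            raise ValueError(f"unknown writing style: {self.style!r}")
        if self.label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {self.label!r}")


# An invented record, not taken from the dataset.
example = MultiCWSample(
    text="The unemployment rate fell to 5 percent last year.",
    text_en="The unemployment rate fell to 5 percent last year.",
    topic="economy",  # hypothetical topic value
    style="structured",
    lang="en",
    label=1,
)
```

Validating the style and label on construction is one way to catch mismatched encodings early when loading the real files.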
The dataset is partitioned into training, validation, and test sets. In addition, we construct a separate out-of-distribution (OOD) set consisting of 4 other languages (it, mk, nl, and my) to evaluate model generalization beyond the in-distribution data. Below is the number of samples included in each set:
Set                          Samples
Train                        86,691
Validation                   18,491
Test                         18,540
Out-of-distribution (OOD)    29,647
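As a quick sanity check on the figures above, the following sketch sums the three in-distribution splits and computes each split's share of the total. The variable and function names are illustrative; only the counts come from the listing.

```python
# Split sizes as listed above.
SPLITS = {"train": 86_691, "validation": 18_491, "test": 18_540}
OOD_SIZE = 29_647  # reported separately from the three splits


def split_shares(splits):
    """Return each split's fraction of the combined sample count."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


total = sum(SPLITS.values())
shares = split_shares(SPLITS)

print(f"in-distribution total: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"OOD samples (separate): {OOD_SIZE:,}")
```

The three splits add up to the 123,722 samples stated in the description, which corresponds to roughly a 70/15/15 partition; the 29,647 OOD samples are in addition to that total.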
Provider:
Zenodo
Created:
2025-09-03



