Wikipedia-based datasets

Name: Wikipedia-based datasets
Creator: Wikipedia
License: No description available

Source: arXiv (indexed 2025-09-30)

Download link:

https://github.com/epfl-dlab/pti-candgen

Description:

This dataset is built from Wikipedia articles to evaluate candidate generation methods for low-resource languages. It includes comprehensive evaluation information for analyzing different candidate generation methods under two learning settings, zero-shot learning and joint learning. It covers 9 language pairs, each pairing a low-resource language with a high-, medium-, or low-resource hub language. The primary tasks are entity linking and candidate generation.
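The listing does not say which metric or file format the dataset uses. As a minimal sketch, assuming candidate generation is scored by recall@k (a common choice for this task, not confirmed by the source), an evaluation could look like the following. The function name `recall_at_k`, the mention IDs, and the entity IDs are all hypothetical and are not taken from the dataset.

```python
from typing import Dict, List


def recall_at_k(candidates: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of mentions whose gold entity is among the top-k ranked candidates."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for mention_id, entity in gold.items()
        if entity in candidates.get(mention_id, [])[:k]
    )
    return hits / len(gold)


# Toy data, made up for illustration only (not from the dataset).
gold = {"m1": "Q64", "m2": "Q90", "m3": "Q84"}
candidates = {
    "m1": ["Q64", "Q1022"],  # gold entity ranked first
    "m2": ["Q142", "Q90"],   # gold entity ranked second
    "m3": ["Q23306"],        # gold entity missing from the candidate list
}

print(recall_at_k(candidates, gold, k=1))
print(recall_at_k(candidates, gold, k=2))
```

Under this assumption, a mention counts as a hit only if the generator ranks its gold entity within the first k candidates, so comparing scores at several values of k shows how much a downstream entity linker could gain from a longer candidate list.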

Provider:

Wikipedia
