sbordt/wikipedia-birthdays-sitelinks20
Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/sbordt/wikipedia-birthdays-sitelinks20
Description:
A dataset of 75,291 notable people with their names, birthdays, and Wikipedia/Wikidata popularity metadata. It is intended as a knowledge-probing / hallucination benchmark for language models: given a person's name, can the model recall their birth year? The dataset is split into training (55,291 rows), validation (10,000 rows), and test (10,000 rows) sets, each stratified by sitelinks bucket. Entities were extracted directly from the full Wikidata JSON dump and kept only if they are an instance of human, have a date-of-birth claim, have an English label, and have at least 20 sitelinks. The dataset is designed for probing factual recall with a simple question format ("In what year was <name> born?"), after which the model's answer is checked against the recorded birth year. It also provides the sitelinks distribution, model performance metrics, and caveats, such as the absence of any quality filtering on birthdays beyond year extraction.
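The probing setup described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset author's evaluation code: the row keys `name` and `birth_year`, the helper names, and the `ask_model` callable are all hypothetical, and the actual column names in the dataset may differ. The year extraction simply takes the first standalone four-digit number in the answer, so it does not handle BCE dates or years with fewer than four digits.

```python
import re
from typing import Callable, Iterable, Mapping, Optional

# Matches a standalone four-digit number (not part of a longer digit run).
YEAR_PATTERN = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def build_prompt(name: str) -> str:
    """Format the question used to probe the model."""
    return f"In what year was {name} born?"


def extract_year(answer: str) -> Optional[int]:
    """Return the first four-digit year found in the answer, or None."""
    match = YEAR_PATTERN.search(answer)
    return int(match.group(1)) if match else None


def birth_year_accuracy(
    rows: Iterable[Mapping[str, object]],
    ask_model: Callable[[str], str],
) -> float:
    """Fraction of rows for which the model's answer contains the right year.

    Assumes each row has hypothetical keys "name" (str) and "birth_year" (int).
    """
    rows = list(rows)
    if not rows:
        return 0.0
    correct = 0
    for row in rows:
        answer = ask_model(build_prompt(str(row["name"])))
        if extract_year(answer) == row["birth_year"]:
            correct += 1
    return correct / len(rows)


if __name__ == "__main__":
    # Toy rows for illustration only; not taken from the dataset.
    toy_rows = [
        {"name": "Ada Lovelace", "birth_year": 1815},
        {"name": "Alan Turing", "birth_year": 1912},
    ]
    # Stand-in for a real model call: always answers 1815.
    print(birth_year_accuracy(toy_rows, lambda prompt: "Born in 1815."))
```

In practice `ask_model` would wrap a call to the language model under test, and accuracy could be reported separately per sitelinks bucket to see how recall varies with popularity.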
Provider:
sbordt



