Provo Corpus
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/evgeniael/predict_next_word.git
Resource description:
The Provo Corpus consists of 55 short English passages drawn from sources such as news, fiction, and scientific writing, averaging about 50 words each. Human participants are shown the opening prefix of a passage and asked to predict the word that continues it. The corpus is also used to estimate target conditional probability distributions (CPDs) and to analyze how well a model's uncertainty is calibrated against human uncertainty. In total, the corpus contains 55 passages and 2687 prefixes; the task is word completion prediction.
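As a rough illustration of how a CPD might be estimated for a single prefix and compared with a model, the minimal sketch below counts human completions and measures the gap with total variation distance. The helper names, the toy responses, the model probabilities, and the choice of total variation distance are all assumptions made for illustration; none of them come from the corpus or the linked repository.

```python
from collections import Counter


def empirical_cpd(responses):
    """Estimate a next-word distribution from human completions of one prefix.

    Hypothetical helper: relative frequency of each lowercased response.
    """
    counts = Counter(word.lower() for word in responses)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two distributions given as dicts.

    Words missing from one distribution are treated as probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)


# Toy data, invented for illustration only (not taken from the corpus).
human_responses = ["dog", "dog", "cat", "Dog"]
model_cpd = {"dog": 0.5, "cat": 0.25, "bird": 0.25}

human_cpd = empirical_cpd(human_responses)
tvd = total_variation(human_cpd, model_cpd)
print(human_cpd, tvd)
```

A smaller distance means the model's distribution for that prefix is closer to the spread of human answers. Averaging over all prefixes would give one possible corpus-level summary.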



